Today, data comes in many shapes and sizes. When you are trying to extract insights from "big data", a Spark cluster can provide significant lift by efficiently distributing computation across nodes. Standing up a Spark cluster on your own, however, is no easy task and can be a source of frustration for Data Scientists and DevOps engineers alike. Skafos provides a configuration-free, auto-scaling Spark cluster that works underneath the Data Engine to help wrangle the data in your pipeline. All that is required is the inclusion of the spark-cluster AddOn in your configuration file. When your job runs, the Spark cluster automatically scales up based on your job's workload.
Spark is a cluster computing framework for data processing, which means a group of server instances operates as a single unit to perform tasks efficiently. A Spark cluster has a few important features that provide that extra boost for your data processing:
Speed - Data is split into partitions, creating a parallelized, distributed processing system that reduces network traffic.
Lazy Evaluation - Lazy evaluation contributes to speed by evaluating only when necessary. Computations are added to a Directed Acyclic Graph (DAG) representing task stages, which is executed only when results are requested; see the sketch after this list.
Real-Time Processing - Spark performs in-memory computation, decreasing lag time.
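To make the lazy-evaluation point concrete, here is a minimal PySpark sketch. It assumes a locally built SparkSession for illustration; on Skafos the cluster is managed for you. Note how the transformations only add stages to the DAG, and nothing executes until an action is called:

```python
from pyspark.sql import SparkSession

# A minimal sketch, assuming a local SparkSession for demonstration.
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("c", 3)], ["key", "value"]
)

# Transformations are lazy: nothing is computed here. Spark only
# appends these steps to the DAG of task stages.
doubled = df.withColumn("value", df["value"] * 2)
filtered = doubled.filter(doubled["value"] > 2)

# An action triggers execution of the entire DAG at once.
print(filtered.count())  # prints 2

spark.stop()
```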
The Spark cluster works underneath the Data Engine, making it accessible to your jobs for data wrangling. Before you can start using this tool, you need to define the spark-cluster AddOn in your project's metis.config.yml file. Queries and data transformations on the Spark cluster go through the Data Engine, which is accessed via the SDK.
```yaml
project_token: <project_token>
name: my_new_project
jobs:
  - job_id: <job_id>
    language: python
    name: Main
    entrypoint: "main.py"
    add-ons:
      - name: spark-cluster
```
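With the AddOn in place, your job can reach the Spark cluster through the Data Engine. The sketch below is illustrative only: the skafossdk package name, the Skafos entry point, and the engine.query call are assumptions based on the description above, so check the SDK reference for the exact API.

```python
# A hypothetical sketch of querying through the Data Engine from a job.
# The names below (skafossdk, Skafos, engine.query) are assumptions,
# not a confirmed API; consult the SDK documentation.
from skafossdk import Skafos

ska = Skafos()

# The Data Engine hands the query to the spark-cluster AddOn, which
# scales up as needed to execute it.
result = ska.engine.query(
    "SELECT key, COUNT(*) AS n FROM my_table GROUP BY key"
).result()
print(result)
```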
All Projects within the same organization that use the spark-cluster AddOn share the cluster's resources. As demand grows, the Spark cluster automatically scales up to a maximum of 140 GB of memory and 40 CPUs. If your project needs more resources, just ask!