Metis Machine's Skafos

Machine Learning Delivered. A Machine Learning deployment platform built to unite Data Science, DevOps, and Engineering.

Welcome to the Metis Machine documentation hub. You'll find comprehensive guides and documentation to help you start working with Metis Machine's Skafos platform as quickly as possible, as well as support if you get stuck. Fire it up!

Spark Cluster

Today, data comes in many shapes and sizes. When you are trying to leverage insights from “big data”, a Spark cluster can provide significant lift by efficiently distributing computation across nodes. Standing up a Spark cluster on your own, however, is not an easy task and can be a source of frustration for both Data Scientists and DevOps engineers. Skafos provides a configuration-free, auto-scaling Spark cluster that works underneath the Data Engine to help wrangle the data in your pipeline. All that is required is the inclusion of the spark-cluster AddOn in your configuration file. When your job runs, the Spark cluster automatically scales up based on the job's workload.

What is a Spark Cluster?

Spark is a cluster-computing framework for data processing: a group of server instances operates as a single unit to complete tasks efficiently. A Spark cluster has a few important features that give your data processing an extra boost:

  • Speed - Data is partitioned and processed in parallel across the cluster's nodes, reducing network traffic and overall runtime.

  • Lazy Evaluation - Lazy evaluation contributes to speed by computing results only when they are needed. Transformations are added to a Directed Acyclic Graph (DAG) of task stages that is executed on request (see the sketch after this list).

  • Real Time Processing - Spark performs computations in memory, decreasing lag time.
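
To make lazy evaluation concrete, here is a minimal, standalone PySpark sketch. It assumes only a local pyspark installation and is not Skafos-specific code: transformations such as filter and withColumn merely add stages to the DAG, and nothing runs until an action such as count is called.

from pyspark.sql import SparkSession

# Start a Spark session. On Skafos the spark-cluster AddOn supplies the
# cluster; this demo also runs against a local pyspark installation.
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])

# Transformations are lazy: these lines only add stages to the DAG.
filtered = df.filter(df.value > 1)
doubled = filtered.withColumn("value", filtered.value * 2)

# An action triggers execution of the accumulated DAG in one pass.
print(doubled.count())  # 2

spark.stop()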

Usage

The Spark cluster works underneath the Data Engine to wrangle data on behalf of your jobs. Before you can start using it, you need to define the spark-cluster AddOn in your project's metis.config.yml file. Performing queries and transforming data on the cluster then goes through the Data Engine, accessed via the SDK.

project_token: <project_token>
name: my_new_project
jobs: 
  - job_id: <job_id>
    language: python
    name: Main
    entrypoint: "main.py"
add-ons:
  - name: spark-cluster
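
Once the AddOn is defined, a job reaches the cluster through the Data Engine in the SDK. The snippet below is a hypothetical sketch only: the import path skafossdk, the Skafos entry point, and the engine.query method are assumptions for illustration and should be checked against the SDK reference.

# Hypothetical sketch: the names below are assumptions, not confirmed Skafos API.
from skafossdk import Skafos

ska = Skafos()

# Work submitted through the Data Engine runs on the shared spark-cluster,
# which scales up automatically with the job's workload.
result = ska.engine.query("SELECT user_id, COUNT(*) FROM events GROUP BY user_id")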

Spark Cluster Available Resources

All Projects within the same organization that include the spark-cluster AddOn share the cluster's resources. As demand grows, the cluster scales automatically, up to a maximum of 140 GB of memory and 40 CPUs. If your project needs more resources, just ask!