Shamash - Autoscaling for Dataproc

Shamash is a service for autoscaling Cloud DataProc on Google Cloud Platform(GCP).

Shamash was the god of justice in Babylonia and Assyria, just like the Shamash auto-scaler whose job is to maintain a tradeoff between costs and performance.

Background

Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days take seconds or minutes instead, and you pay only for the resources you use (with per-second billing).

Cloud Dataproc also easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics, and machine learning.

Due to different usage patterns (e.g. high load during work hours, no usage overnight), the cluster may become either underprovisioned (user experience lousy performance) or overprovision (cluster is idle causing a waste of resources and unnecessary costs).

However, while autoscaling has become state-of-the-art for applications in GCP, currently there exists no out-of-the-box solution for autoscaling of Dataproc clusters.

The Shamash autoscaling tool actively monitors the performance of Dataproc clusters and automatically scales the cluster up and down where appropriate. Shamash adds and removes nodes based on the current load of the cluster.

We built Shamash on top of Google App Engine utilizing a serverless architecture.

Highlights

Serverless operation
Support multiple clusters (each with his own configuration)
Works without any change to the cluster
Low TOC

Installation

Shamash requires both Google Compute Engine, Google Cloud Pub/Sub, Dataproc API and Stackdriver APIs to be enabled to operate properly.

To enable an API for your project:

Go to the Cloud Platform Console.
From the projects list, select a project or create a new one.
If the APIs & services page isn't already open, open the console left side menu and choose APIs & services, and then select Library.
Click the API you want to enable. ...
Click ENABLE.

Install dependencies

pip install -r requirements.txt -t lib

Deploy

./deploy.sh project-id

Configuration

Cluster — Google Dataproc Cluster Name
Region — Cluster Region
PreemptiblePct — The ratio of preemptible workers in Dataproc cluster
ContainerPendingRatio — The ratio of pending containers allocated to trigger scale out event of the cluster (UpContainerPendingRatio = yarn-containers-pending / yarn-containers-allocated). If yarn-containers-allocated = 0, then ContainerPendingRatio = yarn-containers-pending.
UpYARNMemAvailPct — The percentage of remaining memory available to YARN to trigger
DownYARNMemAvailePct — The percentage of remaining memory available to YARN to trigger scale down

YARNMemAvailePct is calculated using the following formula yarn-memory-mb-available + yarn-memory-mb-allocated = Total cluster memory. YARNMemAvailePct = yarn_memory_mb_available / Total Cluster Memory
MinInstances - The least number of workers the cluster will contain, even if the target is not met
MaxInstances — The largest number of workers allowed, even if the target is exceeded

Architecture

Flow

Every 2 minutes a cron job calls /tasks/check-load which create a task per cluster in the task queue.
Each task is requesting /monitor with the cluster name as a parameter.
/monitor calls check_load()
check_load() get the data from the cluster and publishes it to pub/subpubsub.publish(pubsub_client, msg, MONITORING_TOPIC)
/get_monitoring_data is invoked when there is a new message in the monitoring topic and calls /should_scale
should_scale decide if the cluster has to be rescaled. If yes, trigger_scaling which put data into pub/sub scaling topic
/scale invokes, gets the message from pub/sub and calls do_scale
Once the calculations are done Shamash will patch the cluster with a new number of nodes.

Visualization

We didn’t build any visualization into Shamash, however, since all metrics are reported to Stackdriver, you can build a dashboard that will show you the metrics which Shamash is tracking, as well as the number of nodes, number of workers and preemptible workers.

The metrics names are: ContainerPendingRatio YARNMemoryAvailablePercentage YarnNodes Workers PreemptibleWorkers

Local Development

For local development run:

dev_appserver.py --log_level=debug app.yaml

You will need a local config.json file in the following structure:

{ "project": "project-id" }

Contributing

We invite everyone to take part in improving it by submitting issues and pull requests.

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

doitintl / shamash

Programming Languages

Labels

Projects that are alternatives of or similar to shamash