doitintl / shamash

Licence: MIT License
Autoscaling for Google Cloud Dataproc


Shamash - Autoscaling for Dataproc

Shamash is a service for autoscaling Cloud Dataproc clusters on Google Cloud Platform (GCP).

Blog Post

Shamash was the god of justice in Babylonia and Assyria, much like the Shamash auto-scaler, whose job is to maintain a fair tradeoff between cost and performance.

Background

Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days take seconds or minutes instead, and you pay only for the resources you use (with per-second billing).

Cloud Dataproc also easily integrates with other Google Cloud Platform (GCP) services, giving you a powerful and complete platform for data processing, analytics, and machine learning.

Due to varying usage patterns (e.g. high load during work hours, no usage overnight), a cluster may become either under-provisioned (users experience poor performance) or over-provisioned (the cluster sits idle, wasting resources and incurring unnecessary costs).

However, while autoscaling has become standard practice for applications on GCP, there is currently no out-of-the-box solution for autoscaling Dataproc clusters.

The Shamash autoscaling tool actively monitors the performance of Dataproc clusters and automatically scales the cluster up and down where appropriate. Shamash adds and removes nodes based on the current load of the cluster.

We built Shamash on top of Google App Engine utilizing a serverless architecture.

Highlights

  • Serverless operation
  • Supports multiple clusters (each with its own configuration)
  • Works without any changes to the cluster
  • Low TCO (total cost of ownership)

Installation

Shamash requires the Google Compute Engine, Cloud Pub/Sub, Cloud Dataproc, and Stackdriver APIs to be enabled in order to operate properly.

To enable an API for your project:

  1. Go to the Cloud Platform Console.
  2. From the projects list, select a project or create a new one.
  3. If the APIs & services page isn't already open, open the console left side menu and choose APIs & services, and then select Library.
  4. Click the API you want to enable.
  5. Click ENABLE.
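
The same APIs can also be enabled from the command line. A sketch using the gcloud CLI (the service names reflect current GCP naming and are an assumption, not taken from the Shamash docs):

```shell
# Enable the APIs Shamash depends on (run once per project).
gcloud services enable \
    compute.googleapis.com \
    pubsub.googleapis.com \
    dataproc.googleapis.com \
    monitoring.googleapis.com \
    --project=project-id
```
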
Install dependencies

pip install -r requirements.txt -t lib

Deploy

./deploy.sh project-id

Configuration

  • Cluster — Google Dataproc Cluster Name

  • Region — Cluster Region

  • PreemptiblePct — The ratio of preemptible workers in Dataproc cluster

  • ContainerPendingRatio — The ratio of pending to allocated YARN containers that triggers a scale-out event (ContainerPendingRatio = yarn-containers-pending / yarn-containers-allocated). If yarn-containers-allocated = 0, then ContainerPendingRatio = yarn-containers-pending.

  • UpYARNMemAvailPct — The percentage of memory available to YARN below which a scale up is triggered

  • DownYARNMemAvailePct — The percentage of memory available to YARN above which a scale down is triggered

    YARNMemAvailePct is calculated using the following formula: yarn-memory-mb-available + yarn-memory-mb-allocated = total cluster memory, and YARNMemAvailePct = yarn_memory_mb_available / total cluster memory.

  • MinInstances — The minimum number of workers the cluster will contain, even if the target is not met

  • MaxInstances — The maximum number of workers allowed, even if the target is exceeded
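
The thresholds above can be combined into a scaling decision roughly as follows. This is a minimal Python sketch, not Shamash's actual code: the Metrics tuple, function names, and default threshold values are illustrative, while the formulas follow the definitions above.

```python
from collections import namedtuple

# Raw YARN metrics sampled from the cluster (hypothetical container type).
Metrics = namedtuple("Metrics", "pending allocated mem_available mem_allocated")

def container_pending_ratio(m):
    """yarn-containers-pending / yarn-containers-allocated (pending count if allocated == 0)."""
    if m.allocated == 0:
        return float(m.pending)
    return m.pending / m.allocated

def yarn_mem_available_pct(m):
    """yarn_memory_mb_available / total cluster memory, as a percentage."""
    total = m.mem_available + m.mem_allocated
    return 100.0 * m.mem_available / total if total else 100.0

def decide(m, up_pending_ratio=0.75, up_mem_pct=15, down_mem_pct=75):
    """Return 'up', 'down', or None. Threshold defaults are illustrative only."""
    if container_pending_ratio(m) > up_pending_ratio:
        return "up"                       # too many containers waiting
    if yarn_mem_available_pct(m) < up_mem_pct:
        return "up"                       # memory pressure
    if yarn_mem_available_pct(m) > down_mem_pct:
        return "down"                     # mostly idle
    return None                           # within the target band
```

In practice the chosen worker count would then be clamped between MinInstances and MaxInstances before the cluster is patched.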

Architecture

Flow

  • Every 2 minutes a cron job calls /tasks/check-load, which creates a task per cluster in the task queue.
  • Each task requests /monitor with the cluster name as a parameter.
  • /monitor calls check_load().
  • check_load() gets the data from the cluster and publishes it to Pub/Sub: pubsub.publish(pubsub_client, msg, MONITORING_TOPIC).
  • /get_monitoring_data is invoked when there is a new message in the monitoring topic and calls should_scale.
  • should_scale decides whether the cluster has to be rescaled. If so, it calls trigger_scaling, which puts data into the Pub/Sub scaling topic.
  • /scale is invoked, gets the message from Pub/Sub, and calls do_scale.
  • Once the calculations are done, Shamash patches the cluster with the new number of nodes.
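
The flow above can be sketched end to end with Pub/Sub replaced by an in-memory queue. The names check_load, should_scale, and do_scale mirror the steps above; the message payloads, sample metric value, and everything else are hypothetical:

```python
from collections import defaultdict

topics = defaultdict(list)   # stand-in for the Pub/Sub monitoring/scaling topics
patched = {}                 # cluster -> scaling direction applied

def publish(topic, msg):
    topics[topic].append(msg)

def check_load(cluster):
    # In Shamash this reads YARN metrics from the cluster; we fake a sample.
    publish("monitoring", {"cluster": cluster, "pending_ratio": 2.0})

def should_scale(msg):
    # Decide whether to rescale; if so, hand off via the scaling topic.
    if msg["pending_ratio"] > 1.0:
        publish("scaling", {"cluster": msg["cluster"], "direction": "up"})

def do_scale(msg):
    # Here Shamash would patch the Dataproc cluster's worker count.
    patched[msg["cluster"]] = msg["direction"]

# cron -> one task per cluster -> /monitor -> should_scale -> /scale
for cluster in ["analytics", "etl"]:
    check_load(cluster)
for msg in topics["monitoring"]:
    should_scale(msg)
for msg in topics["scaling"]:
    do_scale(msg)
```

Decoupling the monitoring and scaling steps through topics is what lets each stage run as an independent, serverless handler.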

Visualization

We didn’t build any visualization into Shamash; however, since all metrics are reported to Stackdriver, you can build a dashboard showing the metrics Shamash tracks, as well as the number of nodes, workers, and preemptible workers.

The metric names are: ContainerPendingRatio, YARNMemoryAvailablePercentage, YarnNodes, Workers, and PreemptibleWorkers.

Local Development

For local development run:

dev_appserver.py --log_level=debug app.yaml

You will need a local config.json file in the following structure:

{ "project": "project-id" }

Contributing

We invite everyone to take part in improving Shamash by submitting issues and pull requests.
