mozilla / emr-bootstrap-spark

Licence: other
AWS bootstrap scripts for Mozilla's flavoured Spark setup.

Programming Languages

python
shell
Makefile

Projects that are alternatives of or similar to emr-bootstrap-spark

mercury
Mercury - data visualization and discovery with JavaScript, similar to Apache Zeppelin and Jupyter
Stars: ✭ 29 (-40.82%)
Mutual labels:  jupyter, zeppelin
swift-colab
Swift kernel for Google Colaboratory
Stars: ✭ 50 (+2.04%)
Mutual labels:  jupyter
itikz
Cell and line magic for PGF/TikZ-to-SVG rendering in Jupyter notebooks
Stars: ✭ 55 (+12.24%)
Mutual labels:  jupyter
ipython pytest
Pytest magic for IPython notebooks
Stars: ✭ 33 (-32.65%)
Mutual labels:  jupyter
workshop
Workshop: Micromagnetics with Ubermag
Stars: ✭ 19 (-61.22%)
Mutual labels:  jupyter
colour-notebooks
Colour - Jupyter Notebooks
Stars: ✭ 21 (-57.14%)
Mutual labels:  jupyter
zeppelin-spark-cassandra-demo
A demo explaining how to use Zeppelin notebook to access Apache Cassandra data via Apache Spark or CQL language
Stars: ✭ 17 (-65.31%)
Mutual labels:  zeppelin
mercury
Convert Python notebook to web app and share with non-technical users
Stars: ✭ 1,894 (+3765.31%)
Mutual labels:  jupyter
ijava-binder
An IJava binder base for trying the Java Jupyter kernel on https://mybinder.org/
Stars: ✭ 28 (-42.86%)
Mutual labels:  jupyter
Python-Course
🐍 This is the most complete course in Python, completely practical and all the lessons are explained with examples, so that they can be easily understood. 🍫
Stars: ✭ 18 (-63.27%)
Mutual labels:  jupyter
callisto
A command line utility to create kernels in Jupyter from virtual environments.
Stars: ✭ 15 (-69.39%)
Mutual labels:  jupyter
pydna
Clone with Python! Data structures for double stranded DNA & simulation of homologous recombination, Gibson assembly, cut & paste cloning.
Stars: ✭ 109 (+122.45%)
Mutual labels:  jupyter
picatrix
Picatrix is a library designed to help security analysts in a notebook environment, such as colab or jupyter.
Stars: ✭ 35 (-28.57%)
Mutual labels:  jupyter
drawdata
Draw datasets from within Jupyter.
Stars: ✭ 500 (+920.41%)
Mutual labels:  jupyter
observable-jupyter
Embed visualizations and code from Observable notebooks in Jupyter
Stars: ✭ 27 (-44.9%)
Mutual labels:  jupyter
ipyp5
p5.js Jupyter Widget
Stars: ✭ 33 (-32.65%)
Mutual labels:  jupyter
visualizing-geodata folium-bokeh-demo-
folium, bokeh, jupyter, python
Stars: ✭ 17 (-65.31%)
Mutual labels:  jupyter
leafmap
A Python package for interactive mapping and geospatial analysis with minimal coding in a Jupyter environment
Stars: ✭ 1,299 (+2551.02%)
Mutual labels:  jupyter
iqsharp
Microsoft's IQ# Server.
Stars: ✭ 112 (+128.57%)
Mutual labels:  jupyter
jupyterlab-desktop
JupyterLab desktop application, based on Electron.
Stars: ✭ 1,950 (+3879.59%)
Mutual labels:  jupyter

emr-bootstrap-spark

This package contains the AWS bootstrap scripts for Mozilla's flavoured Spark setup. The deployed scripts in S3 are referenced by ATMO clusters and Airflow jobs.

Interactive job

export SPARK_PROFILE=telemetry-spark-cloudformation-TelemetrySparkInstanceProfile-1SATUBVEXG7E3
export SPARK_BUCKET=telemetry-spark-emr-2
export KEY_NAME=20161025-dataops-dev
aws emr create-cluster \
  --region us-west-2 \
  --name SparkCluster \
  --instance-type c3.4xlarge \
  --instance-count 1 \
  --service-role EMR_DefaultRole \
  --ec2-attributes KeyName=${KEY_NAME},InstanceProfile=${SPARK_PROFILE} \
  --release-label emr-5.2.1 \
  --applications Name=Spark Name=Hive Name=Zeppelin \
  --bootstrap-actions Path=s3://${SPARK_BUCKET}/bootstrap/telemetry.sh \
  --configurations https://s3-us-west-2.amazonaws.com/${SPARK_BUCKET}/configuration/configuration.json \
  --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=TERMINATE_JOB_FLOW,Jar=s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=\["s3://${SPARK_BUCKET}/steps/zeppelin/zeppelin.sh"\]
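The create-cluster call prints a JSON document containing the new cluster's ClusterId, which follow-up commands such as aws emr describe-cluster need. A minimal helper to pull it out (a sketch, not part of this repository; the sample JSON stands in for real create-cluster output):

```shell
# Extract the ClusterId from `aws emr create-cluster` output so the cluster
# can be scripted afterwards (sketch only; pipe real output in practice).
cluster_id_from_json() {
  sed -n 's/.*"ClusterId": *"\([^"]*\)".*/\1/p'
}

# Sample document showing the shape of the CLI output.
sample='{"ClusterId": "j-1ABCDEFGHIJKL"}'
echo "$sample" | cluster_id_from_json
# → j-1ABCDEFGHIJKL
```

With the id in hand, aws emr describe-cluster --cluster-id <id> reports when the cluster reaches the WAITING state.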

Batch job

# Also export the vars from the 'interactive' section above.
export DATA_BUCKET=telemetry-public-analysis-2 # Or use the private bucket.
export CODE_BUCKET=telemetry-analysis-code-2
aws emr create-cluster \
  --region us-west-2 \
  --name SparkCluster \
  --instance-type c3.4xlarge \
  --instance-count 1 \
  --service-role EMR_DefaultRole \
  --ec2-attributes KeyName=${KEY_NAME},InstanceProfile=${SPARK_PROFILE} \
  --release-label emr-5.2.1 \
  --applications Name=Spark Name=Hive \
  --bootstrap-actions Path=s3://${SPARK_BUCKET}/bootstrap/telemetry.sh \
  --configurations https://s3-us-west-2.amazonaws.com/${SPARK_BUCKET}/configuration/configuration.json \
  --auto-terminate \
  --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=TERMINATE_JOB_FLOW,Jar=s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=\["s3://${SPARK_BUCKET}/steps/batch.sh","--job-name","foo","--notebook","s3://${CODE_BUCKET}/jobs/foo/Telemetry Hello World.ipynb","--data-bucket","${DATA_BUCKET}"\]
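The trickiest part of the step definition is the Args quoting: the notebook path contains a space, so every element must remain an individually quoted JSON string. A sketch that assembles the list from the same variables (illustrative only; "foo" is the placeholder job name from the command above):

```shell
# Assemble the Args JSON array for the batch step. The notebook path
# contains a space, so each element is quoted individually.
SPARK_BUCKET=telemetry-spark-emr-2
CODE_BUCKET=telemetry-analysis-code-2
DATA_BUCKET=telemetry-public-analysis-2
NOTEBOOK="s3://${CODE_BUCKET}/jobs/foo/Telemetry Hello World.ipynb"
ARGS="[\"s3://${SPARK_BUCKET}/steps/batch.sh\",\"--job-name\",\"foo\",\"--notebook\",\"${NOTEBOOK}\",\"--data-bucket\",\"${DATA_BUCKET}\"]"
echo "$ARGS"
```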

Deploy to AWS via ansible

To deploy to the staging location:

ansible-playbook ansible/deploy.yml -e '@ansible/envs/stage.yml' -i ansible/inventory

Once deployed, you can see the effects in action by launching a cluster via ATMO stage.

To deploy for production clusters:

ansible-playbook ansible/deploy.yml -e '@ansible/envs/production.yml' -i ansible/inventory
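Before touching production, ansible-playbook's standard --check flag (optionally with --diff) previews what would change without applying it. A side-effect-free sketch; the echo only prints the command so nothing is deployed:

```shell
# Preview a production deploy. --check and --diff are standard
# ansible-playbook flags: tasks that support check mode report changes
# without making them. (echo keeps this sketch side-effect free.)
echo ansible-playbook ansible/deploy.yml -e '@ansible/envs/production.yml' \
  -i ansible/inventory --check --diff
```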

The Spark Jupyter notebook configuration is hosted at https://s3-us-west-2.amazonaws.com/telemetry-spark-emr-2/credentials/jupyter_notebook_config.py. At the moment, this is only needed for the GitHub Gist export option in the Jupyter notebook. The credentials it contains are managed under the Mozilla GitHub account by :whd. This file should not be made public.

Contributing to emr-bootstrap-spark

You can set up a development environment to test and verify modifications to this repository.

Install prerequisite packages

pip install ansible boto boto3

Create and bootstrap the development environment

  • Define a new ansible environment in env/dev-<username>.yml
    • Set spark_emr_bucket to a unique bucket, e.g. telemetry-spark-emr-2-dev-<username>
    • Set stack_name to a unique name, e.g. telemetry-spark-cloudformation-dev-<username>
  • Recursively copy assets from staging to dev
    • aws s3 cp --recursive s3://telemetry-spark-emr-2-stage s3://telemetry-spark-emr-2-dev-<username>
  • Deploy to AWS using ansible-playbook on the new environment
  • Launch a new cluster using the appropriate SPARK_PROFILE and SPARK_BUCKET values
    • Set SPARK_PROFILE to the CloudFormation instance profile
      • This can be found as an output on the CloudFormation dashboard
      • Alternatively:
           aws cloudformation describe-stacks --stack-name telemetry-spark-cloudformation-dev-<username> \
             | jq -r '.Stacks[0].Outputs[0].OutputValue'
    • Set SPARK_BUCKET to the spark_emr_bucket value in env/dev-<username>.yml
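The steps above can be sketched as a single script. It is dry-run by default: the AWS and ansible commands are echoed rather than executed (set DRY_RUN= to run them), and "jdoe" is a hypothetical stand-in for <username>:

```shell
# Sketch of the dev-environment flow above; not part of the repository.
DRY_RUN=${DRY_RUN-echo}              # default: only print the commands
DEV_USER=${DEV_USER:-jdoe}           # hypothetical <username> placeholder
DEV_BUCKET=telemetry-spark-emr-2-dev-${DEV_USER}
STACK_NAME=telemetry-spark-cloudformation-dev-${DEV_USER}

# Recursively copy assets from staging into the dev bucket.
$DRY_RUN aws s3 cp --recursive s3://telemetry-spark-emr-2-stage "s3://${DEV_BUCKET}"

# Deploy with the new environment file.
$DRY_RUN ansible-playbook ansible/deploy.yml -e "@ansible/envs/dev-${DEV_USER}.yml" -i ansible/inventory

# Variables for launching a cluster against the dev stack; SPARK_PROFILE
# comes from the outputs of the CloudFormation stack named above.
export SPARK_BUCKET=${DEV_BUCKET}
echo "SPARK_BUCKET=${SPARK_BUCKET}"
echo "Look up SPARK_PROFILE in the outputs of stack ${STACK_NAME}"
```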