Azure / Aztk

Licence: MIT
AZTK powered by Azure Batch: On-demand, Dockerized, Spark Jobs on Azure

Azure Distributed Data Engineering Toolkit (AZTK)

Azure Distributed Data Engineering Toolkit (AZTK) is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.

This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

Status

This repository has been marked for archival. It is no longer maintained.

Notable Features

Setup

  1. Install aztk with pip:
    pip install aztk
  2. Initialize the project in a directory. This automatically creates a .aztk folder with config files in your working directory:
    aztk spark init
  3. Log in to or register for an Azure account, navigate to Azure Cloud Shell, and run:
    wget -q https://raw.githubusercontent.com/Azure/aztk/v0.10.3/account_setup.sh -O account_setup.sh &&
    chmod 755 account_setup.sh &&
    /bin/bash account_setup.sh
  4. Follow the on-screen prompts to create the necessary Azure resources, then copy the output into your .aztk/secrets.yaml file. For more information, see Getting Started Scripts.

Quickstart Guide

The core experience of this package is centered around a few commands.

# create your cluster
aztk spark cluster create
aztk spark cluster add-user
# monitor and manage your clusters
aztk spark cluster get
aztk spark cluster list
aztk spark cluster delete
# login and submit applications to your cluster
aztk spark cluster ssh
aztk spark cluster submit

1. Create and set up your cluster

First, create your cluster:

aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2
  • See our available VM sizes here.
  • The --vm-size argument must be the official SKU name, which usually comes in the form "standard_d2_v2"
  • You can create low-priority VMs at an 80% discount by using --size-low-pri instead of --size
  • By default, AZTK runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info here
  • By default, AZTK will create a user (with the username spark) for your cluster
  • The cluster id (--id) can contain only alphanumeric characters, hyphens, and underscores, and cannot be longer than 64 characters.
  • By default, you cannot create clusters of more than 20 cores in total. Visit this page to request a core quota increase.
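
The two constraints above (the cluster id rule and the default 20-core limit) can be checked locally before calling the CLI. The following is a minimal sketch, not part of AZTK: the function names and the `CORES_PER_VM` table are hypothetical, the table holds only the one VM size used in this guide, and the sketch assumes an empty id is invalid and ignores any difference between dedicated and low-priority quotas.

```python
import re

# Rule from the bullet list above: alphanumerics, hyphens, and underscores,
# at most 64 characters. Rejecting the empty string is an assumption.
CLUSTER_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")

# Hypothetical lookup table for illustration only; consult the Azure VM size
# documentation for other SKUs. standard_d2_v2 has 2 cores.
CORES_PER_VM = {"standard_d2_v2": 2}

DEFAULT_CORE_QUOTA = 20  # default limit mentioned above


def is_valid_cluster_id(cluster_id: str) -> bool:
    """Return True if the id satisfies the documented character and length rule."""
    return CLUSTER_ID_PATTERN.fullmatch(cluster_id) is not None


def total_cores(size: int, vm_size: str) -> int:
    """Number of cores requested by `size` nodes of the given VM size."""
    return size * CORES_PER_VM[vm_size.lower()]


def check_cluster_request(cluster_id, size, vm_size, quota=DEFAULT_CORE_QUOTA):
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    if not is_valid_cluster_id(cluster_id):
        problems.append(f"invalid cluster id: {cluster_id!r}")
    cores = total_cores(size, vm_size)
    if cores > quota:
        problems.append(f"{cores} cores requested, quota is {quota}")
    return problems


if __name__ == "__main__":
    # The example from this guide: 5 nodes x 2 cores = 10 cores, within the quota.
    print(check_cluster_request("my_cluster", 5, "standard_d2_v2"))
```

With these assumptions, the example cluster in this guide (5 nodes of standard_d2_v2) requests 10 cores, so it fits within the default quota.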

More information about using a cluster can be found in the cluster documentation.

2. Check on your cluster status

To check your cluster status, use the get command:

aztk spark cluster get --id my_cluster

3. Submit a Spark job

When your cluster is ready, you can submit jobs from your local machine to run against the cluster. The output of the spark-submit will be streamed to your local console. Run this command from the cloned AZTK repo:

# submit a java application
aztk spark cluster submit \
    --id my_cluster \
    --name my-java-job \
    --class org.apache.spark.examples.SparkPi \
    --executor-memory 20G \
    path/to/examples.jar 1000

# submit a python application
aztk spark cluster submit \
    --id my_cluster \
    --name my-python-job \
    --executor-memory 20G \
    path/to/pi.py 1000
  • The aztk spark cluster submit command takes the same parameters as the standard spark-submit command, except instead of specifying --master, AZTK requires that you specify your cluster --id and a unique job --name
  • The job name (--name) must be at least 3 characters long
    • It can contain only alphanumeric characters and hyphens; underscores are not allowed
    • It cannot contain uppercase letters
  • Each job you submit must have a unique name
  • Use the --no-wait option for your command to return immediately
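
The job name rules above are easy to get wrong, so it can help to validate the name before building the command. The sketch below is an illustration only: `is_valid_job_name` and `build_submit_command` are hypothetical helpers, not AZTK functions. It uses only the flags shown in this section, and placing the options before the application path is an assumption borrowed from the usual spark-submit layout.

```python
import re

# Rules from the list above: at least 3 characters, only lowercase letters,
# digits, and hyphens (no underscores, no uppercase letters).
JOB_NAME_PATTERN = re.compile(r"[a-z0-9-]{3,}")


def is_valid_job_name(name: str) -> bool:
    """Return True if the name satisfies the documented job name rules."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


def build_submit_command(cluster_id, job_name, app_path, app_args=(),
                         main_class=None, executor_memory=None, no_wait=False):
    """Build the argument list for `aztk spark cluster submit`.

    Only flags that appear in this guide are used. Uniqueness of the job
    name cannot be checked locally and is left to the caller.
    """
    if not is_valid_job_name(job_name):
        raise ValueError(f"invalid job name: {job_name!r}")
    cmd = ["aztk", "spark", "cluster", "submit",
           "--id", cluster_id, "--name", job_name]
    if main_class:
        cmd += ["--class", main_class]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if no_wait:
        cmd.append("--no-wait")
    cmd.append(app_path)
    cmd.extend(str(arg) for arg in app_args)
    return cmd


if __name__ == "__main__":
    print(build_submit_command(
        "my_cluster", "my-python-job", "path/to/pi.py",
        app_args=[1000], executor_memory="20G"))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where aztk is installed and configured.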

Learn more about the spark submit command here.

4. Log in and Interact with your Spark Cluster

Most users will want to work interactively with their Spark clusters. With the aztk spark cluster ssh command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:

aztk spark cluster ssh --id my_cluster --user spark

By default, we port forward the Spark Web UI to localhost:8080, Spark Jobs UI to localhost:4040, and the Spark History Server to localhost:18080.

You can configure these settings in the .aztk/ssh.yaml file.
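
To confirm that the forwards are active while the SSH session is open, you can probe the local ports. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the port numbers are the defaults listed above, so adjust them if you changed .aztk/ssh.yaml.

```python
import socket

# Default local ports listed above; these change if .aztk/ssh.yaml is edited.
DEFAULT_FORWARDS = {
    "Spark Web UI": 8080,
    "Spark Jobs UI": 4040,
    "Spark History Server": 18080,
}


def is_port_open(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def forwarded_ui_status(forwards=DEFAULT_FORWARDS, host="localhost"):
    """Map each UI name to a (url, reachable) pair."""
    return {
        name: (f"http://{host}:{port}", is_port_open(port, host=host))
        for name, port in forwards.items()
    }


if __name__ == "__main__":
    for name, (url, reachable) in forwarded_ui_status().items():
        state = "reachable" if reachable else "not reachable"
        print(f"{name}: {url} ({state})")
```

A port that reports "not reachable" usually means the SSH session is closed or the corresponding service is not running. The sketch only tests the TCP connection and does not verify what is listening.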

NOTE: When working interactively, you may want to use tools like Jupyter or RStudio-Server. To do so, you need to set up your cluster with the appropriate Docker image and plugin. See Plugins for more information.

5. Manage and Monitor your Spark Cluster

You can also see your clusters from the CLI:

aztk spark cluster list

And get the state of any specified cluster:

aztk spark cluster get --id <my_cluster_id>

Finally, you can delete any specified cluster:

aztk spark cluster delete --id <my_cluster_id>

FAQs

Next Steps

You can find more documentation here
