
microsoft / Azure-Databricks-NYC-Taxi-Workshop

License: MIT License
An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset

Programming Languages

scala
5932 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Azure-Databricks-NYC-Taxi-Workshop

databricks-notebooks
Collection of Databricks and Jupyter Notebooks
Stars: ✭ 19 (-73.24%)
Mutual labels:  pyspark, azure-databricks
az-ml-batch-score
Deploying a Batch Scoring Pipeline for Python Models
Stars: ✭ 17 (-76.06%)
Mutual labels:  azure-machine-learning
phrase-at-scale
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
Stars: ✭ 115 (+61.97%)
Mutual labels:  pyspark
Springboard-Data-Science-Immersive
No description or website provided.
Stars: ✭ 52 (-26.76%)
Mutual labels:  pyspark
aml-deploy
GitHub Action that allows you to deploy machine learning models in Azure Machine Learning.
Stars: ✭ 37 (-47.89%)
Mutual labels:  azure-machine-learning
kuwala
Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
Stars: ✭ 474 (+567.61%)
Mutual labels:  pyspark
aml-keras-image-recognition
A sample Azure Machine Learning project for Transfer Learning-based custom image recognition by utilizing Keras.
Stars: ✭ 14 (-80.28%)
Mutual labels:  azure-machine-learning
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+61.97%)
Mutual labels:  pyspark
machine-learning-course
Machine Learning Course @ Santa Clara University
Stars: ✭ 17 (-76.06%)
Mutual labels:  pyspark
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-76.06%)
Mutual labels:  pyspark
jobAnalytics and search
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-64.79%)
Mutual labels:  pyspark
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+4625.35%)
Mutual labels:  pyspark
DataEngineering
This repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (-33.8%)
Mutual labels:  pyspark
jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (+9.86%)
Mutual labels:  pyspark
dlsa
Distributed least squares approximation (dlsa) implemented with Apache Spark
Stars: ✭ 25 (-64.79%)
Mutual labels:  pyspark
pyspark-k8s-boilerplate
Boilerplate for PySpark on Cloud Kubernetes
Stars: ✭ 24 (-66.2%)
Mutual labels:  pyspark
Spark-for-data-engineers
Apache Spark for data engineers
Stars: ✭ 22 (-69.01%)
Mutual labels:  pyspark
aml-workspace
GitHub Action that allows you to create or connect to your Azure Machine Learning Workspace.
Stars: ✭ 22 (-69.01%)
Mutual labels:  azure-machine-learning
ODSC India 2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
Stars: ✭ 26 (-63.38%)
Mutual labels:  pyspark
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-52.11%)
Mutual labels:  pyspark

Azure Databricks NYC Taxi Workshop

This is a multi-part (free) workshop featuring Azure Databricks. It covers the basics of working with Azure Data Services from Spark on Databricks with the Chicago crimes public dataset, followed by an end-to-end data engineering workshop with the NYC Taxi public dataset, and finally an end-to-end machine learning workshop. The workshop is offered in Scala and Python.

The goal of this workshop is to deliver a clear understanding of how to provision Azure data services, show how those services integrate with Spark on Azure Databricks, give you end-to-end experience with basic data engineering and basic data science on Azure Databricks, and share some boilerplate code to use in your projects.

This is a community contribution, so we appreciate feedback and contributions.

Target Audience

  • Architects
  • Data Engineers
  • Data Scientists

Pre-Requisite Knowledge

  • Prior knowledge of Spark is beneficial
  • Familiarity/experience with Scala/Python

Azure Pre-Requisites

An Azure subscription with at least $200 of credit, sufficient for 10-14 hours of continuous usage.

1. Module 1 - Primer

This module covers the basics of integrating with Azure Data Services from Spark on Azure Databricks, in batch mode and with structured streaming.


At the end of this module, you will know how to provision, configure, and integrate from Spark with the following (illustrative PySpark sketches follow this list):

  1. Azure Storage - Blob Storage, ADLS Gen1, and ADLS Gen2; includes Databricks Delta as well
  2. Azure Event Hubs - publish and subscribe in batch and with structured streaming; includes Databricks Delta
  3. HDInsight Kafka - publish and subscribe in batch and with structured streaming; includes Databricks Delta
  4. Azure SQL Database - read/write primer in batch and with structured streaming
  5. Azure SQL Data Warehouse - read/write primer in batch and with structured streaming
  6. Azure Cosmos DB (core SQL API, document-oriented) - read/write primer in batch and with structured streaming; includes a structured streaming aggregation computation
  7. Azure Data Factory - automating Spark notebooks in Azure Databricks with Azure Data Factory version 2
  8. Azure Key Vault for secrets management
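
To give a flavor of the storage and Key Vault labs, the minimal PySpark sketch below reads raw data from ADLS Gen2 using a storage account key held in a Key Vault-backed Databricks secret scope. The scope, secret, storage account, and container names are illustrative placeholders rather than the workshop's actual values; dbutils and spark are pre-configured in Databricks notebooks.

```python
# Minimal sketch (hypothetical names): read a CSV from ADLS Gen2 using an
# account key stored in a Key Vault-backed Databricks secret scope.
storage_account = "myworkshopsa"  # placeholder storage account name
account_key = dbutils.secrets.get(scope="nyc-taxi-scope", key="storage-account-key")

# Authenticate this Spark session to the storage account
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

# Read raw data from the ADLS Gen2 filesystem (abfss) into a DataFrame
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(f"abfss://nyctaxi@{storage_account}.dfs.core.windows.net/raw/"))

df.printSchema()
```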

The Chicago crimes dataset is leveraged in the lab.
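
For the Event Hubs portion, a structured streaming subscribe looks roughly like the sketch below. It assumes the azure-eventhubs-spark connector library is attached to the cluster and that the connection string is kept in a secret scope; the scope and key names are hypothetical, and recent connector versions expect the connection string to be passed through EventHubsUtils.encrypt.

```python
# Minimal sketch: subscribe to an Azure Event Hub with structured streaming.
# Assumes the com.microsoft.azure:azure-eventhubs-spark connector is attached
# to the cluster; secret scope/key names are hypothetical placeholders.
connection_string = dbutils.secrets.get(scope="nyc-taxi-scope", key="eventhub-conn-string")

eh_conf = {
    # Recent connector versions require the connection string to be encrypted
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Each event arrives with a binary 'body' column; cast it to a string payload
stream_df = (spark.readStream
             .format("eventhubs")
             .options(**eh_conf)
             .load()
             .selectExpr("cast(body as string) as payload", "enqueuedTime"))

# Write to the console sink for a quick look (the labs persist to Delta instead)
query = (stream_df.writeStream
         .outputMode("append")
         .format("console")
         .start())
```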

2. Module 2 - Data Engineering Workshop

This is a batch-focused module and covers the building blocks of standing up a data engineering pipeline. The NYC taxi dataset (yellow and green taxi trips) is leveraged in the labs.
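
As a rough illustration of one such building block (not the workshop's actual transformation rules or schema), a batch curation step in PySpark might read the raw yellow taxi CSVs and persist a partitioned Delta copy; all paths and derived columns below are placeholders.

```python
from pyspark.sql import functions as F

# Minimal sketch of one batch curation step; paths, columns, and the partition
# scheme are illustrative placeholders.
raw_path = "abfss://nyctaxi@myworkshopsa.dfs.core.windows.net/raw/yellow/"
curated_path = "abfss://nyctaxi@myworkshopsa.dfs.core.windows.net/curated/yellow/"

yellow_df = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv(raw_path))

curated_df = (yellow_df
              .withColumn("taxi_type", F.lit("yellow"))
              .withColumn("pickup_ts", F.to_timestamp("tpep_pickup_datetime"))
              .withColumn("trip_year", F.year("pickup_ts"))
              .withColumn("trip_month", F.month("pickup_ts")))

# Persist a partitioned Delta copy for the downstream labs to query
(curated_df.write
 .format("delta")
 .mode("overwrite")
 .partitionBy("trip_year", "trip_month")
 .save(curated_path))
```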


3. Module 3 - Data Science Workshop

There are two versions of the Data Science Workshop: the Scala version shows Spark MLlib models, while the PySpark version shows Spark ML and the Azure Machine Learning service working together.

If you would like to run Module 3 as standalone, you'll need to:

  1. Provision:
    1. Azure Databricks
    2. Azure Storage account
    3. Azure Machine Learning services Workspace
  2. Import the DBC file into the Databricks workspace
  3. Set the module_3_only flag in 99-Shared-Functions-and-Settings to True

The following is a summary of the content covered (a short Spark ML and AML sketch follows the list):

  1. Perform feature engineering and feature selection activities
  2. Create an Azure Machine Learning (AML) service workspace
  3. Connect to an AML workspace
  4. Create PySpark models and leverage AML Experiment tracking
  5. Leverage Automated ML capabilities in AML
  6. Deploy the best performing model as a REST API in a Docker container
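
To illustrate how the Spark ML and AML pieces fit together, the sketch below trains a simple regression model and logs its RMSE to an AML experiment run. It assumes the azureml-core (v1) SDK is installed on the cluster and a workspace config.json has been downloaded; the table, column, and experiment names are hypothetical placeholders.

```python
from azureml.core import Workspace, Experiment
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Connect to the AML workspace described by config.json and name an experiment
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="nyc-taxi-fare")

# Hypothetical curated trips table produced by the data engineering module
trips_df = spark.table("taxi_db.yellow_trips_curated")

# Assemble features and fit a simple Spark ML regression model
assembler = VectorAssembler(inputCols=["trip_distance", "passenger_count"],
                            outputCol="features")
train_df, test_df = assembler.transform(trips_df).randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="total_amount").fit(train_df)

# Evaluate on the held-out split
rmse = RegressionEvaluator(labelCol="total_amount",
                           predictionCol="prediction",
                           metricName="rmse").evaluate(model.transform(test_df))

# Track the result as a run in the AML experiment
run = experiment.start_logging()
run.log("rmse", rmse)
run.complete()
```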


Credits

  • Anagha Khanolkar (Chicago) - creator and primary author of the workshop, content design, all development in Scala, primer module in PySpark
  • Ryan Murphy (St Louis) - contributions to the data engineering workshop transformation rules, schema, and more
  • Rajdeep Biswas (Houston) - writing the entire PySpark version of the data engineering lab
  • Steve Howard (St Louis) - contributing to the PySpark version of the data engineering lab
  • Erik Zwiefel (Minneapolis) - content design of the data science lab, PySpark version, Azure Machine Learning service integration for operationalization as a REST service, AutoML
  • Thomas Abraham (St Louis) - development of the ADFv2 integration primer in PySpark
  • Matt Stenzel, Christopher House (Minneapolis) - testing
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].