
microsoft / Azure-Databricks-NYC-Taxi-Workshop

License: MIT License
An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset

Programming Languages

scala
5932 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Azure-Databricks-NYC-Taxi-Workshop

databricks-notebooks
Collection of Databricks and Jupyter Notebooks
Stars: ✭ 19 (-73.24%)
Mutual labels:  pyspark, azure-databricks
az-ml-batch-score
Deploying a Batch Scoring Pipeline for Python Models
Stars: ✭ 17 (-76.06%)
Mutual labels:  azure-machine-learning
phrase-at-scale
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
Stars: ✭ 115 (+61.97%)
Mutual labels:  pyspark
Springboard-Data-Science-Immersive
No description or website provided.
Stars: ✭ 52 (-26.76%)
Mutual labels:  pyspark
aml-deploy
GitHub Action that allows you to deploy machine learning models in Azure Machine Learning.
Stars: ✭ 37 (-47.89%)
Mutual labels:  azure-machine-learning
kuwala
Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
Stars: ✭ 474 (+567.61%)
Mutual labels:  pyspark
aml-keras-image-recognition
A sample Azure Machine Learning project for Transfer Learning-based custom image recognition by utilizing Keras.
Stars: ✭ 14 (-80.28%)
Mutual labels:  azure-machine-learning
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+61.97%)
Mutual labels:  pyspark
machine-learning-course
Machine Learning Course @ Santa Clara University
Stars: ✭ 17 (-76.06%)
Mutual labels:  pyspark
sparklanes
A lightweight data processing framework for Apache Spark
Stars: ✭ 17 (-76.06%)
Mutual labels:  pyspark
jobAnalytics and search
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-64.79%)
Mutual labels:  pyspark
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+4625.35%)
Mutual labels:  pyspark
DataEngineering
This repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (-33.8%)
Mutual labels:  pyspark
jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (+9.86%)
Mutual labels:  pyspark
dlsa
Distributed least squares approximation (dlsa) implemented with Apache Spark
Stars: ✭ 25 (-64.79%)
Mutual labels:  pyspark
pyspark-k8s-boilerplate
Boilerplate for PySpark on Cloud Kubernetes
Stars: ✭ 24 (-66.2%)
Mutual labels:  pyspark
Spark-for-data-engineers
Apache Spark for data engineers
Stars: ✭ 22 (-69.01%)
Mutual labels:  pyspark
aml-workspace
GitHub Action that allows you to create or connect to your Azure Machine Learning Workspace.
Stars: ✭ 22 (-69.01%)
Mutual labels:  azure-machine-learning
ODSC India 2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
Stars: ✭ 26 (-63.38%)
Mutual labels:  pyspark
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-52.11%)
Mutual labels:  pyspark

Azure Databricks NYC Taxi Workshop

This is a multi-part (free) workshop featuring Azure Databricks. It covers the basics of working with Azure Data Services from Spark on Databricks with the Chicago crimes public dataset, followed by an end-to-end data engineering workshop with the NYC Taxi public dataset, and finally an end-to-end machine learning workshop. The workshop is offered in Scala and Python.

The goal of this workshop is to deliver a clear understanding of how to provision Azure data services, show how those services integrate with Spark on Azure Databricks, give you end-to-end experience with basic data engineering and basic data science on Azure Databricks, and share some boilerplate code to use in your projects.

This is a community contribution, so we appreciate feedback and contributions.

Target Audience

  • Architects
  • Data Engineers
  • Data Scientists

Pre-Requisite Knowledge

  • Prior knowledge of Spark is beneficial
  • Familiarity/experience with Scala/Python

Azure Pre-Requisites

An Azure subscription with at least $200 of credit, sufficient for 10-14 hours of continuous usage.

1. Module 1 - Primer

This module covers the basics of integrating with Azure Data Services from Spark on Azure Databricks, in batch mode and with structured streaming.


At the end of this module, you will know how to provision, configure, and integrate from Spark with the following (illustrative PySpark sketches follow this list):

  1. Azure Storage - Blob Storage, ADLS Gen1, and ADLS Gen2; includes Databricks Delta as well
  2. Azure Event Hubs - publish and subscribe in batch and with structured streaming; includes Databricks Delta
  3. HDInsight Kafka - publish and subscribe in batch and with structured streaming; includes Databricks Delta
  4. Azure SQL Database - read/write primer in batch and with structured streaming
  5. Azure SQL Data Warehouse - read/write primer in batch and with structured streaming
  6. Azure Cosmos DB (core SQL API, document-oriented) - read/write primer in batch and with structured streaming; includes a structured streaming aggregation computation
  7. Azure Data Factory - automating Spark notebooks in Azure Databricks with Azure Data Factory version 2
  8. Azure Key Vault for secrets management
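
To give a flavor of the storage and Key Vault labs, the minimal PySpark sketch below reads raw data from ADLS Gen2 using a storage account key held in a Key Vault-backed Databricks secret scope. The scope, secret, storage account, and container names are illustrative placeholders rather than the workshop's actual values; dbutils and spark are pre-configured in Databricks notebooks.

```python
# Minimal sketch (hypothetical names): read a CSV from ADLS Gen2 using an
# account key stored in a Key Vault-backed Databricks secret scope.
storage_account = "myworkshopsa"  # placeholder storage account name
account_key = dbutils.secrets.get(scope="nyc-taxi-scope", key="storage-account-key")

# Authenticate this Spark session to the storage account
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

# Read raw data from the ADLS Gen2 filesystem (abfss) into a DataFrame
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(f"abfss://nyctaxi@{storage_account}.dfs.core.windows.net/raw/"))

df.printSchema()
```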

The Chicago crimes dataset is leveraged in the lab.
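
For the Event Hubs portion, a structured streaming subscribe looks roughly like the sketch below. It assumes the azure-eventhubs-spark connector library is attached to the cluster and that the connection string is kept in a secret scope; the scope and key names are hypothetical, and recent connector versions expect the connection string to be passed through EventHubsUtils.encrypt.

```python
# Minimal sketch: subscribe to an Azure Event Hub with structured streaming.
# Assumes the com.microsoft.azure:azure-eventhubs-spark connector is attached
# to the cluster; secret scope/key names are hypothetical placeholders.
connection_string = dbutils.secrets.get(scope="nyc-taxi-scope", key="eventhub-conn-string")

eh_conf = {
    # Recent connector versions require the connection string to be encrypted
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Each event arrives with a binary 'body' column; cast it to a string payload
stream_df = (spark.readStream
             .format("eventhubs")
             .options(**eh_conf)
             .load()
             .selectExpr("cast(body as string) as payload", "enqueuedTime"))

# Write to the console sink for a quick look (the labs persist to Delta instead)
query = (stream_df.writeStream
         .outputMode("append")
         .format("console")
         .start())
```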

2. Module 2 - Data Engineering Workshop

This is a batch-focused module and covers the building blocks of standing up a data engineering pipeline. The NYC taxi dataset (yellow and green taxi trips) is leveraged in the labs.
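
As a rough illustration of one such building block (not the workshop's actual transformation rules or schema), a batch curation step in PySpark might read the raw yellow taxi CSVs and persist a partitioned Delta copy; all paths and derived columns below are placeholders.

```python
from pyspark.sql import functions as F

# Minimal sketch of one batch curation step; paths, columns, and the partition
# scheme are illustrative placeholders.
raw_path = "abfss://nyctaxi@myworkshopsa.dfs.core.windows.net/raw/yellow/"
curated_path = "abfss://nyctaxi@myworkshopsa.dfs.core.windows.net/curated/yellow/"

yellow_df = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv(raw_path))

curated_df = (yellow_df
              .withColumn("taxi_type", F.lit("yellow"))
              .withColumn("pickup_ts", F.to_timestamp("tpep_pickup_datetime"))
              .withColumn("trip_year", F.year("pickup_ts"))
              .withColumn("trip_month", F.month("pickup_ts")))

# Persist a partitioned Delta copy for the downstream labs to query
(curated_df.write
 .format("delta")
 .mode("overwrite")
 .partitionBy("trip_year", "trip_month")
 .save(curated_path))
```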


3. Module 3 - Data Science Workshop

There are two versions of the Data Science Workshop: the Scala version shows Spark MLlib models, while the PySpark version shows Spark ML and the Azure Machine Learning service working together.

If you would like to run Module 3 as standalone, you'll need to:

  1. Provision:
    1. Azure Databricks
    2. Azure Storage account
    3. Azure Machine Learning services Workspace
  2. Import the DBC file into the Databricks workspace
  3. Set the module_3_only flag in 99-Shared-Functions-and-Settings to True

The following is a summary of the content covered (a short Spark ML and AML sketch follows the list):

  1. Perform feature engineering and feature selection activities
  2. Create an Azure Machine Learning (AML) service workspace
  3. Connect to an AML workspace
  4. Create PySpark models and leverage AML Experiment tracking
  5. Leverage Automated ML capabilities in AML
  6. Deploy the best performing model as a REST API in a Docker container
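
To illustrate how the Spark ML and AML pieces fit together, the sketch below trains a simple regression model and logs its RMSE to an AML experiment run. It assumes the azureml-core (v1) SDK is installed on the cluster and a workspace config.json has been downloaded; the table, column, and experiment names are hypothetical placeholders.

```python
from azureml.core import Workspace, Experiment
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Connect to the AML workspace described by config.json and name an experiment
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="nyc-taxi-fare")

# Hypothetical curated trips table produced by the data engineering module
trips_df = spark.table("taxi_db.yellow_trips_curated")

# Assemble features and fit a simple Spark ML regression model
assembler = VectorAssembler(inputCols=["trip_distance", "passenger_count"],
                            outputCol="features")
train_df, test_df = assembler.transform(trips_df).randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="total_amount").fit(train_df)

# Evaluate on the held-out split
rmse = RegressionEvaluator(labelCol="total_amount",
                           predictionCol="prediction",
                           metricName="rmse").evaluate(model.transform(test_df))

# Track the result as a run in the AML experiment
run = experiment.start_logging()
run.log("rmse", rmse)
run.complete()
```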


Credits

  • Anagha Khanolkar (Chicago) - creator and primary author of the workshop, content design, all development in Scala, primer module in PySpark
  • Ryan Murphy (St Louis) - contributions to the data engineering workshop transformation rules, schema, and more
  • Rajdeep Biswas (Houston) - writing the entire PySpark version of the data engineering lab
  • Steve Howard (St Louis) - contributing to the PySpark version of the data engineering lab
  • Erik Zwiefel (Minneapolis) - content design of the data science lab, PySpark version, Azure Machine Learning service integration for operationalization as a REST service, AutoML
  • Thomas Abraham (St Louis) - development of the ADFv2 integration primer in PySpark
  • Matt Stenzel, Christopher House (Minneapolis) - testing
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].