
osin-vladimir / architect_big_data_solutions_with_spark

License: MIT
Code, labs, and lectures for the course

Programming Languages

Jupyter Notebook
Python
Shell

Projects that are alternatives to or similar to architect_big_data_solutions_with_spark

Datacleaner
The premier open source Data Quality solution
Stars: ✭ 391 (+877.5%)
Mutual labels:  etl, data-analysis
Awesome Business Intelligence
Actively curated list of awesome BI tools. PRs welcome!
Stars: ✭ 1,157 (+2792.5%)
Mutual labels:  etl, data-analysis
Getting Started
This repository is a getting started guide to Singer.
Stars: ✭ 734 (+1735%)
Mutual labels:  etl, data-analysis
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+4202.5%)
Mutual labels:  spark-streaming, databricks
Eland
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Stars: ✭ 235 (+487.5%)
Mutual labels:  etl, data-analysis
bandar-log
Monitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.
Stars: ✭ 20 (-50%)
Mutual labels:  etl, spark-streaming
Ether sql
A Python library to push Ethereum blockchain data into an SQL database.
Stars: ✭ 41 (+2.5%)
Mutual labels:  etl, data-analysis
dflib
In-memory Java DataFrame library
Stars: ✭ 50 (+25%)
Mutual labels:  etl, data-analysis
Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+12197.5%)
Mutual labels:  etl, data-analysis
Etl unicorn
Data visualization, data mining, and data processing ETL
Stars: ✭ 156 (+290%)
Mutual labels:  etl, data-analysis
dbt-databricks
A dbt adapter for Databricks.
Stars: ✭ 115 (+187.5%)
Mutual labels:  etl, databricks
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+97.5%)
Mutual labels:  etl, data-analysis
blackbricks
Black for Databricks notebooks
Stars: ✭ 40 (+0%)
Mutual labels:  databricks, databricks-notebooks
nutter
Testing framework for Databricks notebooks
Stars: ✭ 152 (+280%)
Mutual labels:  databricks, databricks-notebooks
hamilton
A scalable general-purpose micro-framework for defining dataflows. You can use it to create dataframes, NumPy matrices, Python objects, ML models, etc.
Stars: ✭ 612 (+1430%)
Mutual labels:  etl
antz
ANTz immersive 3D data visualization engine
Stars: ✭ 25 (-37.5%)
Mutual labels:  data-analysis
copulae
Multivariate data modelling with Copulas in Python
Stars: ✭ 96 (+140%)
Mutual labels:  data-analysis
python-notebooks
A collection of Jupyter Notebooks used in conferences or just to have some snippets.
Stars: ✭ 14 (-65%)
Mutual labels:  data-analysis
dsr
Introduction to Data Science with R (2017)
Stars: ✭ 25 (-37.5%)
Mutual labels:  data-analysis

Architect Big Data Solutions with Apache Spark


Introduction

This repository contains the lectures and code for a course that provides a gentle introduction to building distributed big data pipelines with Apache Spark. Apache Spark is an open-source data processing engine for engineers and analysts that includes an optimized general execution runtime and a set of standard libraries for building data pipelines, advanced algorithms, and more. Spark is rapidly becoming the compute engine of choice for big data: Spark programs are more concise and often run 10 to 100 times faster than Hadoop MapReduce jobs, and as companies realize this, Spark developers are becoming increasingly valued.
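As a taste of that conciseness, here is a minimal PySpark word count, the canonical example that takes far more boilerplate as a Hadoop MapReduce job. This is an illustrative sketch, not course code, and the input path is a placeholder:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("word-count").getOrCreate()

# Read a text file, split each line into words, and count occurrences.
# "data.txt" is a placeholder; point it at any text file.
counts = (
    spark.read.text("data.txt")
    .selectExpr("explode(split(value, ' ')) AS word")
    .groupBy("word")
    .count()
)

counts.orderBy("count", ascending=False).show(10)
```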

In this course we cover both the architectural and the practical side of using Apache Spark to implement big data solutions. We use Spark Core, Spark SQL, Spark Streaming, and Spark ML to implement advanced analytics and machine learning algorithms in a production-like data pipeline. The course will sharpen your skills in designing solutions for common big data tasks, such as creating batch and real-time data processing pipelines, doing machine learning at scale, deploying machine learning models into a production environment, and much more.
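To make one of those modules concrete, the sketch below shows the Spark SQL piece: the same small dataset queried through both the DataFrame API and plain SQL. The sample rows are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

# A tiny in-memory DataFrame standing in for a real data source.
ratings = spark.createDataFrame(
    [(1, "Toy Story", 4.0), (2, "Jumanji", 3.5), (1, "Jumanji", 2.0)],
    ["user_id", "title", "rating"],
)

# Registering a temp view lets you query the same data with plain SQL.
ratings.createOrReplaceTempView("ratings")
spark.sql("""
    SELECT title, AVG(rating) AS avg_rating
    FROM ratings
    GROUP BY title
    ORDER BY avg_rating DESC
""").show()
```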


Content

  1. Introduction [lecture 1] [labs] [pyspark Python cheat sheet]
  2. SQL and DataFrame [labs] [pyspark SQL cheat sheet]
  3. Batch Processing [lecture 2] [lecture 3]
  4. Stream Processing [lecture 4] [lecture 5] [labs] (see the streaming sketch after this list)
  5. Machine Learning [lecture 6] [labs]
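For a flavor of the stream-processing topic, here is a minimal Structured Streaming word count over a local socket (feed it with `nc -lk 9999`). The host and port are placeholders, and the course's streaming labs may use a different API:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read an unbounded stream of lines from a local socket.
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Running word count over the stream.
counts = (
    lines.select(explode(split(lines.value, " ")).alias("word"))
    .groupBy("word")
    .count()
)

# Print each micro-batch's updated counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```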

Computational Resources

  1. Please register for the Databricks Community Edition here.
  2. Please register for a free-tier AWS account here.

Data Sources

You can find the data and additional information at the links below:

  1. MovieLens Dataset
  2. House Prices: Advanced Regression Techniques
  3. Titanic: Machine Learning from Disaster

Note: For your convenience, the data has already been downloaded to the Datasets folder of this repository.
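As a quick sanity check on the bundled data, a dataset such as Titanic can be loaded and explored in a few lines; the exact file name below is an assumption, so adjust it to the repository layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("titanic-peek").getOrCreate()

# Load the Titanic data from the repository's Datasets folder.
# "Datasets/titanic.csv" is an assumed path; adjust as needed.
titanic = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("Datasets/titanic.csv")
)

titanic.printSchema()

# Survival rate by passenger class (columns per the Kaggle Titanic schema).
titanic.groupBy("Pclass").avg("Survived").show()
```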

Note: You can upload data to Databricks directly or use an AWS S3 bucket for storage.
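If you go the S3 route, a DataFrame can be read straight from a bucket once the cluster has AWS credentials attached; the bucket and key below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

# Hypothetical bucket and key; on Databricks the s3a:// scheme works once
# credentials (an instance profile or access keys) are configured.
df = (
    spark.read.option("header", True)
    .csv("s3a://my-course-bucket/datasets/movielens/ratings.csv")
)
df.show(5)
```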


Additional Resources

We provide links to helpful cheat sheets and books to make the course as smooth as possible:

  1. A Gentle Introduction to Apache Spark
  2. How to import data to Databricks using S3
  3. Python Cheat Sheet
  4. Machine Learning Tutorial for AWS
  5. Databricks Development Documentation
  6. Developers Guide for AWS Machine Learning
  7. Superset

Course Initiative

If you like this initiative, please star or fork this repository, and feel free to contribute via pull requests.


Places where this course has been taught in person
