
osin-vladimir / architect_big_data_solutions_with_spark

License: MIT
Code, labs, and lectures for the course

Programming Languages

Jupyter Notebook
Python
Shell

Projects that are alternatives to or similar to architect_big_data_solutions_with_spark

Datacleaner
The premier open source Data Quality solution
Stars: ✭ 391 (+877.5%)
Mutual labels:  etl, data-analysis
Awesome Business Intelligence
Actively curated list of awesome BI tools. PRs welcome!
Stars: ✭ 1,157 (+2792.5%)
Mutual labels:  etl, data-analysis
Getting Started
This repository is a getting started guide to Singer.
Stars: ✭ 734 (+1735%)
Mutual labels:  etl, data-analysis
Spark
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Stars: ✭ 1,721 (+4202.5%)
Mutual labels:  spark-streaming, databricks
Eland
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Stars: ✭ 235 (+487.5%)
Mutual labels:  etl, data-analysis
bandar-log
Monitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.
Stars: ✭ 20 (-50%)
Mutual labels:  etl, spark-streaming
Ether sql
A Python library to push Ethereum blockchain data into an SQL database.
Stars: ✭ 41 (+2.5%)
Mutual labels:  etl, data-analysis
dflib
In-memory Java DataFrame library
Stars: ✭ 50 (+25%)
Mutual labels:  etl, data-analysis
Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+12197.5%)
Mutual labels:  etl, data-analysis
Etl unicorn
Data visualization, data mining, and data processing ETL
Stars: ✭ 156 (+290%)
Mutual labels:  etl, data-analysis
dbt-databricks
A dbt adapter for Databricks.
Stars: ✭ 115 (+187.5%)
Mutual labels:  etl, databricks
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (+97.5%)
Mutual labels:  etl, data-analysis
blackbricks
Black for Databricks notebooks
Stars: ✭ 40 (+0%)
Mutual labels:  databricks, databricks-notebooks
nutter
Testing framework for Databricks notebooks
Stars: ✭ 152 (+280%)
Mutual labels:  databricks, databricks-notebooks
hamilton
A scalable general-purpose micro-framework for defining dataflows. You can use it to create dataframes, NumPy matrices, Python objects, ML models, etc.
Stars: ✭ 612 (+1430%)
Mutual labels:  etl
antz
ANTz immersive 3D data visualization engine
Stars: ✭ 25 (-37.5%)
Mutual labels:  data-analysis
copulae
Multivariate data modelling with Copulas in Python
Stars: ✭ 96 (+140%)
Mutual labels:  data-analysis
python-notebooks
A collection of Jupyter Notebooks used in conferences or just to have some snippets.
Stars: ✭ 14 (-65%)
Mutual labels:  data-analysis
dsr
Introduction to Data Science with R (2017)
Stars: ✭ 25 (-37.5%)
Mutual labels:  data-analysis

Architect Big Data Solutions with Apache Spark


Introduction

This repository contains the lectures and code for a course that provides a gentle introduction to building distributed big data pipelines with Apache Spark. Apache Spark is an open-source data processing engine for engineers and analysts that includes an optimized general execution runtime and a set of standard libraries for building data pipelines, advanced algorithms, and more. Spark is rapidly becoming the compute engine of choice for big data: Spark programs are more concise and often run 10 to 100 times faster than Hadoop MapReduce jobs, and as companies realize this, Spark developers are becoming increasingly valued.
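As a taste of that conciseness, here is a minimal PySpark word count, the canonical example that takes far more boilerplate as a Hadoop MapReduce job. This is an illustrative sketch, not course code, and the input path is a placeholder:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("word-count").getOrCreate()

# Read a text file, split each line into words, and count occurrences.
# "data.txt" is a placeholder; point it at any text file.
counts = (
    spark.read.text("data.txt")
    .selectExpr("explode(split(value, ' ')) AS word")
    .groupBy("word")
    .count()
)

counts.orderBy("count", ascending=False).show(10)
```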

In this course we cover both the architectural and the practical side of using Apache Spark to implement big data solutions. We use Spark Core, Spark SQL, Spark Streaming, and Spark ML to implement advanced analytics and machine learning algorithms in a production-like data pipeline. The course will sharpen your skills in designing solutions for common big data tasks, such as creating batch and real-time data processing pipelines, doing machine learning at scale, deploying machine learning models into a production environment, and much more.
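To make one of those modules concrete, the sketch below shows the Spark SQL piece: the same small dataset queried through both the DataFrame API and plain SQL. The sample rows are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

# A tiny in-memory DataFrame standing in for a real data source.
ratings = spark.createDataFrame(
    [(1, "Toy Story", 4.0), (2, "Jumanji", 3.5), (1, "Jumanji", 2.0)],
    ["user_id", "title", "rating"],
)

# Registering a temp view lets you query the same data with plain SQL.
ratings.createOrReplaceTempView("ratings")
spark.sql("""
    SELECT title, AVG(rating) AS avg_rating
    FROM ratings
    GROUP BY title
    ORDER BY avg_rating DESC
""").show()
```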


Content

  1. Introduction [lecture 1] [labs] [pyspark Python cheat sheet]
  2. SQL and DataFrame [labs] [pyspark SQL cheat sheet]
  3. Batch Processing [lecture 2] [lecture 3]
  4. Stream Processing [lecture 4] [lecture 5] [labs] (see the streaming sketch after this list)
  5. Machine Learning [lecture 6] [labs]
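For a flavor of the stream-processing topic, here is a minimal Structured Streaming word count over a local socket (feed it with `nc -lk 9999`). The host and port are placeholders, and the course's streaming labs may use a different API:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read an unbounded stream of lines from a local socket.
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Running word count over the stream.
counts = (
    lines.select(explode(split(lines.value, " ")).alias("word"))
    .groupBy("word")
    .count()
)

# Print each micro-batch's updated counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```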

Computational Resources

  1. Please register for the Databricks Community Edition here.
  2. Please register for a free-tier AWS account here.

Data Sources

You can find the data and additional information at the links below:

  1. MovieLens Dataset
  2. House Prices: Advanced Regression Techniques
  3. Titanic: Machine Learning from Disaster

Note: For your convenience, the data has already been downloaded to the Datasets folder of this repository.
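As a quick sanity check on the bundled data, a dataset such as Titanic can be loaded and explored in a few lines; the exact file name below is an assumption, so adjust it to the repository layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("titanic-peek").getOrCreate()

# Load the Titanic data from the repository's Datasets folder.
# "Datasets/titanic.csv" is an assumed path; adjust as needed.
titanic = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("Datasets/titanic.csv")
)

titanic.printSchema()

# Survival rate by passenger class (columns per the Kaggle Titanic schema).
titanic.groupBy("Pclass").avg("Survived").show()
```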

Note: You can upload data to Databricks directly or use an AWS S3 bucket for storage.
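If you go the S3 route, a DataFrame can be read straight from a bucket once the cluster has AWS credentials attached; the bucket and key below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

# Hypothetical bucket and key; on Databricks the s3a:// scheme works once
# credentials (an instance profile or access keys) are configured.
df = (
    spark.read.option("header", True)
    .csv("s3a://my-course-bucket/datasets/movielens/ratings.csv")
)
df.show(5)
```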


Additional Resources

We provide links to helpful cheat sheets and books to make the course as smooth as possible:

  1. A Gentle Introduction to Apache Spark
  2. How to import data to Databricks using S3
  3. Python Cheat Sheet
  4. Machine Learning Tutorial for AWS
  5. Databricks Development Documentation
  6. Developers Guide for AWS Machine Learning
  7. Superset

Course Initiative

If you like this initiative, please star or fork this repository, and feel free to contribute via pull requests.


Places where this course has been taught in person
