
rjurney / Agile_data_code_2

License: MIT
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Programming Languages

python
139335 projects - #7 most used programming language
python3
1442 projects

Projects that are alternatives of or similar to Agile data code 2

Wirbelsturm
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Stars: ✭ 332 (-19.61%)
Mutual labels:  apache-kafka, kafka, spark, apache-spark, vagrant
Introduction Datascience Python Book
Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications
Stars: ✭ 275 (-33.41%)
Mutual labels:  jupyter-notebook, data-science, analytics, data
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-63.68%)
Mutual labels:  jupyter-notebook, spark, analytics, apache-spark
Kafka Storm Starter
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Stars: ✭ 728 (+76.27%)
Mutual labels:  apache-kafka, kafka, spark, apache-spark
Udacity Data Engineering
Udacity Data Engineering Nano Degree (DEND)
Stars: ✭ 89 (-78.45%)
Mutual labels:  airflow, jupyter-notebook, spark
Stats Maths With Python
General statistics, mathematical programming, and numerical/scientific computing scripts and notebooks in Python
Stars: ✭ 381 (-7.75%)
Mutual labels:  jupyter-notebook, data-science, analytics
Data Science Stack Cookiecutter
🐳📊🤓 Cookiecutter template to launch an awesome dockerized Data Science toolstack (incl. Jupyter, Superset, Postgres, Minio, Airflow & API Star)
Stars: ✭ 153 (-62.95%)
Mutual labels:  airflow, jupyter-notebook, data-science
Spark Jupyter Aws
A guide on how to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support
Stars: ✭ 259 (-37.29%)
Mutual labels:  jupyter-notebook, spark, apache-spark
Notebooks Statistics And Machinelearning
Jupyter Notebooks from the old UnsupervisedLearning.com (RIP) machine learning and statistics blog
Stars: ✭ 270 (-34.62%)
Mutual labels:  jupyter-notebook, data-science, machine-learning-algorithms
Data Science Hacks
Data Science Hacks consists of tips, tricks to help you become a better data scientist. Data science hacks are for all - beginner to advanced. Data science hacks consist of python, jupyter notebook, pandas hacks and so on.
Stars: ✭ 273 (-33.9%)
Mutual labels:  jupyter-notebook, data-science, data
Zat
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
Stars: ✭ 303 (-26.63%)
Mutual labels:  jupyter-notebook, kafka, spark
Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+92.01%)
Mutual labels:  airflow, spark, apache-spark
Azkarra Streams
🚀 Azkarra is a lightweight Java framework to make it easy to develop, deploy and manage cloud-native streaming microservices based on Apache Kafka Streams.
Stars: ✭ 146 (-64.65%)
Mutual labels:  apache-kafka, kafka, data
Beyond Jupyter
🐍💻📊 All material from the PyCon.DE 2018 Talk "Beyond Jupyter Notebooks - Building your own data science platform with Python & Docker" (incl. Slides, Video, Udemy MOOC & other References)
Stars: ✭ 135 (-67.31%)
Mutual labels:  airflow, jupyter-notebook, data-science
Oryx
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Stars: ✭ 1,785 (+332.2%)
Mutual labels:  apache-kafka, kafka, apache-spark
Awesome Pulsar
A curated list of Pulsar tools, integrations and resources.
Stars: ✭ 57 (-86.2%)
Mutual labels:  apache-kafka, spark, apache-spark
Spark Notebook
Interactive and Reactive Data Science using Scala and Spark.
Stars: ✭ 3,081 (+646%)
Mutual labels:  data-science, spark, apache-spark
Awesome Kafka
A list about Apache Kafka
Stars: ✭ 397 (-3.87%)
Mutual labels:  apache-kafka, kafka, apache-spark
Mydatascienceportfolio
Applying Data Science and Machine Learning to Solve Real World Business Problems
Stars: ✭ 227 (-45.04%)
Mutual labels:  jupyter-notebook, data-science, spark
Datascience course
Curso de Data Science em Português
Stars: ✭ 294 (-28.81%)
Mutual labels:  jupyter-notebook, data-science, data

Agile_Data_Code_2

Code for Agile Data Science 2.0, O'Reilly 2017. Now available at the O'Reilly Store, on Amazon (in Paperback and Kindle) and on O'Reilly Safari. Also available anywhere technical books are sold!

This is also the code for the Realtime Predictive Analytics video course and the Introduction to PySpark live course!

Have problems? Please file an issue!

Data Syndrome

Like my work? I am Principal Consultant at Data Syndrome, a consultancy offering assistance and training with building full-stack analytics products, applications and systems. Find us on the web at datasyndrome.com.

Data Syndrome Logo

Realtime Predictive Analytics Course

There is now a video course using code from chapter 8, Realtime Predictive Analytics with Kafka, PySpark, Spark MLlib and Spark Streaming. Check it out now at datasyndrome.com/video.

A free preview of the course is available at https://vimeo.com/202336113

Installation

There are two methods of installation: Vagrant/VirtualBox or Amazon EC2.

Amazon EC2

Amazon EC2 is the preferred environment for this book/course because it is simple and painless: installation takes just a few moments.

First you will need to install the AWS CLI:

pip install awscli

Next you must configure the AWS CLI with your credentials (see Set up AWS Credentials and Region for Development, or the summary below):

aws configure

Provide your access key ID and secret access key (available from the AWS Dashboard under the user name dropdown (e.g. Russell Jurney) -> My Security Credentials -> Access keys), a default AWS region (for example, 'us-west-2' or 'us-east-1'), and 'json' as the output format.
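The prompts look like the following; the key values shown here are AWS's documented placeholder credentials, not real ones:

aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json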

Now run the following command to bring up a machine pre-configured with the book's complete environment and source code:

./ec2.sh

How it Works

The script ec2.sh uses the file aws/ec2_bootstrap.sh as --user-data to boot a single r3.xlarge EC2 instance in the us-east-1 region with all dependencies installed and running.

In addition, it uses the AWS CLI to create a key-pair called agile_data_science (which then appears in this directory under agile_data_science.pem). It also creates a security group called agile_data_science with port 22 open only to your external IP address.

Note: this script uses the utility jq to parse the JSON returned by the AWS CLI. The script will detect whether jq is installed and will attempt to use the script jq_install.sh to install it locally if it is not present. If the install fails, you will be instructed to install it yourself.
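If you are curious what the script automates, its AWS CLI calls are roughly equivalent to the sketch below. This is illustrative only (the AMI ID is a placeholder and the flags are approximate); ec2.sh itself is the authoritative version:

# Create the key-pair and save the private key locally
aws ec2 create-key-pair --key-name agile_data_science \
  | jq -r .KeyMaterial > agile_data_science.pem
chmod 0600 agile_data_science.pem

# Create a security group with port 22 open only to your external IP
aws ec2 create-security-group --group-name agile_data_science \
  --description "Agile Data Science 2.0"
aws ec2 authorize-security-group-ingress --group-name agile_data_science \
  --protocol tcp --port 22 --cidr $(curl -s ifconfig.me)/32

# Boot the instance with the bootstrap script as --user-data
aws ec2 run-instances \
  --image-id ami-XXXXXXXX \
  --instance-type r3.xlarge \
  --key-name agile_data_science \
  --security-groups agile_data_science \
  --user-data file://aws/ec2_bootstrap.sh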

Next Steps

When it succeeds, the ec2.sh script will print instructions on what to do next: how to ssh into the EC2 instance, and how to create an ssh tunnel that forwards web applications running on the instance to your local port 5000, where you can view them at http://localhost:5000.

The script to create an ssh tunnel is ec2_create_tunnel.sh.
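If you ever need to recreate the tunnel by hand, it is an ordinary ssh port forward along these lines (a sketch, not the script's exact invocation; substitute the hostname the script printed, and note the ubuntu login user is an assumption):

ssh -i ./agile_data_science.pem -N \
  -L 5000:localhost:5000 ubuntu@<your-ec2-hostname>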

Now jump ahead to "Downloading Data".

Vagrant/VirtualBox Install

Note: Vagrant 2.2+ is required. The Ubuntu package is out of date, so check your version with: vagrant --version

Installation takes a few minutes, using Vagrant and VirtualBox.

Note: the Vagrant/VirtualBox method requires 9GB of free RAM, which will mean closing most programs on a 16GB MacBook Pro. If you don't close nearly everything, you will run out of RAM and your system will crash. Use the EC2 method if this is a problem for you.

vagrant box update
vagrant up
vagrant ssh
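The memory requirement noted above comes from the virtual machine's definition. In a Vagrantfile it is set by a provider block along these lines (an illustrative excerpt, not this repo's exact Vagrantfile):

# Vagrantfile excerpt (illustrative); VirtualBox memory is set in MB
config.vm.provider "virtualbox" do |vb|
  vb.memory = 9216  # roughly the 9GB noted above
end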

Now jump ahead to Downloading Data.

Manual Install

For a manual install, read Appendix A for further setup instructions. Check out manual_install.sh if you want to install the tools yourself and run the example code.

Note: You must READ THE MANUAL INSTALL SCRIPT BEFORE RUNNING IT. It does things to your ~/.bash_profile that you should know about. Again, this is not recommended for beginners.

Note: You must have Java installed on your computer for these instructions to work. You can find more information about how to install Java here: https://www.java.com/en/download/help/download_options.xml
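To give a sense of what to look for when you read manual_install.sh, the changes to ~/.bash_profile are typically environment exports of this kind (hypothetical paths, for illustration only):

# Illustrative only -- read manual_install.sh for the actual changes
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME="$HOME/spark"
export PATH="$PATH:$SPARK_HOME/bin"
export PYSPARK_PYTHON=python3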

Downloading Data

Once the server comes up, download the data and you are ready to go. First, change into the Agile_Data_Code_2 directory.

cd Agile_Data_Code_2

Now download the data, depending on which activity this is for.

For the book Agile Data Science 2.0, run:

./download.sh

For the Introduction to PySpark course, run:

./intro_download.sh

For the Realtime Predictive Analytics video course, or to skip ahead to chapter 8 in the book, run:

ch08/download_data.sh

Running Examples

All scripts run from the base directory, except the web apps, which run from their own chapter directories (e.g. ch08/web/).
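For instance (the script names here are hypothetical, for illustration only):

# Chapter scripts run from the repo root
python ch02/some_script.py

# Web apps are the exception: run them from their own directory
cd ch08/web
python some_flask_app.py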

Jupyter Notebooks

All notebooks assume you have run the jupyter notebook command from the project root directory, Agile_Data_Code_2. If you are using a virtual machine image (Vagrant/VirtualBox or EC2), jupyter notebook is already running. See the directions on port mapping to proceed.
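For a manual install, that means (assuming Jupyter's default port, 8888):

cd Agile_Data_Code_2
jupyter notebook
# then browse to http://localhost:8888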

The Data Value Pyramid

Originally described by Pete Warden, the data value pyramid is how the book is organized and structured. We climb it as the chapters progress.

Data Value Pyramid

System Architecture

The following diagrams are pulled from the book and express the basic concepts of the system architecture. The front-end and back-end architectures work together to make a complete predictive system.

Front End Architecture

This diagram shows how the front-end architecture works in our flight delay prediction application. The user fills out a form with some basic information on a web page, which is submitted to the server. The server fills in necessary fields derived from those in the form, like "day of year", and emits a Kafka message containing a prediction request. Spark Streaming is listening on a Kafka queue for these requests; it makes the prediction and stores the result in MongoDB. Meanwhile, the client has received a UUID in the form's response and has been polling another endpoint every second. Once the data is available in Mongo, the client's next request picks it up. Finally, the client displays the result of the prediction to the user!
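To make the loop concrete, here is a minimal sketch of the submit-and-poll pattern in Flask. The topic, database, collection, and field names are hypothetical placeholders; the real implementation lives in ch08/web/ and differs in detail:

import json
import uuid

from flask import Flask, jsonify, request
from kafka import KafkaProducer  # kafka-python
from pymongo import MongoClient

app = Flask(__name__)
producer = KafkaProducer(bootstrap_servers="localhost:9092")
mongo = MongoClient()

@app.route("/flights/delays/predict", methods=["POST"])
def submit_prediction_request():
    prediction_request = dict(request.form)
    # Derive the fields the model needs but the user does not enter,
    # e.g. "day of year" from the flight date (omitted here)
    prediction_request["UUID"] = str(uuid.uuid4())
    producer.send(
        "prediction_requests",  # hypothetical topic name
        json.dumps(prediction_request).encode("utf-8"),
    )
    # The client uses this UUID to poll for the finished prediction
    return jsonify({"id": prediction_request["UUID"]})

@app.route("/flights/delays/predict/response/<request_uuid>")
def poll_for_response(request_uuid):
    # Spark Streaming writes finished predictions to this collection
    result = mongo.agile_data_science.predictions.find_one(
        {"UUID": request_uuid})
    if result is None:
        return jsonify({"status": "WAIT"})
    result.pop("_id")  # ObjectId is not JSON serializable
    return jsonify({"status": "OK", "prediction": result})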

This system is extremely fun to set up, operate, and watch. Check out chapters 7 and 8 for more information!

Front End Architecture

Back End Architecture

The back-end architecture diagram shows how we train a classifier model in batch in Spark, using historical data (all flights from 2015) on disk (HDFS, Amazon S3, etc.), to predict flight delays. We save the model to disk when it is ready. Next, we launch Zookeeper and a Kafka queue. We use Spark Streaming to load the classifier model and then listen for prediction requests on a Kafka queue. When a prediction request arrives, Spark Streaming makes the prediction and stores the result in MongoDB, where the web application can pick it up.
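Sketched in PySpark, the two halves look roughly like this. Paths, column names, and the choice of classifier are hypothetical, sc and spark are assumed to come from a PySpark session, and feature vectorization is omitted; the KafkaUtils API shown is the DStream-era Kafka integration of this book's Spark version:

# Batch half: train a model on the historical data and save it
from pyspark.ml.classification import RandomForestClassifier

training_data = spark.read.parquet("data/flight_delay_features.parquet")
classifier = RandomForestClassifier(labelCol="ArrDelayBucket")
model = classifier.fit(training_data)  # assumes a "features" vector column
model.write().overwrite().save("models/flight_delay_classifier.bin")

# Streaming half: load the same model and score Kafka requests
from pyspark.ml.classification import RandomForestClassificationModel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

model = RandomForestClassificationModel.load("models/flight_delay_classifier.bin")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches
stream = KafkaUtils.createDirectStream(
    ssc,
    ["prediction_requests"],  # hypothetical topic name
    {"metadata.broker.list": "localhost:9092"},
)

def predict_and_store(rdd):
    if rdd.isEmpty():
        return
    # Each Kafka message is a (key, value) pair; the value is JSON
    requests_df = spark.read.json(rdd.map(lambda pair: pair[1]))
    predictions = model.transform(requests_df)
    # Write each row to MongoDB, where the web app polls for it (omitted)

stream.foreachRDD(predict_and_store)
ssc.start()
ssc.awaitTermination()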

This architecture is extremely powerful, and it is a huge benefit that we get to use the same code in batch and in realtime with PySpark Streaming.

Backend Architecture

Screenshots

Below are some examples of parts of the application we build in this book and in this repo. Check out the book for more!

Airline Entity Page

Each airline gets its own entity page, complete with a summary of its fleet and a description pulled from Wikipedia.

Airline Page

Airplane Fleet Page

We demonstrate summarizing an entity with an airplane fleet page which describes the entire fleet.

Airplane Fleet Page

Flight Delay Prediction UI

We create an entire realtime predictive system with a web front-end to submit prediction requests.

Predicting Flight Delays UI
