
datamindedbe / Python And Spark For Data Analysis

A four-day course on Python, the Scientific Python stack and PySpark, adapted from a training course given by Patrick Varilly to one of our clients in December 2015

Projects that are alternatives of or similar to Python And Spark For Data Analysis

Machinelearningtutorial
Short Machine Learning Tutorial
Stars: ✭ 9 (-10%)
Mutual labels:  jupyter-notebook
Keras Tutorial
3-hour tutorial on building deep learning models with Keras.
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Variational Autoencoders Summerschool 2016
Exercises for the semi-supervised summer school https://semisupervised-learning.compute.dtu.dk.
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Sphere Challenge
SPHERE Challenge: Activity Recognition with Multimodal Sensor Data
Stars: ✭ 9 (-10%)
Mutual labels:  jupyter-notebook
Algorithmic Trading Python
The repository for freeCodeCamp's YouTube course, Algorithmic Trading in Python
Stars: ✭ 846 (+8360%)
Mutual labels:  jupyter-notebook
Resnet
TensorFlow ResNet implementation on CIFAR-10
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Novel Twitter Anomalies Pydatalondon2016
Detect novel anomalies on Twitter
Stars: ✭ 9 (-10%)
Mutual labels:  jupyter-notebook
Changepoint Detection
Online Change-point Detection Algorithm for Multi-Variate Data: Applications on Human/Robot Demonstrations.
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Awesome Ai Books
A collection of awesome AI-related books and PDFs for learning and downloading, along with some playground models for learning
Stars: ✭ 855 (+8450%)
Mutual labels:  jupyter-notebook
Meetupcityfinder
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Lyrics Lab
CS109 Final Project
Stars: ✭ 9 (-10%)
Mutual labels:  jupyter-notebook
Notes Lsju Machine Learning
Machine learning notes
Stars: ✭ 852 (+8420%)
Mutual labels:  jupyter-notebook
Ml
Various ML mini-projects
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Scipyecosystem
An Introduction to the SciPy Ecosystem presentation
Stars: ✭ 9 (-10%)
Mutual labels:  jupyter-notebook
Movielens Recommender
Course project for a Programming Machine Learning Applications class
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Headline analysis
Analyzing news headlines for fun and profit
Stars: ✭ 9 (-10%)
Mutual labels:  jupyter-notebook
Group meetings
Notes and ideas for MARL group meetings
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Tensorflow Tutorial
Basics of TensorFlow
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Deeplearningcameraapp
Deep Learning Capstone Project. Live camera app that can interpret number strings in real-world images.
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook
Cds Ta Meetup
Competitive Data Science @ Tel Aviv Meetup
Stars: ✭ 10 (+0%)
Mutual labels:  jupyter-notebook

Python and Spark for Data Analysis

These are the IPython notebooks I used for a 4-day training course on Python and Spark for data science, given in December 2015 to a Data Minded client. The audience consisted of experienced data analysts, familiar with technologies like R and SPSS, but who had never used Python and had never worked on a Hadoop cluster.

The content is mildly redacted to remove all references to the actual client, but is otherwise unchanged.

Each day consisted of working through a series of IPython notebooks. Exercises are interspersed throughout. The last notebook of each day contains solutions to that day's exercises.

Objectives

The objectives of the training were to:

  • Learn the fundamentals of Python
  • Learn the fundamentals of its statistical and machine learning packages
  • Learn Apache Spark using Python
  • Learn how to apply these technologies in a live Hadoop cluster

Pre-requisites

Before the start of the course, we required the following software to be installed on students' laptops:

  • Anaconda 2.4.1 64-bit for Windows. The packages in this version of Anaconda included:
    • Python 2.7.11
    • IPython 4.0.1
    • NumPy 1.9.3
    • SciPy 0.16.0
    • Matplotlib 1.5.0
    • Pandas 0.17.1
    • Seaborn 0.6.0
    • Scikit-learn 0.17
  • Apache Spark 1.2.0. This version was chosen to match the client's production cluster, even though the latest release at the time of the course was 1.5.2.
  • JDK 7u79.
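
A quick way to verify that the expected stack is available is to print the installed versions. The sketch below assumes the Anaconda packages listed above are installed; it is not part of the original course material:

```python
# Print the versions of the scientific Python packages listed above
# (a minimal sanity check; works on both Python 2.7 and Python 3).
import numpy, scipy, matplotlib, pandas, seaborn, sklearn

for name, module in [("NumPy", numpy), ("SciPy", scipy),
                     ("Matplotlib", matplotlib), ("Pandas", pandas),
                     ("Seaborn", seaborn), ("scikit-learn", sklearn)]:
    print("{}: {}".format(name, module.__version__))
```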

Syllabus

The four days covered the following content.

Day 0: Fundamentals of Python

This day was intended for people with very limited programming experience and/or no Python experience. Day 0 was optional.

At the end of this day, the students were able to:

  • Start and run Python programs interactively with the Python CLI
  • Use an IDE to write and execute programs, including ones that take command-line arguments
  • Create notebooks locally and on a server
  • Import libraries
  • Store data in variables and understand their scope
  • Know the standard operators
  • Control the flow of a program
  • Perform common string operations such as concatenation, substring extraction, and replacement
  • Choose the appropriate data structure for a task
  • Use functions to structure a program
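
For illustration, here is a short snippet in the spirit of Day 0, combining a function, a dictionary, a loop, and string operations (a hypothetical example, not taken from the course notebooks):

```python
# A Day 0-style example: a function, a dictionary, a loop, and string operations.
def count_words(sentence):
    """Return a dictionary mapping each word to how often it occurs."""
    counts = {}
    for word in sentence.lower().replace(",", "").split():
        counts[word] = counts.get(word, 0) + 1
    return counts

text = "the quick brown fox jumps over the lazy dog, the end"
for word, count in count_words(text).items():
    if count > 1:
        print("{} appears {} times".format(word, count))
```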

Day 1: Statistical and Machine Learning Packages

On Day 1, we discussed several of the powerful statistical and machine learning libraries in Python. It was purposely a very hands-on introduction; we did not dive into the mathematics behind any of the algorithms.

At the end of this day, the students were able to:

  • Import and export data in CSV format
  • Use NumPy/SciPy to perform mathematical computations
  • Slice and dice data
  • Use pandas to wrangle data
  • Plot data and perform exploratory analysis
  • Use scikit-learn
  • Perform regression analysis in Python
  • Perform classification analysis in Python
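
As a flavour of the Day 1 material, a minimal pandas/scikit-learn workflow might look like the sketch below; the file name and column names are hypothetical:

```python
# Day 1-style workflow: load a CSV with pandas, explore it, and fit a
# regression with scikit-learn. File and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("measurements.csv")   # import data from CSV
print(df.describe())                   # quick exploratory summary

X = df[["temperature", "pressure"]]    # feature columns
y = df["yield"]                        # target column

model = LinearRegression().fit(X, y)   # ordinary least-squares fit
print(model.coef_, model.intercept_)
```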

Day 2: Apache Spark and Python

On the second day, we dove into Spark. We focused on the essential parts. After a brief introduction into Spark Core, we explored Spark SQL and Spark MLlib.

At the end of this day, the students were able to:

  • Understand the role of Spark and PySpark in the ecosystem
  • Run Spark locally from a shell
  • Run Spark locally in IPython notebooks
  • Do a word count on an input file
  • Load data in SparkSQL
  • Query data in SparkSQL
  • Use Spark MLlib to perform regression and classification analyses at scale
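
For example, the classic word count from Day 2 looks roughly like this with the RDD API of that era. In the pyspark shell the SparkContext already exists as `sc`; the input path below is hypothetical:

```python
# Word count with the PySpark RDD API (Spark 1.x style).
# Inside the pyspark shell, `sc` already exists; when run as a script,
# create it explicitly as below.
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
counts = (sc.textFile("input.txt")                  # hypothetical input file
            .flatMap(lambda line: line.split())     # split lines into words
            .map(lambda word: (word, 1))            # pair each word with 1
            .reduceByKey(lambda a, b: a + b))       # sum the counts per word

for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print("{}: {}".format(word, count))
```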

Day 3: Python and Apache Spark on a Cluster

On this last day, we set up a small Cloudera Hadoop cluster on AWS and explored how everything we had learned could be run in a cluster environment. The second half of the day was set aside for an open-ended project. Possible projects included:

  1. setting up a machine learning pipeline on data from the UCI Machine Learning Repository;
  2. implementing a machine learning algorithm using Spark Core;
  3. testing to what extent Spark running times scale linearly with data size.

At the end of this day, the students were able to:

  • Run Python scripts on the cluster from a shell and from IPython notebooks
  • Use Spark to read from and write to HDFS
  • Use SparkSQL to read data from and write data to Hive
  • Understand how YARN works
  • Submit Spark jobs to the cluster
  • Use Spark, SparkSQL and Spark MLlib to run algorithms on large-scale data
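
A Day 3-style script might read from and write to HDFS and be submitted to the cluster with spark-submit. The sketch below is illustrative only; the paths, application name, and submit options are hypothetical and depend on the cluster setup:

```python
# wordcount_hdfs.py -- a sketch of a job reading from and writing to HDFS.
# It could be submitted to a YARN cluster with something like:
#   spark-submit --master yarn-client wordcount_hdfs.py
# (option names varied slightly between Spark versions; paths are hypothetical)
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-hdfs")
lines = sc.textFile("hdfs:///data/input/")               # read from HDFS
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/output/wordcount")   # write results to HDFS
sc.stop()
```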