
UrbanInstitute / Pyspark Tutorials

Licence: gpl-3.0
Code snippets and tutorials for working with social science data in PySpark

pyspark-tutorials

Code snippets and tutorials for working with social science data in PySpark. Each .ipynb file can be downloaded and its code blocks run or modified in a Jupyter (formerly IPython) notebook, or viewed in your browser as rendered markdown by clicking on it.

Spark Social Science Manual

The tutorials in this repository are geared towards social scientists and policy researchers who want to undertake research using "big data" sets. A manual to accompany these tutorials is linked below. The manual gives social scientists a brief overview of the distributed computing solution developed by The Urban Institute's Research Programming Team, and of the changes this computing environment requires in how researchers manage and analyze data.

Spark Social Science Manual

  1. If you're new to Python entirely, consider trying an intro tutorial first. Python is a language that stresses readability of code, so it won't be too difficult to dive right in. This is one good interactive tutorial.

  2. After that, or if you're already comfortable with Python basics, get started with pySpark with these two lessons. They assume you know what Python code looks like and broadly how it works, and they cover what you will need to know to follow the other lessons.

    Basics 1

    • Reading and writing data on S3
    • Handling column data types
    • Basic data exploration and describing
    • Renaming columns

    Basics 2

    • How pySpark processes commands - lazy computing
    • Persisting and unpersisting
    • Timing operations
  3. Basic data tasks are covered in the following guides. Note that these are not intended to be comprehensive! They cover the most common tasks; for others you may need to consult the documentation or experiment. Hopefully this framework gives you enough to get started.

    Merging Data

    • Using unionAll (an alias of union since Spark 2.0) to stack rows; columns are matched by position, not by name
    • Using join to merge columns by matching specific row values

    Missing Values

    • Handling null values on loading
    • Counting null values
    • Dropping null values
    • Replacing null values

    Moving Average Imputation

    • Using pySpark window functions
    • Calculating a moving average
    • Imputing missing values

    Pivoting/Reshaping

    • Using groupBy to organize data
    • Pivoting data with an aggregation
    • Reshaping from long to wide without aggregation

    Resampling

    • Upsampling data based on a date column
    • Using datetime objects

    Subsetting

    • Filtering data based on criteria
    • Taking a randomized sample

    Summary Statistics

    • Using describe
    • Adding additional aggregations to describe output

    Graphing

    • Aggregating to use Matplotlib and Pandas
  4. The pySpark bootstrap used by the Urban Institute to start a cluster on Amazon Web Services only installs a handful of Python modules. If you need others for your work, or specific versions, this tutorial explains how to get them. It uses only standard Python libraries, and is therefore not specific to the pySpark environment:

    Installing Python Modules

    • Using the pip module for Python packages
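One common way to call pip from Python using only the standard library is sketched below; the tutorial's exact approach may differ. The helper names pip_install_command and pip_install are hypothetical, as are the package name and version in the example. Running pip through sys.executable targets the interpreter that is executing the code.

```python
import subprocess
import sys


def pip_install_command(package, version=None):
    """Build the pip command for the interpreter running this code."""
    requirement = f"{package}=={version}" if version else package
    return [sys.executable, "-m", "pip", "install", requirement]


def pip_install(package, version=None):
    """Run pip and raise CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(package, version))


# Show the command without running it; "some_package" is a placeholder.
print(pip_install_command("some_package", "1.2.3"))
```

Calling pip_install("some_package", "1.2.3") would then perform the install on the machine where the code runs.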
  5. And finally, now that Spark 2.0 is deployed to Amazon Web Services, development has begun on OLS and GLM tutorials, which will be uploaded when complete:

    Introduction to GLM
