
r-shekhar / NYC-Transport

License: BSD-3-Clause
A unified database of NYC transport (subway, taxi/Uber, and Citi Bike) data.


NYC-Transport README

This is a combined repository of all publicly available New York City transit datasets.

  • Taxi and Limousine Commission (TLC) taxi trip data
  • FOIA-requested Uber trip data for portions of 2013-2015
  • Subway turnstile data from the Metropolitan Transportation Authority (MTA)
  • Citi Bike system data
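The subway turnstile data, for example, is published by the MTA as weekly text files. A minimal sketch of how a download script might enumerate them (the URL pattern follows the MTA developer site; treat this as an illustration, not the exact logic in download-subway-data.py):

```python
from datetime import date, timedelta

BASE = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{:%y%m%d}.txt"

def turnstile_urls(start, end):
    """Yield the weekly turnstile file URLs between two dates.

    Files are posted every Saturday, so step forward to the first
    Saturday on or after `start`, then advance seven days at a time.
    """
    d = start + timedelta(days=(5 - start.weekday()) % 7)  # next Saturday
    while d <= end:
        yield BASE.format(d)
        d += timedelta(days=7)

# Example: the five weekly files covering January 2016.
urls = list(turnstile_urls(date(2016, 1, 1), date(2016, 1, 31)))
```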

This repository contains code to download all of the data, clean it by removing corrupted records, and produce a set of pandas dataframes, which are written to Parquet format files using Dask and Fastparquet.
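The cleaning stage can be sketched in pandas as follows. The column names and the NYC bounding box are assumptions for illustration; the actual conversion scripts live in 05_raw_to_dataframe:

```python
import pandas as pd

# Approximate NYC bounding box used to discard GPS noise.
# These bounds are an assumption, not the exact values the scripts use.
LAT = (40.4, 41.1)
LON = (-74.3, -73.6)

def clean_taxi(df):
    """Drop trips with impossible coordinates or non-positive fares/distances."""
    mask = (
        df["pickup_latitude"].between(*LAT)
        & df["pickup_longitude"].between(*LON)
        & (df["fare_amount"] > 0)
        & (df["trip_distance"] > 0)
    )
    return df[mask].reset_index(drop=True)

raw = pd.DataFrame({
    "pickup_latitude":  [40.75, 0.0, 40.65],     # second row: GPS dropout
    "pickup_longitude": [-73.99, 0.0, -73.78],
    "fare_amount":      [12.5, 8.0, -1.0],       # third row: negative fare
    "trip_distance":    [3.1, 2.0, 5.0],
})
clean = clean_taxi(raw)  # only the first row survives
```

In the real pipeline the cleaned frames are then written to Parquet, e.g. via Dask's `to_parquet(..., engine="fastparquet")`.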

These Parquet files are repartitioned on disk with PySpark, and the resulting files are queried with PySpark SQL and Dask to produce data science results in Jupyter notebooks.

Requirements

  • Python 3.4+
  • Beautiful Soup 4
  • Bokeh
  • Dask Distributed
  • FastParquet
  • Geopandas
  • Jupyter
  • Numba 0.29+
  • Palettable
  • PyArrow
  • PySpark 2.0.2+
  • Python-Snappy
  • Scikit-Learn
  • Seaborn

A tutorial on my blog shows how to set up an environment compatible with this analysis on Ubuntu. This tutorial has been tested locally and on Amazon EC2.

If you want to skip obtaining and processing the raw Taxi/Uber data into Parquet format, the processed dataset is available on Academic Torrents here.

Steps

  1. Set up your conda environment with the packages listed above.

    conda install -c conda-forge \
        beautifulsoup4 bokeh distributed fastparquet geopandas \
        jupyter numba palettable pyarrow python-snappy  \
        scikit-learn seaborn
    conda install -c quasiben spark
    
  2. Download the data using the scripts in the 00_download_scripts directory

    • ./make_directories.sh -- Alternatively, you can create a raw_data directory elsewhere and symlink it.
    • python download-subway-data.py (~10 GB)
    • ./download-bike-data.sh (~7 GB)
    • ./download-taxi-data.sh (~250 GB)
    • ./download-uber-data.sh (~5 GB)
    • ./decompress.sh
  3. Convert the data to Parquet format using the scripts in 05_raw_to_dataframe. Times given were measured on a 4 GHz i5-3570K (4 cores) with a fast SSD and 16 GB of memory.

    • Adjust config.json to have correct input and output paths for your system
    • python convert_bike_csv_to_parquet.py (~2 hours)
    • python convert_subway_to_parquet.py (~2 hours)
    • python convert_taxi_to_parquet.py (~32 hours)
  4. Repartition and recompress the Parquet files for efficient access using PySpark in 06_repartition. This is especially useful for later stages, where queries are run on Amazon EC2 with a distributed Spark engine against files on S3.
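For step 3, config.json might look like the fragment below. The key names here are assumptions for illustration; check the scripts in 05_raw_to_dataframe for the keys they actually read:

```json
{
    "raw_data_path": "/data/nyc-transport/raw_data",
    "parquet_output_path": "/data/nyc-transport/parquet"
}
```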

Analysis

Analysis scripts and notebooks live in the 15_dataframe_analysis directory. Some require PySpark 2+; most require only Dask and Jupyter.
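As a flavor of the analysis, here is a trips-per-hour aggregation in pandas. Dask's dataframe API mirrors pandas, so the same groupby works on the full Parquet dataset after `dd.read_parquet(...)`; the column name is an assumption:

```python
import pandas as pd

# Tiny in-memory stand-in for the full taxi dataframe.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2015-06-01 08:15", "2015-06-01 08:45", "2015-06-01 17:30"]
    ),
    "trip_distance": [1.2, 3.4, 2.2],
})

# Count trips by hour of day: two in the 8 AM hour, one in the 5 PM hour.
per_hour = trips.groupby(trips["pickup_datetime"].dt.hour).size()
```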
