kjam / Data Cleaning 101

Data Cleaning Libraries with Python

Data Cleaning 101

Welcome to the code repository for Practical Data Cleaning with Python! This is a two-day training offered through Safari with O'Reilly Media. You can sign up by searching for the course on Safari.

This course aims to give you a practical overview of data cleaning and validation libraries and methods in Python. Since we only have 6 hours, it can't go deeply into any one library or tool, but I have tried to include tools I have found useful in my work and to incorporate a mixture of the munging and testing I have seen in my own and others' workflows.

If you have a suggestion for another library or additional topic, feel free to drop me a line :)

Installation

These lessons have been tested with Python 3.4 and Python 3.6 and primarily use the latest release of each library, except where versions are pinned. You can likely run most of the code with older releases, but if you run into an issue, try upgrading the library in question first.

pip install -r install_reqs.txt
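If you are unsure whether your interpreter matches the versions mentioned above, a small check like the following can help. This is only a sketch: the function name `describe_python` and its return labels are made up for illustration and are not part of this repository.

```python
import sys

# Versions the lessons were tested with, according to this README.
TESTED_VERSIONS = [(3, 4), (3, 6)]


def describe_python(version_info=sys.version_info):
    """Classify an interpreter version relative to the tested versions.

    Hypothetical helper: returns "tested", "python2", or "untested".
    """
    current = (version_info[0], version_info[1])
    if current in TESTED_VERSIONS:
        return "tested"
    if current[0] < 3:
        return "python2"
    # Any other Python 3 release: may well work, but was not tested here.
    return "untested"


print(describe_python())
```

An "untested" result does not mean the notebooks will fail; it only means that version is not one of the two named above.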

I believe this will also work with Conda, although I am less familiar with Conda so please report issues! (special thanks to @blue_hacker for this fix!)

$ conda create -n dataclean --copy python=3.6
$ source activate dataclean
$ pip install -r install_reqs.txt

In addition, you will need to install sqlite3 or change the day two case study to use a connection string for your database of choice.
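To confirm that SQLite is usable from Python, you can try a quick round trip with the standard library's `sqlite3` module. This is a minimal sketch, not the case study's code: the `readings` table and the `open_connection` helper are hypothetical, and the in-memory database is used only so nothing is written to disk.

```python
import sqlite3


def open_connection(database=":memory:"):
    """Open a SQLite connection; ':memory:' creates a throwaway in-memory database."""
    return sqlite3.connect(database)


conn = open_connection()

# Hypothetical table, used only to show that reads and writes work.
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO readings (value) VALUES (?)", [(1.5,), (2.5,)])
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(row_count)

conn.close()
```

If you switch to another database, the equivalent of the `database` argument would be the connection string for that system, and the driver you use will likely differ from `sqlite3`.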

If you want to visualize graphs using Dask, you will need to install Graphviz, which has special requirements on all platforms. On Linux, it is usually available via the system package manager (apt, yum). On other platforms, you might need to use a special installer. It is also available via conda install graphviz and pip install graphviz, but these might not include all necessary dependencies for your OS. For best results, search for your OS and "install graphviz and dependencies" and follow a recent article on setup.
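One way to see which pieces of a Graphviz setup are present is to look for both the `dot` executable and the `graphviz` Python package. The sketch below assumes that a working setup usually needs both; the `graphviz_status` helper is hypothetical and only reports what it finds.

```python
import importlib.util
import shutil


def graphviz_status():
    """Report whether the Graphviz `dot` binary and the `graphviz` Python package are found."""
    return {
        # The system-level Graphviz install normally provides the `dot` command.
        "dot_executable": shutil.which("dot") is not None,
        # `pip install graphviz` provides the Python bindings only.
        "python_package": importlib.util.find_spec("graphviz") is not None,
    }


status = graphviz_status()
print(status)
```

If `python_package` is true but `dot_executable` is false, the pip or conda install probably did not bring in the system dependencies mentioned above.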

Repository structure

Each day coincides with a particular notebook folder. For day one, we will use cleaning-notebooks. Day two will focus on validation-notebooks. The data folder holds data we will use throughout the course. The queue_example.py file is used in the day two case study.

Python2 v. Python3

This repository has been built with Python 3. If you are using Python 2 and need help porting some logic or finding alternatives, please let me know and I will try to help. :)

Corrections?

If you find any issues in these code examples, feel free to submit an Issue or Pull Request. I appreciate your input!

Questions?

Reach out to @kjam on Twitter or GitHub. @kjam is also often on freenode. :)