ydataai / ydata-quality

Licence: MIT license
Data Quality assessment with one line of code

YData Quality

ydata_quality is an open-source Python library for assessing Data Quality throughout the multiple stages of data pipeline development.

A holistic view of the data can only be captured by looking at it from multiple dimensions. ydata_quality evaluates each dimension in a modular way and wraps the modules into a single Data Quality engine. This repository contains the core Python source scripts and walkthrough tutorials.

Quickstart

The source code is currently hosted on GitHub at: https://github.com/ydataai/ydata-quality

Binary installers for the latest released version are available at the Python Package Index (PyPI).

pip install ydata-quality

Comprehensive quality check in a few lines of code

from ydata_quality import DataQuality
import pandas as pd

# load the data
df = pd.read_csv('./datasets/transformed/census_10k.csv')

# create a DataQuality object from the main class that holds all quality modules
dq = DataQuality(df=df)

# run the tests and output a summary of the quality tests
results = dq.evaluate()

Warnings:
	TOTAL: 5 warning(s)
	Priority 1: 1 warning(s)
	Priority 2: 4 warning(s)

Priority 1 - heavy impact expected:
	* [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
	* [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 3 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove the highest VIF variables.
	* [ERRONEOUS DATA - PREDEFINED ERRONEOUS DATA] Found 1960 ED values in the dataset.
	* [DATA RELATIONS - HIGH COLLINEARITY - CATEGORICAL] Found 10 categorical variables with significant collinearity (p-value < 0.05). The variables listed in results are highly collinear with other variables in the dataset and sorted descending according to propensity. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove variables following the provided order.
	* [DUPLICATES - EXACT DUPLICATES] Found 3 instances with exact duplicate feature values.
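
To make the duplicate and collinearity warnings above more concrete, here is a minimal sketch of how such checks can be computed with plain pandas and NumPy. It is not ydata_quality's implementation: the helper names `find_duplicate_columns` and `variance_inflation_factors` and the toy data are assumptions made for illustration, and only the VIF threshold of 5.0 is taken from the summary.

```python
import numpy as np
import pandas as pd


def find_duplicate_columns(df: pd.DataFrame) -> list:
    """Return columns whose values exactly match an earlier column (hypothetical helper)."""
    duplicates = []
    columns = list(df.columns)
    for i, col in enumerate(columns):
        # Series.equals compares values and dtype; the column name is ignored
        if any(df[col].equals(df[earlier]) for earlier in columns[:i]):
            duplicates.append(col)
    return duplicates


def variance_inflation_factors(df: pd.DataFrame) -> pd.Series:
    """VIF per numerical column, read off the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)


# Illustrative data made up for this sketch (not the census dataset)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
toy = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.1 * rng.normal(size=500),  # nearly a copy of x1
    "x3": rng.normal(size=500),             # unrelated to the others
})
toy["x1_copy"] = toy["x1"]  # exact duplicate column
toy = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)  # one exact duplicate row

duplicate_columns = find_duplicate_columns(toy)
duplicate_row_count = int(toy.duplicated().sum())

# Drop exact duplicate columns first, otherwise the correlation matrix is singular
vif = variance_inflation_factors(toy.drop(columns=duplicate_columns))
high_vif = vif[vif > 5.0].index.tolist()

print("Duplicate columns:", duplicate_columns)
print("Exact duplicate rows:", duplicate_row_count)
print("Columns with VIF > 5.0:", high_vif)
```

The VIF of a column equals 1 / (1 - R²), where R² comes from regressing that column on the remaining ones; the inverse correlation matrix gives the same values without fitting each regression separately.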

On top of the summary, you can retrieve a list of detected warnings for detailed inspection.

# retrieve a list of data quality warnings 
warnings = dq.get_warnings()
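
One way to inspect that list is to group the warnings by priority. The sketch below assumes each warning object exposes `priority`, `category`, `test` and `description` attributes. That assumption mirrors the summary format above but is not confirmed by this README, so check the library before relying on it. `StubWarning` and `group_by_priority` are hypothetical names, and the stub lets the example run without the library installed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StubWarning:
    """Hypothetical stand-in for a warning object; the real class may differ."""
    category: str
    test: str
    priority: int
    description: str


def group_by_priority(items):
    """Map each priority level to its warnings, lowest number (highest impact) first."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.priority].append(item)
    return dict(sorted(grouped.items()))


# Stub data copied from the summary above; with the library installed,
# the list would come from dq.get_warnings() instead.
detected = [
    StubWarning("Duplicates", "Exact Duplicates", 2,
                "Found 3 instances with exact duplicate feature values."),
    StubWarning("Duplicates", "Duplicate Columns", 1,
                "Found 1 columns with exactly the same feature values as other columns."),
]

for priority, items in group_by_priority(detected).items():
    for item in items:
        print(f"Priority {priority}: [{item.category} - {item.test}] {item.description}")
```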

Examples

Here you can find walkthrough tutorials and examples to familiarize yourself with the different modules of ydata_quality.

To dive into a specific module and understand how it works, see the tutorial notebooks:

  1. Bias and Fairness
  2. Data Expectations
  3. Data Relations
  4. Drift Analysis
  5. Duplicates
  6. Labelling: Categoricals and Numericals
  7. Missings
  8. Erroneous Data

Contributing

We are open to collaboration! If you want to start contributing you only need to:

  1. Search for an issue that you would like to work on. Issues for newcomers are labeled "good first issue".
  2. Create a PR solving the issue.
  3. We will review every PR and either accept it or ask for revisions.

You can also join the discussions on the #data-quality channel on our Slack and request features/bug fixes by opening issues on our repository.

Support

For support in using this library, please join the #help Slack channel. The Slack community is friendly and quick to answer questions about using and developing the library.

License

GNU General Public License v3.0

About

With ♥️ from the YData Development team
