
SauceCat / Pydqc

License: MIT
Python automatic data quality check toolkit

Projects that are alternatives to or similar to Pydqc

Wassdistance
Approximating Wasserstein distances with PyTorch
Stars: ✭ 229 (-1.72%)
Mutual labels:  jupyter-notebook
Nn
🧑‍🏫 50! Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, ...), gans(cyclegan, stylegan2, ...), 🎮 reinforcement learning (ppo, dqn), capsnet, distillation, ... 🧠
Stars: ✭ 5,720 (+2354.94%)
Mutual labels:  jupyter-notebook
Pyhessian
PyHessian is a PyTorch library for second-order-based analysis and training of neural networks
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook
Pystacknet
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook
Mxnet The Straight Dope
An interactive book on deep learning. Much easy, so MXNet. Wow. [Straight Dope is growing up] ---> Much of this content has been incorporated into the new Dive into Deep Learning Book available at https://d2l.ai/.
Stars: ✭ 2,551 (+994.85%)
Mutual labels:  jupyter-notebook
Smt
Surrogate Modeling Toolbox
Stars: ✭ 233 (+0%)
Mutual labels:  jupyter-notebook
Aotodata
Data analysis, web scrapers, and source data used in articles by Zhu Xiaowu (朱小五)
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook
Web scraping with python
A hands-on introduction to web scraping and data analysis with Python
Stars: ✭ 234 (+0.43%)
Mutual labels:  jupyter-notebook
Tensorflow 101
TensorFlow Tutorials
Stars: ✭ 2,565 (+1000.86%)
Mutual labels:  jupyter-notebook
My tech resources
List of tech resources future me and other Javascript/Ruby/Python/Elixir/Elm developers might find useful
Stars: ✭ 233 (+0%)
Mutual labels:  jupyter-notebook
Mattnet
MAttNet: Modular Attention Network for Referring Expression Comprehension
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook
Relevant Search Book
Code and Examples for Relevant Search
Stars: ✭ 231 (-0.86%)
Mutual labels:  jupyter-notebook
Pandas Highcharts
Beautiful charting of pandas.DataFrame with Highcharts
Stars: ✭ 233 (+0%)
Mutual labels:  jupyter-notebook
Statannot
Add statistical annotations (p-value significance) to an existing boxplot generated by seaborn's boxplot
Stars: ✭ 228 (-2.15%)
Mutual labels:  jupyter-notebook
Socceraction
Convert existing soccer event stream data to SPADL and value player actions
Stars: ✭ 234 (+0.43%)
Mutual labels:  jupyter-notebook
Learn Statistical Learning Method
Implementation of the algorithms from Statistical Learning Methods, Second Edition (《统计学习方法》第二版).
Stars: ✭ 228 (-2.15%)
Mutual labels:  jupyter-notebook
Datasets
source{d} datasets ("big code") for source code analysis and machine learning on source code
Stars: ✭ 231 (-0.86%)
Mutual labels:  jupyter-notebook
Datavisualization
Tutorials on visualizing data using python packages like bokeh, plotly, seaborn and igraph
Stars: ✭ 234 (+0.43%)
Mutual labels:  jupyter-notebook
Jupyterwith
declarative and reproducible Jupyter environments - powered by Nix
Stars: ✭ 235 (+0.86%)
Mutual labels:  jupyter-notebook
Whilemymcmcgentlysamples
my blog
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook

pydqc


Python automatic data quality check toolkit. It aims to relieve the pain of writing tedious code for general data understanding by:

  • Automatically generating a data summary report that contains useful statistical information for each column in a data table (useful for general data understanding).
  • Automatically summarizing the statistical differences between two data tables (useful for comparing a training set with a test set, comparing the same data table from two different snapshot dates, etc.).
  • It still needs some human help for inferring data types, though. 🙈

Motivation

"Today I don't feel like doing anything about data quality check, I just wanna lay in my bed. Don't feel like writing any tedious codes. So build a tool runs on its own." 🎤 🎵 🎶
-modified The Lazy Song

Install pydqc

  • install py2nb
  • install the dependencies: pip install -r requirements.txt
  • install pydqc:
git clone https://github.com/SauceCat/pydqc.git
cd pydqc
python setup.py install
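
As a quick sanity check, the modules used in the steps below should then import cleanly (a minimal sketch; the data_compare and data_consist module paths are inferred from the function names in this README rather than stated explicitly):

```python
# Verify the installation by importing the pydqc modules used below.
from pydqc import infer_schema, data_summary, data_compare, data_consist

print("pydqc imported successfully")
```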

How does it work?

For an input data table (a pandas DataFrame):

Step 1: data schema

  • function: pydqc.infer_schema.infer_schema(data, fname, output_root='', sample_size=1.0, type_threshold=0.5, n_jobs=1, base_schema=None)
    Infers a data type for each column. pydqc recognizes four basic data types: 'key', 'str', 'date', and 'numeric'.

    • 'key': a column that has no concrete meaning itself, but acts as a 'key' to link with other tables
    • 'str': categorical column
    • 'date': datetime column
    • 'numeric': numeric column

    After inference, an Excel file named 'data_schema_XXX.xlsx' (where XXX is the 'fname' parameter) is generated. Check the generated file and modify the 'type' column where necessary (i.e., where the infer_schema function makes mistakes). This is easy because the modification is done by selecting from a drop-down list.

    You can also modify the 'include' column to exclude some features from further checking.

    In this version, pydqc is not able to infer the 'key' type, so that type always needs human modification.
    After any necessary modification, it is best to save the schema as 'data_schema_XXX_mdf.xlsx', or under some other name different from the original, as in the sketch below.
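
A minimal sketch of this step (the CSV file name is hypothetical; the signature is as listed above):

```python
import pandas as pd
from pydqc import infer_schema

# Load the table to check (hypothetical example file).
data = pd.read_csv("train.csv")

# Infer a data type for each column; this writes 'data_schema_train.xlsx'
# to the current directory for human review.
infer_schema.infer_schema(data, fname="train", output_root="",
                          sample_size=1.0, type_threshold=0.5,
                          n_jobs=1, base_schema=None)
```

After reviewing the generated file, save the corrected schema (e.g., as 'data_schema_train_mdf.xlsx') before moving on to Step 2.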

Step 2 (option 1): data summary

  • function: pydqc.data_summary.data_summary(table_schema, table, fname, sample_size=1.0, sample_rows=100, output_root='', n_jobs=1)
    Summarizes basic statistical information for each column based on the provided data type.

    • 'key' and 'str': sample values, rate of NaN values, number of unique values, top 10 value counts.
      Example output: see the screenshots under test/output in the repository.

    The only difference between the summaries for 'key' and 'str' is that pydqc does not sample 'key' columns (see the 'sample_size' parameter).

    • 'date': sample values, rate of NaN values, number of unique values, minimum/mean/median/maximum numeric value, minimum and maximum date value, distribution graph for the numeric values.
      Example output: see test/output.

    The numeric value for a 'date' column is calculated as the time difference, in months, between the date value and today.

    • 'numeric': sample values, rate of NaN values, number of unique values, minimum/mean/median/maximum value, distribution graph (switched to log10 scale automatically when the absolute maximum value is larger than 1M).
      Example output: see test/output.

    You can also turn the whole data summary process into a Jupyter notebook with the function data_summary_notebook(). A usage sketch of data_summary follows.
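
A minimal sketch of this step (file names are hypothetical; loading the corrected schema back with pandas.read_excel is an assumption, since the README does not say how the schema file is read):

```python
import pandas as pd
from pydqc import data_summary

# The table to summarize and the human-corrected schema from Step 1
# (hypothetical file names).
table = pd.read_csv("train.csv")
table_schema = pd.read_excel("data_schema_train_mdf.xlsx")

# Write a per-column statistical summary report.
data_summary.data_summary(table_schema, table, fname="train",
                          sample_size=1.0, sample_rows=100,
                          output_root="", n_jobs=1)
```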

Step 2 (option 2): data compare

  • function: data_compare(table1, table2, schema1, schema2, fname, sample_size=1.0, output_root='', n_jobs=1)
    Compares the statistical characteristics of the same columns between two different tables, based on the provided data types. (This can be useful for comparing a training set with a test set, or the same table from two different snapshot dates.) A usage sketch follows the list below.

    • 'key': compares sample values, rate of NaN values, number of unique values, size of the intersection set, size of the set only in table1, size of the set only in table2; includes a Venn diagram.
      Example output: see test/output.

    • 'str': compares sample values, rate of NaN values, number of unique values, size of the intersection set, size of the set only in table1, size of the set only in table2, top 10 value counts.
      Example output: see test/output.

    • 'date': compares sample values, rate of NaN values, number of unique values, minimum/mean/median/maximum numeric value, minimum and maximum date value, distribution graph for the numeric values.
      Example output: see test/output.

    As in data summary, the numeric value for a 'date' column is calculated as the time difference, in months, between the date value and today.

    • 'numeric': compares sample values, rate of NaN values, number of unique values, minimum/mean/median/maximum value, distribution graph.
      Example output: see test/output.

    You can also turn the whole data compare process into a Jupyter notebook with the function data_compare_notebook().

    • example output for data compare: data compare report
      Inside the Excel report there is a worksheet called 'summary'. It summarizes basic information about the comparison result, including a 'corr' field that indicates the correlation of the same column between the two tables:

      • key: 'corr' = rate of overlap
      • str: 'corr' = Spearman rank-order correlation coefficient between the not-NaN value counts
      • numeric and date: 'corr' = Spearman rank-order correlation coefficient between the not-NaN value counts (when the number of unique values is small) or between the not-NaN value distributions (using a 100-bin histogram)
    • example output for data compare notebook: data compare notebook
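
A minimal sketch of this step (file names are hypothetical; the module path pydqc.data_compare is inferred from the function name, and reading the schemas back with pandas.read_excel is an assumption):

```python
import pandas as pd
from pydqc import data_compare

# Two tables to compare, e.g., a training set and a test set.
table1 = pd.read_csv("train.csv")
table2 = pd.read_csv("test.csv")

# The human-corrected schemas from Step 1 for each table.
schema1 = pd.read_excel("data_schema_train_mdf.xlsx")
schema2 = pd.read_excel("data_schema_test_mdf.xlsx")

# Write a report comparing the shared columns; its 'summary' worksheet
# contains the per-column 'corr' field described above.
data_compare.data_compare(table1, table2, schema1, schema2,
                          fname="train_vs_test",
                          sample_size=1.0, output_root="", n_jobs=1)
```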

Step 2 (option 3): data consist

  • function: data_consist(table1, table2, key1, key2, schema1, schema2, fname, sample_size=1.0, output_root='', keep_images=False, n_jobs=1)
    Checks the consistency of the same columns between two different tables by merging the tables on the provided keys. (This can be useful for comparing a training set with a test set, or the same table from two different snapshot dates.) A usage sketch follows the list below.

    • 'key': same as data_compare for the 'key' type

    • 'str': checks whether the two values for the same key are identical between the two tables.
      Example output: see test/output.

    • 'numeric': calculates the Spearman rank-order correlation coefficient between the values for the same key in the two tables, as well as the minimum, mean, median, and maximum difference rate between the two values.
      Example output: see test/output.

    You can also turn the whole data consist process into a Jupyter notebook with the function data_consist_notebook().

    • example output for data consist: data consist report
      Inside the Excel report there is a worksheet called 'summary'. It summarizes basic information about the consistency-checking result, including a 'corr' field that indicates the correlation of the same column between the two tables:
      • key: 'corr' = rate of overlap
      • str: 'corr' = rate of consistency
      • numeric and date: 'corr' = Spearman rank-order correlation coefficient between not-NaN value pairs
    • example output for data consist notebook: data consist notebook
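
A minimal sketch of this step (file and key-column names are hypothetical; the module path pydqc.data_consist is inferred from the function name, and reading the schemas back with pandas.read_excel is an assumption):

```python
import pandas as pd
from pydqc import data_consist

# Two snapshots of the same table, to be merged on a shared key column.
table1 = pd.read_csv("snapshot_2018_01.csv")
table2 = pd.read_csv("snapshot_2018_02.csv")

# The human-corrected schemas from Step 1 for each snapshot.
schema1 = pd.read_excel("data_schema_snap1_mdf.xlsx")
schema2 = pd.read_excel("data_schema_snap2_mdf.xlsx")

# Merge on the key columns and check per-column value consistency;
# the report's 'summary' worksheet holds the 'corr' field described above.
data_consist.data_consist(table1, table2,
                          key1="customer_id", key2="customer_id",
                          schema1=schema1, schema2=schema2,
                          fname="snap1_vs_snap2",
                          sample_size=1.0, output_root="",
                          keep_images=False, n_jobs=1)
```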

Documentation

For details about the ideas behind pydqc, please refer to Introducing pydqc.
For descriptions of the functions and their parameters, please refer to pydqc functions and parameters.
For tests and demos, please refer to https://github.com/SauceCat/pydqc/tree/master/test.
For example outputs, please refer to https://github.com/SauceCat/pydqc/tree/master/test/output.

Contribution

If you have other ideas for general automatic data quality checks, pull requests are always welcome! 🙋

To-dos

  • [ ] test with python3
  • [ ] unit test
  • [ ] documentation with sphinx