
SauceCat / Pydqc

License: MIT
Python automatic data quality check toolkit

Projects that are alternatives to or similar to Pydqc

Wassdistance
Approximating Wasserstein distances with PyTorch
Stars: ✭ 229 (-1.72%)
Mutual labels:  jupyter-notebook
Nn
🧑‍🏫 50! Implementations/tutorials of deep learning papers with side-by-side notes 📝; including transformers (original, xl, switch, feedback, vit, ...), optimizers (adam, adabelief, ...), gans(cyclegan, stylegan2, ...), 🎮 reinforcement learning (ppo, dqn), capsnet, distillation, ... 🧠
Stars: ✭ 5,720 (+2354.94%)
Mutual labels:  jupyter-notebook
Pyhessian
PyHessian is a PyTorch library for second-order-based analysis and training of neural networks
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook
Pystacknet
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook
Mxnet The Straight Dope
An interactive book on deep learning. Much easy, so MXNet. Wow. [Straight Dope is growing up] ---> Much of this content has been incorporated into the new Dive into Deep Learning Book available at https://d2l.ai/.
Stars: ✭ 2,551 (+994.85%)
Mutual labels:  jupyter-notebook
Smt
Surrogate Modeling Toolbox
Stars: ✭ 233 (+0%)
Mutual labels:  jupyter-notebook
Aotodata
Data analysis, web scrapers, and source data used in articles by Zhu Xiaowu (朱小五)
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook
Web scraping with python
A hands-on introduction to web scraping and data analysis with Python
Stars: ✭ 234 (+0.43%)
Mutual labels:  jupyter-notebook
Tensorflow 101
TensorFlow Tutorials
Stars: ✭ 2,565 (+1000.86%)
Mutual labels:  jupyter-notebook
My tech resources
List of tech resources future me and other Javascript/Ruby/Python/Elixir/Elm developers might find useful
Stars: ✭ 233 (+0%)
Mutual labels:  jupyter-notebook
Mattnet
MAttNet: Modular Attention Network for Referring Expression Comprehension
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook
Relevant Search Book
Code and Examples for Relevant Search
Stars: ✭ 231 (-0.86%)
Mutual labels:  jupyter-notebook
Pandas Highcharts
Beautiful charting of pandas.DataFrame with Highcharts
Stars: ✭ 233 (+0%)
Mutual labels:  jupyter-notebook
Statannot
Add statistical annotations (p-value significance) to an existing boxplot generated by seaborn's boxplot
Stars: ✭ 228 (-2.15%)
Mutual labels:  jupyter-notebook
Socceraction
Convert existing soccer event stream data to SPADL and value player actions
Stars: ✭ 234 (+0.43%)
Mutual labels:  jupyter-notebook
Learn Statistical Learning Method
Implementation of the algorithms from Statistical Learning Methods, Second Edition (《统计学习方法》第二版).
Stars: ✭ 228 (-2.15%)
Mutual labels:  jupyter-notebook
Datasets
source{d} datasets ("big code") for source code analysis and machine learning on source code
Stars: ✭ 231 (-0.86%)
Mutual labels:  jupyter-notebook
Datavisualization
Tutorials on visualizing data using python packages like bokeh, plotly, seaborn and igraph
Stars: ✭ 234 (+0.43%)
Mutual labels:  jupyter-notebook
Jupyterwith
declarative and reproducible Jupyter environments - powered by Nix
Stars: ✭ 235 (+0.86%)
Mutual labels:  jupyter-notebook
Whilemymcmcgentlysamples
my blog
Stars: ✭ 232 (-0.43%)
Mutual labels:  jupyter-notebook

pydqc


Python automatic data quality check toolkit. It aims to relieve the pain of writing tedious code for general data understanding by:

  • Automatically generating a data summary report that contains useful statistical information for each column in a data table (useful for general data understanding).
  • Automatically summarizing the statistical differences between two data tables (useful for comparing a training set with a test set, comparing the same data table from two different snapshot dates, etc.).
  • It still needs some human help for inferring data types, though. 🙈

Motivation

"Today I don't feel like doing anything about data quality check, I just wanna lay in my bed. Don't feel like writing any tedious codes. So build a tool runs on its own." 🎤 🎵 🎶
-modified The Lazy Song

Install pydqc

  • install py2nb
  • install the dependencies: pip install -r requirements.txt
  • install pydqc:
git clone https://github.com/SauceCat/pydqc.git
cd pydqc
python setup.py install
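
As a quick sanity check, the modules used in the steps below should then import cleanly (a minimal sketch; the data_compare and data_consist module paths are inferred from the function names in this README rather than stated explicitly):

```python
# Verify the installation by importing the pydqc modules used below.
from pydqc import infer_schema, data_summary, data_compare, data_consist

print("pydqc imported successfully")
```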

How does it work?

For an input data table (a pandas DataFrame):

Step 1: data schema

  • function: pydqc.infer_schema.infer_schema(data, fname, output_root='', sample_size=1.0, type_threshold=0.5, n_jobs=1, base_schema=None)
    Infers a data type for each column. pydqc recognizes four basic data types: 'key', 'str', 'date', and 'numeric'.

    • 'key': a column that has no concrete meaning itself, but acts as a 'key' to link with other tables
    • 'str': categorical column
    • 'date': datetime column
    • 'numeric': numeric column

    After inference, an Excel file named 'data_schema_XXX.xlsx' (where XXX is the 'fname' parameter) is generated. Check the generated file and modify the 'type' column where necessary (i.e., where the infer_schema function makes mistakes). This is easy because the modification is done by selecting from a drop-down list.

    You can also modify the 'include' column to exclude some features from further checking.

    In this version, pydqc is not able to infer the 'key' type, so that type always needs human modification.
    After any necessary modification, it is best to save the schema as 'data_schema_XXX_mdf.xlsx', or under some other name different from the original, as in the sketch below.
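
A minimal sketch of this step (the CSV file name is hypothetical; the signature is as listed above):

```python
import pandas as pd
from pydqc import infer_schema

# Load the table to check (hypothetical example file).
data = pd.read_csv("train.csv")

# Infer a data type for each column; this writes 'data_schema_train.xlsx'
# to the current directory for human review.
infer_schema.infer_schema(data, fname="train", output_root="",
                          sample_size=1.0, type_threshold=0.5,
                          n_jobs=1, base_schema=None)
```

After reviewing the generated file, save the corrected schema (e.g., as 'data_schema_train_mdf.xlsx') before moving on to Step 2.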

Step 2 (option 1): data summary

  • function: pydqc.data_summary.data_summary(table_schema, table, fname, sample_size=1.0, sample_rows=100, output_root='', n_jobs=1)
    Summarizes basic statistical information for each column based on the provided data type.

    • 'key' and 'str': sample values, rate of NaN values, number of unique values, top 10 value counts.
      Example output: see the screenshots under test/output in the repository.

    The only difference between the summaries for 'key' and 'str' is that pydqc does not sample 'key' columns (see the 'sample_size' parameter).

    • 'date': sample values, rate of NaN values, number of unique values, minimum/mean/median/maximum numeric value, minimum and maximum date value, distribution graph for the numeric values.
      Example output: see test/output.

    The numeric value for a 'date' column is calculated as the time difference, in months, between the date value and today.

    • 'numeric': sample values, rate of NaN values, number of unique values, minimum/mean/median/maximum value, distribution graph (switched to log10 scale automatically when the absolute maximum value is larger than 1M).
      Example output: see test/output.

    You can also turn the whole data summary process into a Jupyter notebook with the function data_summary_notebook(). A usage sketch of data_summary follows.
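
A minimal sketch of this step (file names are hypothetical; loading the corrected schema back with pandas.read_excel is an assumption, since the README does not say how the schema file is read):

```python
import pandas as pd
from pydqc import data_summary

# The table to summarize and the human-corrected schema from Step 1
# (hypothetical file names).
table = pd.read_csv("train.csv")
table_schema = pd.read_excel("data_schema_train_mdf.xlsx")

# Write a per-column statistical summary report.
data_summary.data_summary(table_schema, table, fname="train",
                          sample_size=1.0, sample_rows=100,
                          output_root="", n_jobs=1)
```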

Step 2 (option 2): data compare

  • function: data_compare(table1, table2, schema1, schema2, fname, sample_size=1.0, output_root='', n_jobs=1)
    Compares the statistical characteristics of the same columns between two different tables, based on the provided data types. (This can be useful for comparing a training set with a test set, or the same table from two different snapshot dates.) A usage sketch follows the list below.

    • 'key': compares sample values, rate of NaN values, number of unique values, size of the intersection set, size of the set only in table1, size of the set only in table2; includes a Venn diagram.
      Example output: see test/output.

    • 'str': compares sample values, rate of NaN values, number of unique values, size of the intersection set, size of the set only in table1, size of the set only in table2, top 10 value counts.
      Example output: see test/output.

    • 'date': compares sample values, rate of NaN values, number of unique values, minimum/mean/median/maximum numeric value, minimum and maximum date value, distribution graph for the numeric values.
      Example output: see test/output.

    As in data summary, the numeric value for a 'date' column is calculated as the time difference, in months, between the date value and today.

    • 'numeric': compares sample values, rate of NaN values, number of unique values, minimum/mean/median/maximum value, distribution graph.
      Example output: see test/output.

    You can also turn the whole data compare process into a Jupyter notebook with the function data_compare_notebook().

    • example output for data compare: data compare report
      Inside the Excel report there is a worksheet called 'summary'. It summarizes basic information about the comparison result, including a 'corr' field that indicates the correlation of the same column between the two tables:

      • key: 'corr' = rate of overlap
      • str: 'corr' = Spearman rank-order correlation coefficient between the not-NaN value counts
      • numeric and date: 'corr' = Spearman rank-order correlation coefficient between the not-NaN value counts (when the number of unique values is small) or between the not-NaN value distributions (using a 100-bin histogram)
    • example output for data compare notebook: data compare notebook
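
A minimal sketch of this step (file names are hypothetical; the module path pydqc.data_compare is inferred from the function name, and reading the schemas back with pandas.read_excel is an assumption):

```python
import pandas as pd
from pydqc import data_compare

# Two tables to compare, e.g., a training set and a test set.
table1 = pd.read_csv("train.csv")
table2 = pd.read_csv("test.csv")

# The human-corrected schemas from Step 1 for each table.
schema1 = pd.read_excel("data_schema_train_mdf.xlsx")
schema2 = pd.read_excel("data_schema_test_mdf.xlsx")

# Write a report comparing the shared columns; its 'summary' worksheet
# contains the per-column 'corr' field described above.
data_compare.data_compare(table1, table2, schema1, schema2,
                          fname="train_vs_test",
                          sample_size=1.0, output_root="", n_jobs=1)
```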

Step 2 (option 3): data consist

  • function: data_consist(table1, table2, key1, key2, schema1, schema2, fname, sample_size=1.0, output_root='', keep_images=False, n_jobs=1)
    Checks the consistency of the same columns between two different tables by merging the tables on the provided keys. (This can be useful for comparing a training set with a test set, or the same table from two different snapshot dates.) A usage sketch follows the list below.

    • 'key': same as data_compare for the 'key' type

    • 'str': checks whether the two values for the same key are identical between the two tables.
      Example output: see test/output.

    • 'numeric': calculates the Spearman rank-order correlation coefficient between the values for the same key in the two tables, as well as the minimum, mean, median, and maximum difference rate between the two values.
      Example output: see test/output.

    You can also turn the whole data consist process into a Jupyter notebook with the function data_consist_notebook().

    • example output for data consist: data consist report
      Inside the Excel report there is a worksheet called 'summary'. It summarizes basic information about the consistency-checking result, including a 'corr' field that indicates the correlation of the same column between the two tables:
      • key: 'corr' = rate of overlap
      • str: 'corr' = rate of consistency
      • numeric and date: 'corr' = Spearman rank-order correlation coefficient between not-NaN value pairs
    • example output for data consist notebook: data consist notebook
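
A minimal sketch of this step (file and key-column names are hypothetical; the module path pydqc.data_consist is inferred from the function name, and reading the schemas back with pandas.read_excel is an assumption):

```python
import pandas as pd
from pydqc import data_consist

# Two snapshots of the same table, to be merged on a shared key column.
table1 = pd.read_csv("snapshot_2018_01.csv")
table2 = pd.read_csv("snapshot_2018_02.csv")

# The human-corrected schemas from Step 1 for each snapshot.
schema1 = pd.read_excel("data_schema_snap1_mdf.xlsx")
schema2 = pd.read_excel("data_schema_snap2_mdf.xlsx")

# Merge on the key columns and check per-column value consistency;
# the report's 'summary' worksheet holds the 'corr' field described above.
data_consist.data_consist(table1, table2,
                          key1="customer_id", key2="customer_id",
                          schema1=schema1, schema2=schema2,
                          fname="snap1_vs_snap2",
                          sample_size=1.0, output_root="",
                          keep_images=False, n_jobs=1)
```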

Documentation

For details about the ideas behind pydqc, please refer to Introducing pydqc.
For descriptions of the functions and their parameters, please refer to pydqc functions and parameters.
For tests and demos, please refer to https://github.com/SauceCat/pydqc/tree/master/test.
For example outputs, please refer to https://github.com/SauceCat/pydqc/tree/master/test/output.

Contribution

If you have other ideas for general automatic data quality checks, pull requests are always welcome! 🙋

To-dos

  • [ ] test with python3
  • [ ] unit test
  • [ ] documentation with sphinx