
mfcabrera / hooqu

Licence: other
Hooqu is a library built on top of Pandas-like DataFrames for defining "unit tests for data". It is a spiritual port of Apache Deequ to Python.

Programming Languages

python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to hooqu

datatile
A library for managing, validating, summarizing, and visualizing data.
Stars: ✭ 419 (+2364.71%)
Mutual labels:  data-quality-checks, data-quality
NBi
NBi is a testing framework (an add-on to NUnit) for Business Intelligence and Data Access. The main goal of this framework is to let users create tests with a declarative approach based on an XML syntax. By means of NBi, you don't need to develop C# or Java code to specify your tests! Likewise, you don't need Visual Studio or Eclipse to compile y…
Stars: ✭ 102 (+500%)
Mutual labels:  data-quality-checks, data-quality
Data-Quality-Analysis
The PEDSnet Data Quality Assessment Toolkit (OMOP CDM)
Stars: ✭ 19 (+11.76%)
Mutual labels:  data-quality-checks, data-quality
re-data
re_data - fix data issues before your users & CEO discover them 😊
Stars: ✭ 955 (+5517.65%)
Mutual labels:  data-quality-checks, data-quality
Django-Data-quality-system
A data governance and data quality checking/monitoring platform (Django + jQuery + MySQL)
Stars: ✭ 143 (+741.18%)
Mutual labels:  data-quality-checks, data-quality
Applied Ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Stars: ✭ 17,824 (+104747.06%)
Mutual labels:  data-quality
penguin-datalayer-collect
A data layer quality monitoring and validation module, this solution is part of the Raft Suite ecosystem.
Stars: ✭ 19 (+11.76%)
Mutual labels:  data-quality
Great expectations
Always know what to expect from your data.
Stars: ✭ 5,808 (+34064.71%)
Mutual labels:  data-quality
DataQualityDashboard
A tool to help improve data quality standards in observational data science.
Stars: ✭ 62 (+264.71%)
Mutual labels:  data-quality
versatile-data-kit
Versatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.
Stars: ✭ 144 (+747.06%)
Mutual labels:  data-quality
soda-spark
Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes
Stars: ✭ 58 (+241.18%)
Mutual labels:  data-quality
leila
A library for assessing data quality and interacting with the datos.gov.co data portal
Stars: ✭ 56 (+229.41%)
Mutual labels:  data-quality
ohsome-quality-analyst
Data quality estimations for OpenStreetMap
Stars: ✭ 28 (+64.71%)
Mutual labels:  data-quality
dqlab-career-track
A collection of scripts written to complete DQLab Data Analyst Career Track 📊
Stars: ✭ 53 (+211.76%)
Mutual labels:  data-quality
Pandas Profiling
Create HTML profiling reports from pandas DataFrame objects
Stars: ✭ 8,329 (+48894.12%)
Mutual labels:  data-quality
osm-data-classification
Migrated to: https://gitlab.com/Oslandia/osm-data-classification
Stars: ✭ 23 (+35.29%)
Mutual labels:  data-quality
qamd
QAMyData, a data quality assurance tool for SPSS, STATA, SAS and CSV files.
Stars: ✭ 16 (-5.88%)
Mutual labels:  data-quality
contessa
Easy way to define, execute and store quality rules for your data.
Stars: ✭ 17 (+0%)
Mutual labels:  data-quality
check-engine
Data validation library for PySpark 3.0.0
Stars: ✭ 29 (+70.59%)
Mutual labels:  data-quality
great expectations action
A GitHub Action that makes it easy to use Great Expectations to validate your data pipelines in your CI workflows.
Stars: ✭ 66 (+288.24%)
Mutual labels:  data-quality

Hooqu - Unit Tests for Data

Badges: Travis CI build status · Documentation Status · Updates

Documentation: https://hooqu.readthedocs.io

Source Code: https://github.com/mfcabrera/hooqu


Hooqu is a library built on top of Pandas dataframes for defining "unit tests for data", which measure data quality in datasets.

Hooqu is a "spiritual" Python port of Apache Deequ and is currently in an experimental state. I am happy to receive feedback and contributions.

The main motivation of Hooqu is to enable data science projects to assess the quality of their input/output data using an API similar to the one found in Deequ, allowing different teams to share the same vocabulary of checks.

Install

Hooqu requires Pandas >= 1.0 and Python >= 3.7. To install via pip use:

pip install hooqu
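
To quickly verify the installation, importing the package is enough. A minimal sanity check follows; the __version__ attribute is an assumption and may not be exposed, hence the getattr fallback:

import hooqu

# A successful import already confirms the installation; __version__ is an assumption.
print(getattr(hooqu, "__version__", "hooqu imported successfully"))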

Quick Start

import pandas as pd

# data to validate
df = pd.DataFrame(
       [
           (1, "Thingy A", "awesome thing.", "high", 0),
           (2, "Thingy B", "available at http://thingb.com", None, 0),
           (3, None, None, "low", 5),
           (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
           (5, "Thingy E", None, "high", 12),
       ],
       columns=["id", "productName", "description", "priority", "numViews"]
)

Checks we want to perform:

  • there are 5 rows in total
  • values of the id attribute are never Null/None and unique
  • values of the productName attribute are never null/None
  • the priority attribute can only contain "high" or "low" as values
  • numViews should not contain negative values
  • at least half of the values in description should contain a url
  • the median of numViews should be less than or equal to 10

In code this looks as follows:

from hooqu.checks import Check, CheckLevel, CheckStatus
from hooqu.verification_suite import VerificationSuite
from hooqu.constraints import ConstraintStatus


verification_result = (
      VerificationSuite()
      .on_data(df)
      .add_check(
          Check(CheckLevel.ERROR, "Basic Check")
          .has_size(lambda sz: sz == 5)  # we expect 5 rows
          .is_complete("id")  # should never be None/Null
          .is_unique("id")  # should not contain duplicates
          .is_complete("productName")  # should never be None/Null
          .is_contained_in("priority", ("high", "low"))
          .is_non_negative("numViews")
          # at least half of the descriptions should contain a url
          .contains_url("description", lambda d: d >= 0.5)
          # half of the items should have less than 10 views
          .has_quantile("numViews", 0.5, lambda v: v <= 10)
      )
      .run()
)

After calling run, hooqu will compute some metrics on the data. Afterwards it invokes your assertion functions (e.g., lambda sz: sz == 5 for the size check) on these metrics to see if the constraints hold on the data.

We can inspect the VerificationResult to see if the test found errors:

if verification_result.status == CheckStatus.SUCCESS:
      print("Alles klar: The data passed the test, everything is fine!")
else:
      print("We found errors in the data")

for check_result in verification_result.check_results.values():
      for cr in check_result.constraint_results:
          if cr.status != ConstraintStatus.SUCCESS:
              print(f"{cr.constraint}: {cr.message}")

If we run the example, we get the following output:

We found errors in the data
CompletenessConstraint(Completeness(productName)): Value 0.8 does not meet the constraint requirement.
PatternMatchConstraint(containsURL(description)): Value 0.4 does not meet the constraint requirement.

The test found that our assumptions are violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null, and only 2 out of 5 (40%) of the values of the description attribute contain a URL. Fortunately, we ran a test and found the errors; now somebody should immediately fix the data :)
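
In a batch pipeline you will typically want a failing check to stop the run. Below is a minimal sketch that reuses only the objects shown above; the validate_or_fail helper is hypothetical and not part of hooqu:

def validate_or_fail(verification_result):
    # Raise if any constraint did not pass, listing the failing constraints.
    if verification_result.status == CheckStatus.SUCCESS:
        return
    failures = [
        f"{cr.constraint}: {cr.message}"
        for check_result in verification_result.check_results.values()
        for cr in check_result.constraint_results
        if cr.status != ConstraintStatus.SUCCESS
    ]
    raise ValueError("Data quality checks failed:\n" + "\n".join(failures))

validate_or_fail(verification_result)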

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. Please use GitHub issues for bug reports, feature requests, installation issues, RFCs, thoughts, etc.

See the full contributing guide for more information.

Why Hooqu?

  • An easy-to-use declarative API for adding data verification steps to your data processing pipeline.
  • The VerificationResult tells you not only which checks failed but also the values of the computed metrics, allowing for flexible handling of data issues (see the sketch after this list).
  • Incremental metric computation will allow comparing how quality metrics change over time (planned).
  • Support for storing and loading computed metrics (planned).
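
The snippet below sketches how such flexible handling might look. The metric attribute on each constraint result is an assumption borrowed from Deequ's API and may be named differently in hooqu; check the API documentation before relying on it:

# Hypothetical: inspect the computed metric behind each constraint result.
# The metric attribute mirrors Deequ's ConstraintResult and is an assumption here;
# getattr guards against its absence.
for check_result in verification_result.check_results.values():
    for cr in check_result.constraint_results:
        print(cr.constraint, cr.status, getattr(cr, "metric", None))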

References

This project is a "spiritual" port of Apache Deequ and thus tries to implement the declarative API described in the paper "Automating Large-Scale Data Quality Verification" while remaining as Pythonic as possible. This project does not use (Py)Spark but rather Pandas (and hopefully in the future it will support other compatible dataframe implementations).

Name

Jukumari (pronounced hooqumari) is the Aymara name for the spectacled bear (Tremarctos ornatus), also known as the Andean bear, Andean short-faced bear, or mountain bear.
