
mikulskibartosz / check-engine

License: MIT
Data validation library for PySpark 3.0.0

Programming Languages

Python
139,335 projects - #7 most used programming language
Dockerfile
14,818 projects

Projects that are alternatives to or similar to check-engine

aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+282.76%)
Mutual labels:  big-data, pyspark
Bitcoin Value Predictor
[NOT MAINTAINED] Predicting Bitcoin price using time-series analysis and sentiment analysis of tweets on Bitcoin
Stars: ✭ 91 (+213.79%)
Mutual labels:  big-data, pyspark
mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (+72.41%)
Mutual labels:  big-data, pyspark
MMLSpark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+9896.55%)
Mutual labels:  big-data, pyspark
soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames
Stars: ✭ 58 (+100%)
Mutual labels:  pyspark, data-quality
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+296.55%)
Mutual labels:  big-data, pyspark
PySpark Setup Demo
Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
Stars: ✭ 24 (-17.24%)
Mutual labels:  big-data, pyspark
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (+17.24%)
Mutual labels:  big-data, pyspark
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+644.83%)
Mutual labels:  big-data, pyspark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+417.24%)
Mutual labels:  big-data, pyspark
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+4513.79%)
Mutual labels:  big-data, pyspark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+34.48%)
Mutual labels:  big-data, pyspark
pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (+148.28%)
Mutual labels:  big-data, pyspark
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+11468.97%)
Mutual labels:  big-data, pyspark
xcast
A High-Performance Data Science Toolkit for the Earth Sciences
Stars: ✭ 28 (-3.45%)
Mutual labels:  big-data
OnlineStatsBase.jl
Base types for OnlineStats.
Stars: ✭ 26 (-10.34%)
Mutual labels:  big-data
bigquery-kafka-connect
☁️ Node.js Kafka Connect connector for Google BigQuery
Stars: ✭ 17 (-41.38%)
Mutual labels:  big-data
jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (+168.97%)
Mutual labels:  pyspark
osm-data-classification
Migrated to: https://gitlab.com/Oslandia/osm-data-classification
Stars: ✭ 23 (-20.69%)
Mutual labels:  data-quality
MLBD
Materials for the "Machine Learning on Big Data" course
Stars: ✭ 20 (-31.03%)
Mutual labels:  big-data

Summary

The goal of this project is to implement a data validation library for PySpark. The library detects incorrect data structures, unexpected values in columns, and anomalies in the data.

How to install

pip install checkengine==0.2.0
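
After installation, a quick import in the target environment confirms that the package and its PySpark dependency resolve correctly (this one-liner is just a suggested smoke test, not part of the official instructions):

python -c "from checkengine.validate_df import ValidateSparkDataFrame"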

How to use

from checkengine.validate_df import ValidateSparkDataFrame

result = ValidateSparkDataFrame(spark_session, spark_data_frame) \
        .is_not_null("column_name") \
        .are_not_null(["column_name_2", "column_name_3"]) \
        .is_min("numeric_column", 10) \
        .is_max("numeric_column", 20) \
        .is_unique("column_name") \
        .are_unique(["column_name_2", "column_name_3"]) \
        .is_between("numeric_column_2", 10, 15) \
        .has_length_between("text_column", 0, 10) \
        .mean_column_value("numeric_column", 10, 20) \
        .median_column_value("numeric_column", 5, 15) \
        .text_matches_regex("text_column", "^[a-z]{3,10}$") \
        .one_of("text_column", ["value_a", "value_b"]) \
        .one_of("numeric_column", [123, 456]) \
        .execute()

result.correct_data  # rows that passed the validation
result.erroneous_data  # rows rejected during the validation
result.errors  # a summary of validation errors (three fields: column_name, constraint_name, number_of_errors)
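
For a fuller picture, here is an end-to-end sketch that builds a local SparkSession, creates a two-row DataFrame, and runs two of the checks shown above. The sample data and column names are invented for illustration, and the shape of the entries in result.errors is assumed from the field list above; only the ValidateSparkDataFrame calls themselves come from this README.

from pyspark.sql import SparkSession

from checkengine.validate_df import ValidateSparkDataFrame

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("checkengine-demo") \
    .getOrCreate()

# Hypothetical sample data: the second row violates both checks below.
df = spark.createDataFrame(
    [("abc", 12), (None, 42)],
    schema=["text_column", "numeric_column"])

result = ValidateSparkDataFrame(spark, df) \
    .is_not_null("text_column") \
    .is_between("numeric_column", 10, 20) \
    .execute()

result.correct_data.show()    # only the ("abc", 12) row
result.erroneous_data.show()  # the (None, 42) row

# Assumes each entry exposes the three documented fields.
for error in result.errors:
    print(error.column_name, error.constraint_name, error.number_of_errors)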

How to build

  1. Install the Poetry build tool.

  2. Run the following commands:

cd check-engine-lib
poetry build
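
By default, Poetry writes the build artifacts to a dist/ directory; assuming the default output location, the resulting wheel can then be installed into a local environment with pip:

cd check-engine-lib
pip install dist/*.whl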

How to test locally

Run all tests

cd check-engine-lib
poetry run pytest tests/

Run a single test file

cd check-engine-lib
poetry run pytest tests/test_between_integer.py

Run a single test method

cd check-engine-lib
poetry run pytest tests/test_between_integer.py -k 'test_should_return_df_without_changes_if_all_are_between'
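
For orientation, a test like the one named above might look as follows. This is a hedged sketch, not code from the repository: the fixture and the assertions about result.errors are assumptions based on the API described earlier.

import pytest
from pyspark.sql import SparkSession

from checkengine.validate_df import ValidateSparkDataFrame


@pytest.fixture(scope="session")
def spark_session():
    # A small local session is enough for unit tests.
    return SparkSession.builder \
        .master("local[1]") \
        .appName("check-engine-tests") \
        .getOrCreate()


def test_should_return_df_without_changes_if_all_are_between(spark_session):
    df = spark_session.createDataFrame(
        [(10,), (12,), (15,)],
        schema=["numeric_column"])

    result = ValidateSparkDataFrame(spark_session, df) \
        .is_between("numeric_column", 10, 15) \
        .execute()

    assert result.correct_data.count() == 3
    assert result.erroneous_data.count() == 0
    assert not result.errors  # assumed: empty when every row passes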

How to test in Docker

docker build -t check-engine-test check-engine-lib/. && docker run check-engine-test