
mikulskibartosz / check-engine

License: MIT
Data validation library for PySpark 3.0.0

Programming Languages

Python
139,335 projects - #7 most used programming language
Dockerfile
14,818 projects

Projects that are alternatives to or similar to check-engine

aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (+282.76%)
Mutual labels:  big-data, pyspark
Bitcoin Value Predictor
[NOT MAINTAINED] Predicting Bitcoin price using time-series analysis and sentiment analysis of tweets on Bitcoin
Stars: ✭ 91 (+213.79%)
Mutual labels:  big-data, pyspark
mmtf-workshop-2018
Structural Bioinformatics Training Workshop & Hackathon 2018
Stars: ✭ 50 (+72.41%)
Mutual labels:  big-data, pyspark
MMLSpark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+9896.55%)
Mutual labels:  big-data, pyspark
soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames
Stars: ✭ 58 (+100%)
Mutual labels:  pyspark, data-quality
pyspark-cheatsheet
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Stars: ✭ 115 (+296.55%)
Mutual labels:  big-data, pyspark
PySpark Setup Demo
Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks
Stars: ✭ 24 (-17.24%)
Mutual labels:  big-data, pyspark
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (+17.24%)
Mutual labels:  big-data, pyspark
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+644.83%)
Mutual labels:  big-data, pyspark
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+417.24%)
Mutual labels:  big-data, pyspark
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+4513.79%)
Mutual labels:  big-data, pyspark
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake: SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+34.48%)
Mutual labels:  big-data, pyspark
pyspark-algorithms
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Stars: ✭ 72 (+148.28%)
Mutual labels:  big-data, pyspark
SynapseML
Simple and Distributed Machine Learning
Stars: ✭ 3,355 (+11468.97%)
Mutual labels:  big-data, pyspark
xcast
A High-Performance Data Science Toolkit for the Earth Sciences
Stars: ✭ 28 (-3.45%)
Mutual labels:  big-data
OnlineStatsBase.jl
Base types for OnlineStats.
Stars: ✭ 26 (-10.34%)
Mutual labels:  big-data
bigquery-kafka-connect
☁️ Node.js Kafka Connect connector for Google BigQuery
Stars: ✭ 17 (-41.38%)
Mutual labels:  big-data
jupyterlab-sparkmonitor
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
Stars: ✭ 78 (+168.97%)
Mutual labels:  pyspark
osm-data-classification
Migrated to: https://gitlab.com/Oslandia/osm-data-classification
Stars: ✭ 23 (-20.69%)
Mutual labels:  data-quality
MLBD
Materials for the "Machine Learning on Big Data" course
Stars: ✭ 20 (-31.03%)
Mutual labels:  big-data

Summary

The goal of this project is to implement a data validation library for PySpark. The library detects incorrect data structures, unexpected values in columns, and anomalies in the data.

How to install

pip install checkengine==0.2.0
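
After installation, a quick import in the target environment confirms that the package and its PySpark dependency resolve correctly (this one-liner is just a suggested smoke test, not part of the official instructions):

python -c "from checkengine.validate_df import ValidateSparkDataFrame"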

How to use

from checkengine.validate_df import ValidateSparkDataFrame

result = ValidateSparkDataFrame(spark_session, spark_data_frame) \
        .is_not_null("column_name") \
        .are_not_null(["column_name_2", "column_name_3"]) \
        .is_min("numeric_column", 10) \
        .is_max("numeric_column", 20) \
        .is_unique("column_name") \
        .are_unique(["column_name_2", "column_name_3"]) \
        .is_between("numeric_column_2", 10, 15) \
        .has_length_between("text_column", 0, 10) \
        .mean_column_value("numeric_column", 10, 20) \
        .median_column_value("numeric_column", 5, 15) \
        .text_matches_regex("text_column", "^[a-z]{3,10}$") \
        .one_of("text_column", ["value_a", "value_b"]) \
        .one_of("numeric_column", [123, 456]) \
        .execute()

result.correct_data  # rows that passed the validation
result.erroneous_data  # rows rejected during the validation
result.errors  # a summary of validation errors (three fields: column_name, constraint_name, number_of_errors)
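
For a fuller picture, here is an end-to-end sketch that builds a local SparkSession, creates a two-row DataFrame, and runs two of the checks shown above. The sample data and column names are invented for illustration, and the shape of the entries in result.errors is assumed from the field list above; only the ValidateSparkDataFrame calls themselves come from this README.

from pyspark.sql import SparkSession

from checkengine.validate_df import ValidateSparkDataFrame

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("checkengine-demo") \
    .getOrCreate()

# Hypothetical sample data: the second row violates both checks below.
df = spark.createDataFrame(
    [("abc", 12), (None, 42)],
    schema=["text_column", "numeric_column"])

result = ValidateSparkDataFrame(spark, df) \
    .is_not_null("text_column") \
    .is_between("numeric_column", 10, 20) \
    .execute()

result.correct_data.show()    # only the ("abc", 12) row
result.erroneous_data.show()  # the (None, 42) row

# Assumes each entry exposes the three documented fields.
for error in result.errors:
    print(error.column_name, error.constraint_name, error.number_of_errors)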

How to build

  1. Install the Poetry build tool.

  2. Run the following commands:

cd check-engine-lib
poetry build
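
By default, Poetry writes the build artifacts to a dist/ directory; assuming the default output location, the resulting wheel can then be installed into a local environment with pip:

cd check-engine-lib
pip install dist/*.whl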

How to test locally

Run all tests

cd check-engine-lib
poetry run pytest tests/

Run a single test file

cd check-engine-lib
poetry run pytest tests/test_between_integer.py

Run a single test method

cd check-engine-lib
poetry run pytest tests/test_between_integer.py -k 'test_should_return_df_without_changes_if_all_are_between'
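
For orientation, a test like the one named above might look as follows. This is a hedged sketch, not code from the repository: the fixture and the assertions about result.errors are assumptions based on the API described earlier.

import pytest
from pyspark.sql import SparkSession

from checkengine.validate_df import ValidateSparkDataFrame


@pytest.fixture(scope="session")
def spark_session():
    # A small local session is enough for unit tests.
    return SparkSession.builder \
        .master("local[1]") \
        .appName("check-engine-tests") \
        .getOrCreate()


def test_should_return_df_without_changes_if_all_are_between(spark_session):
    df = spark_session.createDataFrame(
        [(10,), (12,), (15,)],
        schema=["numeric_column"])

    result = ValidateSparkDataFrame(spark_session, df) \
        .is_between("numeric_column", 10, 15) \
        .execute()

    assert result.correct_data.count() == 3
    assert result.erroneous_data.count() == 0
    assert not result.errors  # assumed: empty when every row passes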

How to test in Docker

docker build -t check-engine-test check-engine-lib/. && docker run check-engine-test