sodadata / soda-spark

License: Apache-2.0
Soda Spark is a PySpark library that helps you test the data in your Spark DataFrames


Projects that are alternatives of or similar to soda-spark

re-data
re_data - fix data issues before your users & CEO would discover them 😊
Stars: ✭ 955 (+1546.55%)
Mutual labels:  data-quality, data-testing, data-observability
contessa
Easy way to define, execute and store quality rules for your data.
Stars: ✭ 17 (-70.69%)
Mutual labels:  data-engineering, data-quality
Great expectations
Always know what to expect from your data.
Stars: ✭ 5,808 (+9913.79%)
Mutual labels:  data-engineering, data-quality
versatile-data-kit
Versatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.
Stars: ✭ 144 (+148.28%)
Mutual labels:  data-engineering, data-quality
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+991.38%)
Mutual labels:  pyspark, data-engineering
Applied Ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Stars: ✭ 17,824 (+30631.03%)
Mutual labels:  data-engineering, data-quality
check-engine
Data validation library for PySpark 3.0.0
Stars: ✭ 29 (-50%)
Mutual labels:  pyspark, data-quality
jobAnalytics and search
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Stars: ✭ 25 (-56.9%)
Mutual labels:  pyspark, data-engineering
DataEngineering
This repo contains commands that data engineers use in day to day work.
Stars: ✭ 47 (-18.97%)
Mutual labels:  pyspark, data-engineering
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (+117.24%)
Mutual labels:  pyspark, data-engineering
Morphl Community Edition
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Stars: ✭ 253 (+336.21%)
Mutual labels:  pyspark
airflow-dbt-python
A collection of Airflow operators, hooks, and utilities to elevate dbt to a first-class citizen of Airflow.
Stars: ✭ 111 (+91.38%)
Mutual labels:  data-engineering
pyspark-cassandra
pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible w/ Spark 2.0, 2.1, 2.2, 2.3 and 2.4
Stars: ✭ 70 (+20.69%)
Mutual labels:  pyspark
jgit-spark-connector
jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
Stars: ✭ 71 (+22.41%)
Mutual labels:  pyspark
Quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
Stars: ✭ 217 (+274.14%)
Mutual labels:  pyspark
spark3D
Spark extension for processing large-scale 3D data sets: Astrophysics, High Energy Physics, Meteorology, …
Stars: ✭ 23 (-60.34%)
Mutual labels:  pyspark
Gimel
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Stars: ✭ 216 (+272.41%)
Mutual labels:  pyspark
Mmlspark
Simple and Distributed Machine Learning
Stars: ✭ 2,899 (+4898.28%)
Mutual labels:  pyspark
Spark Practice
Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (+244.83%)
Mutual labels:  pyspark
qsv
CSVs sliced, diced & analyzed.
Stars: ✭ 438 (+655.17%)
Mutual labels:  data-engineering

Soda Spark


Data testing, monitoring, and profiling for Spark DataFrames.


Soda Spark is an extension of Soda SQL that allows you to run Soda SQL functionality programmatically on a Spark data frame.

Soda SQL is an open-source command-line tool. It utilizes user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface "bad" data that you can fix to ensure that downstream analysts are using "good" data to make decisions.
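
As a rough illustration, a row_count metric with a row_count > 0 test boils down to a query plus a comparison along these lines. This is a simplified sketch, not the SQL Soda SQL actually generates (which batches many metrics into one query); spark and demodata stand in for an existing SparkSession and table:

# simplified sketch: a row_count metric with a "row_count > 0" test
row_count = spark.sql("SELECT COUNT(*) AS row_count FROM demodata").first()["row_count"]
test_passed = row_count > 0  # Soda SQL surfaces this outcome as a test result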

Requirements

Soda Spark has the same requirements as soda-sql-spark.

Install

From your shell, execute the following command.

$ pip install soda-spark

Use

From your Python prompt, execute the following commands.

>>> from pyspark.sql import DataFrame, SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
...     {"id": id, "name": "Paula Landry", "size": 3006},
...     {"id": id, "name": "Kevin Crawford", "size": 7243}
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... sql_metrics:
... - sql: |
...     SELECT sum(size) as total_size_us
...     FROM demodata
...     WHERE country = 'US'
...   tests:
...   - total_size_us > 5000
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements  # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>> scan_result.test_results  # doctest: +ELLIPSIS
[TestResult(test=Test(..., expression='row_count > 0', ...), passed=True, skipped=False, ...)]
>>>
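
The scan result can also be inspected programmatically. For example, to keep only the failing tests (a small sketch that relies on the passed attribute shown above):

>>> failed_tests = [result for result in scan_result.test_results if not result.passed]
>>>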

Or, use a scan YAML file

>>> scan_yml = "static/demodata.yml"
>>> scan_result = scan.execute(scan_yml, df)
>>>
>>> scan_result.measurements  # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>>
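
The file referenced above is a regular Soda SQL scan YAML file. As an illustration, static/demodata.yml could contain the same definition as the inline example earlier; hypothetical contents:

table_name: demodata
metrics:
- row_count
- max
- min_length
tests:
- row_count > 0
columns:
  id:
    valid_format: uuid
    tests:
    - invalid_percentage == 0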

See the scan result object for all attributes and methods.

Or, return Spark data frames:

>>> measurements, test_results, errors = scan.execute(scan_yml, df, as_frames=True)
>>>
>>> measurements  # doctest: +ELLIPSIS
DataFrame[metric: string, column_name: string, value: string, ...]
>>> test_results  # doctest: +ELLIPSIS
DataFrame[test: struct<...>, passed: boolean, skipped: boolean, values: map<string,string>, ...]
>>>

See the _to_data_frame functions in scan.py for how the conversion is done.
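
Since these are ordinary Spark data frames, the usual DataFrame API applies. For example, selecting only the failing tests (a minimal sketch that uses the passed column shown above):

>>> failing_tests = test_results.filter(~test_results.passed)
>>>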

Send results to Soda Cloud

Send the scan result to Soda Cloud.

>>> import os
>>> from sodasql.soda_server_client.soda_server_client import SodaServerClient
>>>
>>> soda_server_client = SodaServerClient(
...     host="cloud.soda.io",
...     api_key_id=os.getenv("API_PUBLIC"),
...     api_key_secret=os.getenv("API_PRIVATE"),
... )
>>> scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)
>>>

Understand

Under the hood, soda-spark does the following:

  1. Sets up the scan
  2. Creates (or replaces) a global temporary view for the Spark data frame (see the sketch after this list)
  3. Executes the scan on the temporary view
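
Step 2 relies on the standard PySpark temporary-view mechanism. The sketch below shows roughly what happens to the data frame; the assumption that the view name comes from the scan definition's table_name is for illustration only:

>>> # roughly what soda-spark does before executing the scan
>>> df.createOrReplaceGlobalTempView("demodata")
>>>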