All Projects → sodadata → Soda Sql

sodadata / Soda Sql

Licence: apache-2.0
Metric collection, data testing and monitoring for SQL accessible data

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Soda Sql

Airbyte
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+2743.35%)
Mutual labels:  data-science, data-engineering
Butterfree
A tool for building feature stores.
Stars: ✭ 126 (-27.17%)
Mutual labels:  data-science, data-engineering
D6t Python
Accelerate data science
Stars: ✭ 118 (-31.79%)
Mutual labels:  data-science, data-engineering
Applied Ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Stars: ✭ 17,824 (+10202.89%)
Mutual labels:  data-science, data-engineering
Accelerator
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Stars: ✭ 137 (-20.81%)
Mutual labels:  data-science, data-engineering
Superset
Apache Superset is a Data Visualization and Data Exploration Platform
Stars: ✭ 42,634 (+24543.93%)
Mutual labels:  data-science, data-engineering
Aws Data Wrangler
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Stars: ✭ 2,385 (+1278.61%)
Mutual labels:  data-science, data-engineering
Data Science On Gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Stars: ✭ 864 (+399.42%)
Mutual labels:  data-science, data-engineering
Airflow Autoscaling Ecs
Airflow Deployment on AWS ECS Fargate Using Cloudformation
Stars: ✭ 136 (-21.39%)
Mutual labels:  airflow, data-engineering
Beyond Jupyter
🐍💻📊 All material from the PyCon.DE 2018 Talk "Beyond Jupyter Notebooks - Building your own data science platform with Python & Docker" (incl. Slides, Video, Udemy MOOC & other References)
Stars: ✭ 135 (-21.97%)
Mutual labels:  airflow, data-science
Dataengineeringproject
Example end to end data engineering project.
Stars: ✭ 82 (-52.6%)
Mutual labels:  airflow, data-engineering
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-12.14%)
Mutual labels:  data-science, data-engineering
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-54.34%)
Mutual labels:  data-science, data-engineering
Just Dashboard
📊 📋 Dashboards using YAML or JSON files
Stars: ✭ 1,511 (+773.41%)
Mutual labels:  data-science, data-engineering
Sayn
Data processing and modelling framework for automating tasks (incl. Python & SQL transformations).
Stars: ✭ 79 (-54.34%)
Mutual labels:  data-science, data-engineering
Spark Alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
Stars: ✭ 122 (-29.48%)
Mutual labels:  data-science, data-engineering
Prefect
The easiest way to automate your data
Stars: ✭ 7,956 (+4498.84%)
Mutual labels:  data-science, data-engineering
Goodreads etl pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Stars: ✭ 793 (+358.38%)
Mutual labels:  airflow, data-engineering
Pipelinex
PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more
Stars: ✭ 127 (-26.59%)
Mutual labels:  data-science, data-engineering
Data Science Stack Cookiecutter
🐳📊🤓Cookiecutter template to launch an awesome dockerized Data Science toolstack (incl. Jupyster, Superset, Postgres, Minio, AirFlow & API Star)
Stars: ✭ 153 (-11.56%)
Mutual labels:  airflow, data-science

Soda logo

Soda SQL

Data testing, monitoring and profiling for SQL accessible data.

License: Apache 2.0 Slack Pypi Soda SQL Build soda-sql

What does Soda SQL do?

Soda SQL allows you to

  • Stop your pipeline when bad data is detected
  • Extract metrics and column profiles through super efficient SQL
  • Full control over metrics and queries through declarative config files

Why Soda SQL?

To protect against silent data issues for the consumers of your data, it's best-practice to profile and test your data:

  • as it lands in your warehouse,
  • after every important data processing step
  • right before consumption.

This way you will prevent delivery of bad data to downstream consumers. You will spend less time firefighting and gain a better reputation.

How does Soda SQL work?

Soda SQL is a Command Line Interface (CLI) and a Python library to measure and test your data using SQL.

As input, Soda SQL uses YAML configuration files that include:

  • SQL connection details
  • What metrics to compute
  • What tests to run on the measurements

Based on those configuration files, Soda SQL will perform scans. A scan performs all measurements and runs all tests associated with one table. Typically a scan is executed after new data has arrived. All soda-sql configuration files can be checked into your version control system as part of your pipeline code.

Want to try Soda SQL? Head over to our '5 minute tutorial' and get started straight away!

"Show me the metrics"

Let's walk through an example. Simple metrics and tests can be configured in scan YAML configuration files. An example of the contents of such a file:

metrics:
    - row_count
    - missing_count
    - missing_percentage
    - values_count
    - values_percentage
    - valid_count
    - valid_percentage
    - invalid_count
    - invalid_percentage
    - min
    - max
    - avg
    - sum
    - min_length
    - max_length
    - avg_length
    - distinct
    - unique_count
    - duplicate_count
    - uniqueness
    - maxs
    - mins
    - frequent_values
    - histogram
columns:
    ID:
        metrics:
            - distinct
            - duplicate_count
        valid_format: uuid
        tests:
            duplicate_count == 0
    CATEGORY:
        missing_values:
            - N/A
            - No category
        tests:
            missing_percentage < 3
    SIZE:
        tests:
            max - min < 20
sql_metrics:
    - sql: |
        SELECT sum(volume) as total_volume_us
        FROM CUSTOMER_TRANSACTIONS
        WHERE country = 'US'
      tests:
        - total_volume_us > 5000

Based on these configuration files, Soda SQL will scan your data each time new data arrived like this:

$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
  | SELECT column_name, data_type, is_nullable
  | FROM information_schema.columns
  | WHERE lower(table_name) = 'customers'
  |   AND table_catalog = 'datasource.database'
  |   AND table_schema = 'datasource.schema'
  - 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
  | SELECT
  |  COUNT(*),
  |  COUNT(CASE WHEN ID IS NULL THEN 1 END),
  |  COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
  |  MIN(LENGTH(ID)),
  |  AVG(LENGTH(ID)),
  |  MAX(LENGTH(ID)),
  | FROM customers
  - 0.557 seconds
row_count : 23543
missing   : 23
invalid   : 0
min_length: 9
avg_length: 9
max_length: 9

...more queries...

47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds

The next step is to add Soda SQL scans in your favorite data pipeline orchestration solution like:

  • Airflow
  • AWS Glue
  • Prefect
  • Dagster
  • Fivetran
  • Matillion
  • Luigi

If you like the goals of this project, encourage us! Star sodadata/soda-sql on Github.

Next, head over to our '5 minute tutorial' and get your first project going!

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].