

Great Expectations

Always know what to expect from your data.

Introduction

Great Expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling.

Software developers have long known that testing and documentation are essential for managing complex codebases. Great Expectations brings the same confidence, integrity, and acceleration to data science and data engineering teams.

See Down with Pipeline Debt! for an introduction to the philosophy of pipeline testing.

Key features

Expectations

Expectations are assertions for data. They are the workhorse abstraction in Great Expectations, covering all kinds of common data issues, including:

  • expect_column_values_to_not_be_null
  • expect_column_values_to_match_regex
  • expect_column_values_to_be_unique
  • expect_column_values_to_match_strftime_format
  • expect_table_row_count_to_be_between
  • expect_column_median_to_be_between
  • ...and many more

Expectations are declarative, flexible and extensible.
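The core idea can be sketched in a few lines of dependency-free Python. This is an illustration of the concept only, not the Great Expectations API: an Expectation checks data and returns a structured result, rather than raising an exception.

```python
def expect_column_values_to_not_be_null(rows, column):
    """Toy version of the Expectation named above: report, don't raise."""
    values = [row.get(column) for row in rows]
    unexpected = sum(1 for v in values if v is None)
    return {
        "success": unexpected == 0,
        "result": {"element_count": len(values), "unexpected_count": unexpected},
    }

rows = [{"user_id": 1}, {"user_id": None}, {"user_id": 3}]
result = expect_column_values_to_not_be_null(rows, "user_id")
print(result["success"])  # False: one of the three values is null
```

Because the result is data rather than an exception, it can be stored, rendered into documentation, or routed to a notification, which is what the features below build on.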

Batteries-included data validation

Expectations are a great start, but it takes more to get to production-ready data validation. Where are Expectations stored? How do they get updated? How do you securely connect to production data systems? How do you notify team members and triage when data validation fails?

Great Expectations supports all of these use cases out of the box. Instead of building these components yourself over weeks or months, you can add production-ready validation to your pipeline in a day. This “Expectations on rails” framework plays nicely with other data engineering tools, respects your existing namespaces, and is designed for extensibility.
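A hedged sketch of that workflow in plain Python (names and structure are illustrative, not the Great Expectations API): load a stored suite of expectations, validate a batch of data against it, and trigger a notification action when validation fails.

```python
def validate(rows, suite):
    """Run every expectation in a stored suite against a batch of rows."""
    results = []
    for exp in suite:
        column = exp["kwargs"]["column"]
        if exp["expectation_type"] == "expect_column_values_to_not_be_null":
            ok = all(row.get(column) is not None for row in rows)
        else:
            ok = True  # unknown expectation types are skipped in this toy
        results.append({"expectation": exp, "success": ok})
    return {"success": all(r["success"] for r in results), "results": results}

def notify_on_failure(validation, send):
    """Post-validation action: call a messaging hook when anything failed."""
    if not validation["success"]:
        failed = sum(1 for r in validation["results"] if not r["success"])
        send(f"Data validation failed: {failed} expectation(s) not met")

suite = [{"expectation_type": "expect_column_values_to_not_be_null",
          "kwargs": {"column": "order_id"}}]
batch = [{"order_id": 1}, {"order_id": None}]
outcome = validate(batch, suite)
notify_on_failure(outcome, send=print)  # stand-in for a Slack webhook
```

In the real framework, the suite lives in a configurable store, batches come from your data sources, and actions such as Slack notifications run automatically after each validation.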


Tests are docs and docs are tests

Note: this feature is in beta.

Many data teams struggle to maintain up-to-date data documentation. Great Expectations solves this problem by rendering Expectations directly into clean, human-readable documentation.

Since docs are rendered from tests, and tests are run against new data as it arrives, your documentation is guaranteed never to go stale. Additional renderers allow Great Expectations to generate other types of "documentation", including Slack notifications, data dictionaries, customized notebooks, etc.

Your tests are your docs and your docs are your tests
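The rendering idea can be sketched in a few lines (illustrative only; the real renderers are far richer): the same machine-readable expectation that drives validation is translated into a human-readable sentence.

```python
# Templates mapping expectation types to prose (hypothetical wording).
TEMPLATES = {
    "expect_column_values_to_not_be_null":
        "values in column '{column}' must never be null",
    "expect_column_values_to_be_unique":
        "values in column '{column}' must be unique",
}

def render(expectation):
    """Turn one expectation config into a line of documentation."""
    template = TEMPLATES[expectation["expectation_type"]]
    return template.format(**expectation["kwargs"])

doc = render({"expectation_type": "expect_column_values_to_be_unique",
              "kwargs": {"column": "email"}})
print(doc)  # values in column 'email' must be unique
```

Because prose is generated from the expectation config, editing a test automatically updates the docs, which is why the two can never drift apart.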

Automated data profiling

Note: this feature is experimental.

Wouldn't it be great if your tests could write themselves? Run your data through one of Great Expectations' data profilers and it will automatically generate Expectations and data documentation. Profiling provides the double benefit of helping you explore data faster, and capturing knowledge for future documentation and testing.
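A toy version of that idea in plain Python (not the Great Expectations profiler): scan a sample of data and emit candidate expectations for whatever properties currently hold.

```python
def profile(rows):
    """Emit candidate expectations for properties observed in the sample."""
    candidates = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(v is not None for v in values):
            candidates.append({"expectation_type": "expect_column_values_to_not_be_null",
                               "kwargs": {"column": column}})
        if len(set(values)) == len(values):
            candidates.append({"expectation_type": "expect_column_values_to_be_unique",
                               "kwargs": {"column": column}})
    return candidates

sample = [{"id": 1, "country": "FR"}, {"id": 2, "country": "FR"}]
suite = profile(sample)
# "id" is non-null and unique; "country" is non-null but not unique.
```

The output is a starting suite, not a finished one: properties that happen to hold in a sample (such as uniqueness) may not be true invariants, which is why the generated expectations need human review.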


Automated profiling doesn't replace domain expertise—you will almost certainly tune and augment your auto-generated Expectations over time—but it's a great way to jump start the process of capturing and sharing domain knowledge across your team.

Pluggable and extensible

Every component of the framework is designed to be extensible: Expectations, storage, profilers, renderers for documentation, actions taken after validation, etc. This design choice gives a lot of creative freedom to developers working with Great Expectations.
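One common way to achieve this kind of extensibility is a registry pattern, sketched below in plain Python (an illustration of the design idea, not the actual plugin mechanism): new Expectations register themselves by name, and the engine looks them up dynamically.

```python
EXPECTATIONS = {}

def register(func):
    """Decorator: make a new expectation discoverable by name."""
    EXPECTATIONS[func.__name__] = func
    return func

@register
def expect_column_values_to_be_positive(values):
    """Hypothetical custom expectation contributed by a plugin."""
    bad = sum(1 for v in values if v is not None and v <= 0)
    return {"success": bad == 0, "unexpected_count": bad}

# The engine needs only the name, so plugins can live in separate packages.
check = EXPECTATIONS["expect_column_values_to_be_positive"]
report = check([3, 1, -2])
```

Because components are resolved by name at runtime, third-party packages can contribute new Expectations, stores, or renderers without modifying the core library.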

Recent extensions include:

We're very excited to see what other plugins the data community comes up with!

Quick start

To see Great Expectations in action on your own data, install it using pip:

pip install great_expectations

or conda:

conda install -c conda-forge great-expectations

and then run:

great_expectations init

(We recommend deploying within a virtual environment. If you’re not familiar with pip, virtual environments, notebooks, or git, you may want to check out the Supporting Resources, which will teach you how to get up and running in minutes.)

For full documentation, visit Great Expectations on readthedocs.io.

If you need help, hop into our Slack channel—there are always contributors and other users there.

Integrations

Great Expectations works with the tools and systems that you're already using with your data, including:

Integration                Notes
Pandas                     Great for in-memory machine learning pipelines!
Spark                      Good for really big data.
Postgres                   Leading open source database
BigQuery                   Google serverless massive-scale SQL analytics platform
Databricks                 Managed Spark analytics platform
MySQL                      Leading open source database
AWS Redshift               Cloud-based data warehouse
AWS S3                     Cloud-based blob storage
Snowflake                  Cloud-based data warehouse
Apache Airflow             An open source orchestration engine
Other SQL relational DBs   Most RDBMSs are supported via SQLAlchemy
Jupyter Notebooks          The best way to build Expectations
Slack                      Get automatic data quality notifications!

What does Great Expectations not do?

Great Expectations is not a pipeline execution framework.

We aim to integrate seamlessly with DAG execution tools like Spark, Airflow, dbt, Prefect, Dagster, Kedro, Flyte, etc. We DON'T execute your pipelines for you.

Great Expectations is not a data versioning tool.

Great Expectations does not store data itself. Instead, it deals in metadata about data: Expectations, validation results, etc. If you want to bring your data itself under version control, check out tools like DVC and Quilt.

Great Expectations currently works best in a Python/Bash environment.

Following the philosophy of "take the compute to the data," Great Expectations currently supports native execution of Expectations in three environments: pandas, SQL (through the SQLAlchemy core), and Spark. That said, all orchestration in Great Expectations is Python-based. You can invoke it from the command line without using a Python programming environment, but if you're working in another ecosystem, other tools might be a better choice. If you're running in a pure R environment, you might consider assertr as an alternative. Within the TensorFlow ecosystem, TFDV fulfills a similar function to Great Expectations.

Who maintains Great Expectations?

Great Expectations is under active development by James Campbell, Abe Gong, Eugene Mandel, Rob Lim, and Taylor Miller, with help from many others.

What's the best way to get in touch with the Great Expectations team?

If you have questions, comments, or just want to have a good old-fashioned chat about data pipelines, please hop on our public Slack channel.

If you'd like hands-on assistance setting up Great Expectations, establishing a healthy practice of data testing, or adding functionality to Great Expectations, please see options for consulting help here.

Can I contribute to the library?

Absolutely. Yes, please. Start here and please don't be shy with questions.
