whylabs / Whylogs

Licence: apache-2.0
Profile and monitor your ML data pipeline end-to-end

Programming Languages

python

whylogs Library

This is a Python implementation of whylogs. The Java implementation can be found here.

Understanding the properties of data as it moves through applications is essential to keeping your ML/AI pipeline stable and improving your user experience, whether your pipeline is built for production or experimentation. whylogs is an open source statistical logging library that allows data science and ML teams to effortlessly profile ML/AI pipelines and applications, producing log files that can be used for monitoring, alerts, analytics, and error analysis.

whylogs calculates approximate statistics for datasets of any size up to TB-scale, making it easy for users to identify changes in the statistical properties of a model's inputs or outputs. Using approximate statistics allows the package to run on minimal infrastructure and monitor an entire dataset, rather than miss outliers and other anomalies by only using a sample of the data to calculate statistics. These qualities make whylogs an excellent solution for profiling production ML/AI pipelines that operate on TB-scale data and with enterprise SLAs.

For questions and discussions, join our Slack channel!

Key Features

  • Data Insight: whylogs provides complex statistics across different stages of your ML/AI pipelines and applications.

  • Scalability: whylogs scales with your system, from local development mode to live production systems in multi-node clusters, and works well with batch and streaming architectures.

  • Lightweight: whylogs produces small mergeable lightweight outputs in a variety of formats, using sketching algorithms and summarizing statistics.

  • Unified data instrumentation: To enable data engineering pipelines and ML pipelines to share a common framework for tracking data quality and drifts, the whylogs library supports multiple languages and integrations.

  • Observability: In addition to supporting traditional monitoring approaches, whylogs data can support advanced ML-focused analytics, error analysis, and data quality and data drift detection.
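To make the "small mergeable outputs" idea concrete, here is a minimal sketch of a summary that can be computed independently on separate batches or nodes and then combined. It is an illustration only: the `RunningStats` class and `summarize` helper are hypothetical names, not part of the whylogs API, and the sketch uses Welford's online update with the standard pairwise merge formula rather than whatever whylogs does internally.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical mergeable summary of one numeric column (not whylogs API)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0              # sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        # Welford's online update: a single pass with constant memory.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two partial summaries without revisiting the raw data.
        n = self.count + other.count
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2,
                            min(self.minimum, other.minimum),
                            max(self.maximum, other.maximum))

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two values.
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")


def summarize(values):
    """Build a RunningStats summary from an iterable of numbers."""
    stats = RunningStats()
    for v in values:
        stats.update(v)
    return stats


# Two batches profiled separately, then merged into one summary.
batch_a = [1.0, 2.0, 3.5, 10.0]
batch_b = [4.0, -2.0, 7.25]
merged = summarize(batch_a).merge(summarize(batch_b))
print(merged.count, merged.minimum, merged.maximum, merged.variance)
```

The merged summary matches what a single pass over all the data would produce, which is the property that lets a profile of this kind work in both batch and streaming architectures.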

Statistical Profile

whylogs collects approximate statistics and sketches of data on a per-column basis into a statistical profile. These metrics include:

  • Simple counters: boolean, null values, data types.
  • Summary statistics: sum, min, max, variance.
  • Unique value counter or cardinality: tracks the approximate number of unique values of your feature using the HyperLogLog algorithm.
  • Histograms for numerical features. The whylogs binary output can be queried with dynamic binning based on the shape of your data.
  • Top frequent items (default is 128). Note that this configuration affects the memory footprint, especially for text features.
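As a rough illustration of the cardinality metric listed above, the following is a compact, self-contained HyperLogLog sketch. It is not the implementation whylogs uses: the `HyperLogLog` class, its `p` parameter, and the choice of SHA-1 as the hash function are assumptions made for this example only.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (not the whylogs implementation)."""

    def __init__(self, p: int = 12):
        self.p = p                      # number of bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value) -> None:
        # Derive a 64-bit hash from the value's string form.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Sketches merge by taking the register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Duplicates do not change the sketch, so only distinct values are counted.
hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")
print(round(hll.estimate()))
```

With `p = 12` the sketch keeps 4,096 small registers regardless of how many values it sees, and the theoretical standard error is about 1.04 / sqrt(4096), or roughly 1.6%.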

Examples

For a full set of our examples, please check out whylogs-examples.

Note that to run the examples with matplotlib visualization, you'll have to install whylogs with the viz dependencies:

pip install "whylogs[viz]"

Check out our example notebooks with Binder.

Installation

Using pip

Install whylogs using the pip package manager by running

pip install whylogs

From source

Download the source code by cloning the repository or by pressing 'Download ZIP' on this page. Install by navigating to the desired directory and running

python setup.py install

Documentation

API documentation for whylogs can be found at whylogs.readthedocs.io.

Demo CLI

Our demo CLI generates a demo project flow by running

 whylogs-demo init

Quick start CLI

whylogs can be configured programmatically or by using our config YAML file. The quick start CLI can help you bootstrap the configuration for your project. To use the quick start CLI, run the following command in the root of your Python project.

 whylogs init

Glossary/Concepts

Project: A collection of related data sets used for multiple models or applications.

Pipeline: One or more datasets used to build a single model or application. A project may contain multiple pipelines.

Dataset: A collection of records. whylogs v0.0.2 supports structured datasets, which represent data as a table where each row is a different record and each column is a feature of the record.

Feature: In the context of whylogs v0.0.2 and structured data, a feature is a column in a dataset. A feature can be discrete (like gender or eye color) or continuous (like age or salary).

whylogs Output: whylogs returns profile summary files for a dataset in JSON format. For convenience, these files are provided in flat table, histogram, and frequency formats.

Statistical Profile: A collection of statistical properties of a feature. Properties can be different for discrete and continuous features.
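The glossary terms can be tied together with a small sketch: a structured dataset is a list of records, each column is a feature, and a profile holds per-feature counters. The `profile_rows` helper and the sample records below are hypothetical and exist only for illustration; they do not reflect the whylogs API or its output format.

```python
from collections import Counter, defaultdict


def profile_rows(rows):
    """Count records, nulls, and observed types per feature (column).

    `rows` is a list of dicts, one dict per record. Hypothetical helper,
    not part of the whylogs API.
    """
    profiles = defaultdict(lambda: {"count": 0, "nulls": 0, "types": Counter()})
    for row in rows:
        for feature, value in row.items():
            p = profiles[feature]
            p["count"] += 1
            if value is None:
                p["nulls"] += 1
            else:
                # Use the concrete type name, so True is recorded as "bool", not "int".
                p["types"][type(value).__name__] += 1
    return dict(profiles)


# A tiny structured dataset: each row is a record, each key is a feature.
rows = [
    {"age": 34, "eye_color": "brown", "salary": 52000.0},
    {"age": None, "eye_color": "blue", "salary": 61000.5},
    {"age": 29, "eye_color": "brown", "salary": None},
]
profile = profile_rows(rows)
for feature, stats in profile.items():
    print(feature, stats["count"], stats["nulls"], dict(stats["types"]))
```

Here `eye_color` is a discrete feature and `age` and `salary` are continuous ones; a fuller profile would attach different properties to each kind, such as frequent items for the former and histograms for the latter.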

Integrations

The whylogs library is integrated with the following:

Dependencies

For the core requirements, see requirements.txt.

For the development environment, see requirements-dev.txt.

Development/contributing

For more information on contributing to whylogs, see DEVELOPMENT.md.

Who maintains whylogs?

whylogs is maintained by WhyLabs.

If you have any questions, comments, or just want to hang out with us, please join our Slack channel.

If you want to see whylogs in action in enterprise settings with complex visualizations, check out the WhyLabs Platform Sandbox. You'll need a GitHub, Google, or LinkedIn account to log in and view the sandbox (it's a 1-click experience!).
