HoloClean / Holoclean

Licence: apache-2.0
A Machine Learning System for Data Enrichment.


HoloClean: A Machine Learning System for Data Enrichment

HoloClean is built on top of PyTorch and PostgreSQL.

HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. HoloClean saves data practitioners and scientists the enormous time they spend building piecemeal cleaning solutions. Instead, they can communicate their domain knowledge declaratively to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.

Installation

HoloClean was tested on Python versions 2.7, 3.6, and 3.7. It requires PostgreSQL version 9.4 or higher.

1. Install and configure PostgreSQL

We describe how to install PostgreSQL and configure it for HoloClean (creating a database, a user, and setting the required permissions).

Option 1: Native installation of PostgreSQL

A native installation of PostgreSQL runs faster than Docker containers. We explain how to install PostgreSQL and then how to configure it for HoloClean.

a. Installing PostgreSQL

On Ubuntu, install PostgreSQL by running

$ apt-get install postgresql postgresql-contrib

For macOS, you can find the installation instructions on https://www.postgresql.org/download/macosx/

b. Setting up PostgreSQL for HoloClean

By default, HoloClean needs a database holo and a user holocleanuser with permissions on it.

  1. Start the PostgreSQL psql console from the terminal using
    $ psql --username <username>. You can omit --username <username> to connect as the current user.

  2. Create a database holo and user holocleanuser

CREATE DATABASE holo;
CREATE USER holocleanuser;
ALTER USER holocleanuser WITH PASSWORD 'abcd1234';
GRANT ALL PRIVILEGES ON DATABASE holo TO holocleanuser;
\c holo
ALTER SCHEMA public OWNER TO holocleanuser;

You can connect to the holo database from the PostgreSQL psql console by running psql -U holocleanuser -W holo.
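For scripts that need to reach the same database, the settings above can be collected into a libpq-style connection URI. The following is a minimal sketch using only the Python standard library. The helper name build_postgres_uri and the default host localhost and port 5432 are assumptions made for illustration; they are not part of HoloClean.

```python
from urllib.parse import quote


def build_postgres_uri(user="holocleanuser", password="abcd1234",
                       host="localhost", port=5432, database="holo"):
    """Return a postgresql:// URI for the given settings.

    Hypothetical helper: the defaults mirror the user, password, and
    database created above; host and port are assumed local defaults.
    """
    # Percent-encode the parts that may contain reserved characters.
    user_q = quote(user, safe="")
    password_q = quote(password, safe="")
    database_q = quote(database, safe="")
    return "postgresql://{}:{}@{}:{}/{}".format(
        user_q, password_q, host, port, database_q)


if __name__ == "__main__":
    # prints postgresql://holocleanuser:abcd1234@localhost:5432/holo
    print(build_postgres_uri())
```

The resulting string can be passed to psql in place of the database name, or to a client library that accepts libpq connection URIs.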

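HoloClean currently populates the database holo with auxiliary and meta tables. To clear the database, connect as a root user or as holocleanuser to a different database (for example, postgres), because PostgreSQL cannot drop the database you are currently connected to, and run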

DROP DATABASE holo;
CREATE DATABASE holo;

Option 2: Using Docker

If you are familiar with Docker, an easy way to start using HoloClean is to run PostgreSQL in a Docker container.

To start a PostgreSQL docker container, run the following command:

docker run --name pghc \
    -e POSTGRES_DB=holo -e POSTGRES_USER=holocleanuser -e POSTGRES_PASSWORD=abcd1234 \
    -p 5432:5432 \
    -d postgres:11

This starts a backend server and creates a database with the required permissions.

You can then use docker start pghc and docker stop pghc to start/stop the container.

Note that the host port 5432 mapped above may conflict with an existing PostgreSQL server on your machine. Read more about this Docker image here.
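One way to check for such a conflict before starting the container is to test whether anything is already listening on the port. This is a small sketch using only the Python standard library; the function name port_in_use and the choice of 127.0.0.1 as the host are assumptions for illustration.

```python
import socket


def port_in_use(port=5432, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use():
        print("Port 5432 is already in use; map the container to another host port.")
    else:
        print("Port 5432 looks free.")
```

If the port is taken, the host side of the mapping can be changed, for example -p 5433:5432, in which case clients must connect to the new host port.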

2. Setting up HoloClean

HoloClean runs on Python 2.7 or 3.6+. We recommend running it from within a virtual environment.
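To confirm that an interpreter meets this requirement, a check along the following lines can be run inside it. This is only a sketch: the function name is_supported_python is hypothetical, and the rule encoded is simply the "2.7 or 3.6+" statement above.

```python
import sys


def is_supported_python(version=None):
    """Return True for Python 2.7 or for 3.6 and later 3.x releases."""
    if version is None:
        version = sys.version_info
    major, minor = version[0], version[1]
    return (major, minor) == (2, 7) or (major == 3 and minor >= 6)


if __name__ == "__main__":
    status = "supported" if is_supported_python() else "not supported"
    print("Python %d.%d is %s" % (sys.version_info[0], sys.version_info[1], status))
```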

Creating a virtual environment for HoloClean

Option 1: Conda Virtual Environment

First, download Anaconda (not Miniconda) from this link. Follow the steps for your OS and framework.

Second, create a conda environment (Python 2.7 or 3.6+). For example, to create a Python 3.6 conda environment, run:

$ conda create -n hc36 python=3.6

Upon starting/restarting your terminal session, you will need to activate your conda environment by running

$ conda activate hc36

Option 2: Set up a virtual environment using pip and virtualenv

If you are familiar with virtualenv, you can use it to create a virtual environment.

For Python 3.6, create a new environment with your preferred virtualenv wrapper. To install virtualenv itself, either follow the instructions here or install it via pip:

$ pip install virtualenv

Then create a new directory and initialize a Python 3.6 virtualenv environment in it:

$ mkdir -p hc36
$ virtualenv --python=python3.6 hc36

where python3.6 is a valid reference to a Python 3.6 executable.

Activate the environment

$ source hc36/bin/activate

Install the required python packages

Note: make sure that the environment is activated throughout the installation process. When you are done, deactivate it using conda deactivate, source deactivate, or deactivate depending on your version.
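A quick way to verify that an environment is active before installing is to inspect the running interpreter. The sketch below relies on two general Python and conda behaviours rather than anything HoloClean-specific: virtualenv changes sys.prefix relative to the base interpreter, and conda exports CONDA_DEFAULT_ENV on activation. The function name active_environment is hypothetical.

```python
import os
import sys


def active_environment(environ=None, prefix=None, base_prefix=None):
    """Return 'virtualenv', 'conda', or None for the current interpreter."""
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    if base_prefix is None:
        # Older virtualenv releases record the base interpreter in
        # sys.real_prefix; newer ones and venv use sys.base_prefix.
        base_prefix = getattr(
            sys, "real_prefix", getattr(sys, "base_prefix", sys.prefix))
    if prefix != base_prefix:
        return "virtualenv"
    # conda sets CONDA_DEFAULT_ENV to the environment name when activated.
    name = environ.get("CONDA_DEFAULT_ENV")
    if name and name != "base":
        return "conda"
    return None


if __name__ == "__main__":
    print("Active environment:", active_environment())
```

A result of None suggests that pip would install into the system or base interpreter rather than into hc36.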

In the project root directory, run the following to install the required packages. Note that this command installs the packages within the activated virtual environment.

$ pip install -r requirements.txt

Note for macOS users: you may need to install the Xcode developer tools using xcode-select --install.

Running HoloClean

See the code in examples/holoclean_repair_example.py for a documented example of using HoloClean.

To run the example script:

$ cd examples
$ ./start_example.sh

Note that the script sets up the Python path needed to run HoloClean.
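The contents of start_example.sh are not reproduced here. As a rough illustration of what setting up the Python path means, the sketch below makes the project root importable from a script inside examples/. It assumes that the package lives one directory above examples/, and the helper name add_project_root is hypothetical.

```python
import os
import sys


def add_project_root(script_path, search_path=None):
    """Prepend the parent of the script's directory to the import path."""
    search_path = sys.path if search_path is None else search_path
    script_dir = os.path.dirname(os.path.abspath(script_path))
    # Assumption: the project root is the directory containing examples/.
    project_root = os.path.dirname(script_dir)
    if project_root not in search_path:
        search_path.insert(0, project_root)
    return project_root


if __name__ == "__main__":
    print("Project root added:", add_project_root(sys.argv[0]))
```

Using the provided shell script remains the documented way to run the example; a helper like this only matters when launching the example from another entry point.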
