
HoloClean / HoloClean-Legacy-deprecated

Licence: Apache-2.0 license
A Machine Learning System for Data Enrichment.



HoloClean: A Machine Learning System for Data Enrichment

HoloClean is built over Spark and PyTorch.


v0.1.1


HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. HoloClean saves data practitioners and scientists the enormous time they spend building piecemeal cleaning solutions. Instead, they can communicate their domain knowledge declaratively to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.

Installation

This section walks through the steps needed to install the packages and software required to run HoloClean.

1. Install Postgresql

1.1 Ubuntu Installation:

Install Postgres by running:

sudo apt-get install postgresql postgresql-contrib

1.2 Using Postgres on Ubuntu

To start postgres run:

sudo -u postgres psql

1.3 Mac Installation

Check out the following page to install Postgres for MacOS:

https://www.postgresql.org/download/macosx/

1.4 Setup Postgres for HoloClean

Create the database and user by running the following on the Postgres console:

CREATE DATABASE holo;
CREATE USER holocleanuser;
ALTER USER holocleanuser WITH PASSWORD 'abcd1234';
GRANT ALL PRIVILEGES ON DATABASE holo TO holocleanuser;
\c holo
ALTER SCHEMA public OWNER TO holocleanuser;
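The credentials created above can be collected into a single connection URI, which is a convenient way to check them with any PostgreSQL client. The following is a minimal sketch, not HoloClean's own configuration code: the function name `build_postgres_url` is hypothetical, and `localhost` and port 5432 (the PostgreSQL default) are assumptions about a local installation.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7


def build_postgres_url(user="holocleanuser", password="abcd1234",
                       database="holo", host="localhost", port=5432):
    """Return a connection URI for the database created in step 1.4.

    host and port are assumptions (a local server on the default port).
    User, password, and database name are percent-encoded so that
    characters such as '@' or '/' do not break the URI.
    """
    return "postgresql://{0}:{1}@{2}:{3}/{4}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
        quote(database, safe=""),
    )


if __name__ == "__main__":
    print(build_postgres_url())
```

The resulting string can be passed to psql as its first argument to confirm that the user and database were set up correctly.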

In general, to connect to the holo database run:

\c holo

HoloClean currently appends new tables to the holo database on every run. To clear the database, open psql as holocleanuser while connected to a database other than holo (PostgreSQL cannot drop the database you are currently connected to) and run:

drop database holo;
create database holo;

Alternatively, use the reset_database() function of the Holoclean class in holoclean/holoclean.py.

2. Install HoloClean Using Conda

2.1. Install Conda

2.1.1 Ubuntu:

For 32 bit machines, run:

wget https://repo.continuum.io/archive/Anaconda-2.3.0-Linux-x86.sh
bash Anaconda-2.3.0-Linux-x86.sh

For 64 bit machines, run:

wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86_64.sh
bash Anaconda-2.3.0-Linux-x86_64.sh

2.1.2 MacOS:

Follow the instructions on the Anaconda website to install Anaconda (not Miniconda) for MacOS.

2.2 Create new Conda environment

Open/Restart the terminal and create a Python 2.7 environment by running the command:

conda create -n py27Env python=2.7 anaconda

Then the environment can be activated by running:

source activate py27Env

Make sure to keep the environment activated for the rest of the installation process.

3. Install HoloClean Using Virtualenv

If you are already familiar with Virtualenv, create a new Python 2.7 environment with your preferred virtualenv wrapper.

Otherwise, continue with the instructions on installing Virtualenv below.

3.1 Install Virtualenv

Install Virtualenv following the instructions on its homepage. For example, to install it globally via pip:

$ [sudo] pip install virtualenv

3.2 Create a new Virtualenv environment

Create a new directory for your virtual environment with Python 2.7:

$ virtualenv --python=python2.7 py27Env

Here py27Env is the folder in which the virtual environment is stored, and python2.7 is a valid Python executable. From the directory where you ran the command above, activate the new py27Env environment with:

$ source py27Env/bin/activate

Make sure to keep the environment activated for the rest of the installation process.
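Whichever route you took (Conda or Virtualenv), it is worth confirming that the active interpreter really is Python 2.7 before installing packages. The following is a small sketch of such a check. The script and the helper name `is_supported` are illustrative and not part of HoloClean.

```python
import sys

# The installation steps above create a Python 2.7 environment.
REQUIRED = (2, 7)


def is_supported(version_info=None):
    """Return True when the interpreter's major.minor version is 2.7."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    found = "{0}.{1}".format(*sys.version_info[:2])
    if is_supported():
        print("OK: running Python " + found)
    else:
        print("Expected Python 2.7 but found " + found +
              "; is the py27Env environment activated?")
```

Run it with the `python` executable on your path. If it reports a different version, the environment is most likely not activated in the current shell.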

4. Installing Required Packages

Go to the repo's root directory and run:

pip install -r python-package-requirement.txt

5. Install JDK 8

5.1 For Ubuntu:

Check if you have JDK 8 installed by running:

java -version

If you do not have JDK 8, run the following command:

sudo apt-get install openjdk-8-jre

5.2 For MacOS:

Check if you have JDK 8 by running:

/usr/libexec/java_home -V

If you do not have JDK 8, download and install JDK 8 for MacOS from the Oracle website: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
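The version check can also be scripted. The sketch below parses the text printed by `java -version` and extracts the major version, so that 8 means Java 8. It relies on two facts about the `java` launcher: the version banner is written to stderr, and Java 8 and earlier report themselves as `1.x` (for example `1.8.0_292`). The function names are hypothetical and not part of HoloClean.

```python
import re
import subprocess

# Matches the quoted version in banners such as: openjdk version "1.8.0_292"
_VERSION_RE = re.compile(r'version "(\d+)(?:\.(\d+))?')


def java_major_version(version_output):
    """Extract the Java major version from `java -version` text, or None."""
    match = _VERSION_RE.search(version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier use the 1.x scheme, so the second number is the major.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major_version():
    """Run `java -version` and parse it; return None if java is unavailable."""
    try:
        # The banner goes to stderr, so fold stderr into the captured output.
        output = subprocess.check_output(["java", "-version"],
                                         stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        return None
    return java_major_version(output.decode("utf-8", "replace"))


if __name__ == "__main__":
    print(installed_java_major_version())
```

A result of 8 means a Java 8 runtime is first on the path. `None` means that `java` could not be run or that its banner was not recognized.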

6. Install Spark (MacOS only)

To install Spark on MacOS run

brew install apache-spark

After installing Spark, add a SPARK_HOME environment variable to your shell, and add /usr/local/Cellar/apache-spark/<version>/libexec/python to your Python path (PYTHONPATH).
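If you prefer not to edit your shell profile, the same configuration can be applied from Python before importing pyspark. This is a sketch under stated assumptions: it takes SPARK_HOME to be the `libexec` directory of the Homebrew install (the parent of the `python` folder named above), and it assumes the Spark distribution bundles py4j as a zip under `python/lib`. The helper names `spark_python_paths` and `configure_spark` are hypothetical.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directories and archives pyspark needs on the Python path."""
    python_dir = os.path.join(spark_home, "python")
    # Assumption: py4j ships as python/lib/py4j-<version>-src.zip.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib",
                                              "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def configure_spark(spark_home, path=None, environ=None):
    """Set SPARK_HOME and append Spark's Python paths if they are missing.

    path and environ default to sys.path and os.environ. Passing them
    explicitly makes the function easy to try without side effects.
    """
    path = sys.path if path is None else path
    environ = os.environ if environ is None else environ
    environ["SPARK_HOME"] = spark_home
    for entry in spark_python_paths(spark_home):
        if entry not in path:
            path.append(entry)
    return path
```

Call `configure_spark("/usr/local/Cellar/apache-spark/<version>/libexec")` with your installed version substituted, and do so before the first `import pyspark`.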

7. Getting Started

The following tutorials in the tutorial directory will get you familiar with the HoloClean framework:

Data Loading & Denial Constraints Tutorial
Complete Pipeline
Error Detection

To run the tutorials in Jupyter Notebook, go to the repo's root directory in the terminal and run:

./start_notebook.sh

Developing

Installation

Follow the steps from Installation to configure your development environment.

Running Unit Tests

To run the unit tests:

$ cd tests/unit_tests
$ python unittest_dcfeaturizer.py 
2018-04-05 15:15:22 WARN  Utils:66 - Your hostname, apollo resolves to a loopback address: 127.0.1.1; using 192.168.0.66 instead (on interface wlan0)
2018-04-05 15:15:22 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-04-05 15:15:23 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Time to Load Data: 11.9292609692

Time for Error Detection: 14.7168970108

.
----------------------------------------------------------------------
Ran 1 test in 28.680s

OK
$
$ python unittest_sql_dcerrordetector.py 
2018-04-05 15:16:28 WARN  Utils:66 - Your hostname, apollo resolves to a loopback address: 127.0.1.1; using 192.168.0.66 instead (on interface wlan0)
2018-04-05 15:16:28 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-04-05 15:16:29 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Time to Load Data: 12.6399619579

Time for Error Detection: 14.1602239609

.Time to Load Data: 1.38744020462

Time for Error Detection: 8.26235389709

.Time to Load Data: 0.998204946518

Time for Error Detection: 8.1832909584

.Time to Load Data: 1.46859908104

Time for Error Detection: 6.7251560688

.
----------------------------------------------------------------------
Ran 4 tests in 62.365s

OK

Running Integration Tests

To run the integration tests:

cd tests
python test.py

A successful test run looks like:

<output>
Time for Test Featurization: 3.3679060936

Time for Inference: 0.249126911163

The multiple-values precision that we have is :0.998899284535
The multiple-values recall that we have is :0.972972972973 out of 185
The single-value precision that we have is :1.0
The single-value recall that we have is :1.0 out of 0
The precision that we have is :0.999022801303
The recall that we have is :0.972972972973 out of 185
Execution finished
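To help read the numbers above, here is a sketch of how precision and recall are conventionally defined for data repair: precision is the fraction of performed repairs that are correct, and recall is the fraction of known errors that were correctly repaired. That test.py uses exactly these definitions is an assumption. The count of 180 correct repairs in the example is inferred from the reported recall (0.972972972973 of 185) and is not printed by the test. Returning 1.0 for an empty denominator mirrors the "1.0 out of 0" lines in the output.

```python
def precision(correct_repairs, total_repairs):
    """Fraction of performed repairs that are correct (1.0 if none were made)."""
    if total_repairs == 0:
        return 1.0
    return float(correct_repairs) / total_repairs


def recall(correct_repairs, total_errors):
    """Fraction of known errors that were correctly repaired (1.0 if none exist)."""
    if total_errors == 0:
        return 1.0
    return float(correct_repairs) / total_errors


if __name__ == "__main__":
    # 180 is an inferred count (see the note above), 185 comes from the output.
    print(recall(180, 185))
    # Mirrors the single-value case, where there were no errors to repair.
    print(recall(0, 0))
```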