facultyai / Boltzmannclean

Licence: apache-2.0
Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines

boltzmannclean

Fill missing values in a pandas DataFrame using a Restricted Boltzmann Machine.

This package provides a class implementing the scikit-learn transformer interface for creating and training a Restricted Boltzmann Machine. The trained machine can then be sampled to fill in missing values in the training data or in new data of the same format. Utility functions for applying the transformations to a pandas DataFrame are also provided, with the option to treat each column as either a continuous numerical or a categorical feature.

Installation

.. code-block:: bash

    pip install boltzmannclean

Usage

To fill in missing values from a DataFrame with the minimum of fuss, a cleaning function is provided.

.. code-block:: python

    import boltzmannclean

    my_clean_dataframe = boltzmannclean.clean(
        dataframe=my_dataframe,
        numerical_columns=['Height', 'Weight'],
        categorical_columns=['Colour', 'Shape'],
        tune_rbm=True  # tune RBM hyperparameters for my data
    )

To create and use the underlying scikit-learn transformer directly:

.. code-block:: python

    my_rbm = boltzmannclean.RestrictedBoltzmannMachine(
        n_hidden=100, learn_rate=0.01,
        batchsize=10, dropout_fraction=0.5, max_epochs=1,
        adagrad=True
    )

    my_rbm.fit_transform(a_numpy_array)

The values shown above are the default RBM hyperparameters. The NumPy array passed in is expected to consist entirely of numbers in the range [0, 1] or missing values (np.nan or None). The hyperparameters are:

  • n_hidden: the size of the hidden layer
  • learn_rate: learning rate for stochastic gradient descent
  • batchsize: mini-batch size for stochastic gradient descent
  • dropout_fraction: fraction of hidden nodes to be dropped out on each backward pass during training
  • max_epochs: maximum number of passes over the training data
  • adagrad: whether to use the Adagrad update rules for stochastic gradient descent
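
Because the transformer expects values in [0, 1] or missing entries, raw numerical data has to be rescaled first. The following is a minimal sketch of one way to do that with per-column min-max scaling. The helper name ``scale_to_unit_interval`` and the sample array are hypothetical and not part of boltzmannclean, and the library's own DataFrame utilities may scale data differently.

.. code-block:: python

    import numpy as np

    def scale_to_unit_interval(values):
        """Min-max scale each column to [0, 1], leaving NaN entries as NaN."""
        values = np.asarray(values, dtype=float)
        col_min = np.nanmin(values, axis=0)
        col_max = np.nanmax(values, axis=0)
        # Use a span of 1 for constant columns to avoid dividing by zero.
        span = np.where(col_max > col_min, col_max - col_min, 1.0)
        return (values - col_min) / span

    # Hypothetical sample data with two missing entries.
    raw = np.array([
        [5.1, 3.5],
        [4.9, np.nan],
        [np.nan, 3.0],
        [6.3, 3.3],
    ])
    scaled = scale_to_unit_interval(raw)

    # With boltzmannclean installed, the scaled array could then be passed on:
    # my_rbm = boltzmannclean.RestrictedBoltzmannMachine()
    # my_rbm.fit_transform(scaled)

Missing entries stay as NaN after scaling, so they remain marked for the RBM to fill in.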

Example

.. code-block:: python

    import boltzmannclean
    import numpy as np
    import pandas as pd
    from sklearn import datasets

    iris = datasets.load_iris()

    df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
    df_iris['target'] = pd.Series(iris.target, dtype=str)

    df_iris.head()

= ================= ================ ================= ================ ======
_ sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
= ================= ================ ================= ================ ======
0 5.1               3.5              1.4               0.2              0
1 4.9               3.0              1.4               0.2              0
2 4.7               3.2              1.3               0.2              0
3 4.6               3.1              1.5               0.2              0
4 5.0               3.6              1.4               0.2              0
= ================= ================ ================= ================ ======

Add some noise:

.. code-block:: python

    noise = [(0, 1), (2, 0), (0, 4)]

    for noisy in noise:
        df_iris.iloc[noisy] = None

    df_iris.head()

= ================= ================ ================= ================ ======
_ sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
= ================= ================ ================= ================ ======
0 5.1               NaN              1.4               0.2              None
1 4.9               3.0              1.4               0.2              0
2 NaN               3.2              1.3               0.2              0
3 4.6               3.1              1.5               0.2              0
4 5.0               3.6              1.4               0.2              0
= ================= ================ ================= ================ ======
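
Before cleaning, it can help to confirm which cells are actually missing. The sketch below uses a small hypothetical three-column frame as a stand-in for the noisy iris data, and relies only on standard pandas and NumPy calls.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Toy stand-in for the noisy frame: NaN in numerical columns, None in a string column.
    df = pd.DataFrame({
        "sepal length (cm)": [5.1, 4.9, np.nan],
        "sepal width (cm)": [np.nan, 3.0, 3.2],
        "target": [None, "0", "0"],
    })

    # isna() treats both NaN and None as missing.
    missing_per_column = df.isna().sum()

    # (row, column) integer positions of every missing cell, in row-major order.
    missing_positions = [
        (int(row), int(col)) for row, col in np.argwhere(df.isna().to_numpy())
    ]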

Clean the DataFrame:

.. code-block:: python

    df_iris_cleaned = boltzmannclean.clean(
        dataframe=df_iris,
        numerical_columns=[
            'sepal length (cm)', 'sepal width (cm)',
            'petal length (cm)', 'petal width (cm)'
        ],
        categorical_columns=['target'],
        tune_rbm=True
    )

    df_iris_cleaned.round(1).head()

= ================= ================ ================= ================ ======
_ sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
= ================= ================ ================= ================ ======
0 5.1               3.3              1.4               0.2              0
1 4.9               3.0              1.4               0.2              0
2 6.3               3.2              1.3               0.2              0
3 4.6               3.1              1.5               0.2              0
4 5.0               3.6              1.4               0.2              0
= ================= ================ ================= ================ ======

The larger and more correlated the dataset is, the better the imputed values will be.
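
One way to judge imputation quality on a given dataset is to hide cells whose true values are known, impute them, and measure the error on just those cells. The sketch below illustrates the idea with synthetic data. The helper ``masked_rmse`` is hypothetical, and a simple column-mean imputer stands in for the RBM so that the example runs without boltzmannclean. In practice the ``imputed`` array would come from the cleaning step instead.

.. code-block:: python

    import numpy as np

    def masked_rmse(truth, imputed, mask):
        """Root-mean-square error over only the cells that were hidden."""
        truth = np.asarray(truth, dtype=float)
        imputed = np.asarray(imputed, dtype=float)
        diff = imputed[mask] - truth[mask]
        return float(np.sqrt(np.mean(diff ** 2)))

    rng = np.random.default_rng(0)
    truth = rng.random((200, 4))            # synthetic data already in [0, 1]
    mask = rng.random(truth.shape) < 0.1    # hide roughly 10% of the cells

    observed = truth.copy()
    observed[mask] = np.nan

    # Placeholder imputer (NOT the RBM): fill each hidden cell with its column mean.
    col_means = np.nanmean(observed, axis=0)
    imputed = np.where(mask, col_means, observed)

    score = masked_rmse(truth, imputed, mask)

Comparing this score across imputers, or across datasets of different size and correlation, gives a concrete check of the claim above for the data at hand.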
