

boltzmannclean

Fill missing values in a pandas DataFrame using a Restricted Boltzmann Machine.

Provides a class implementing the scikit-learn transformer interface for creating and training a Restricted Boltzmann Machine (RBM). The trained RBM can then be sampled to fill in missing values in the training data or in new data of the same format. Utility functions for applying the transformation to a pandas DataFrame are provided, with the option to treat each column as either a continuous numerical or a categorical feature.
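
How the library encodes categorical columns internally is not described here. As an illustration only, one common way to present categories to a model that works on values between 0 and 1 is one-hot encoding, with a missing label kept as a row of NaN so that it can be imputed later. The function name ``one_hot_with_missing`` is hypothetical and not part of boltzmannclean.

```python
import numpy as np


def one_hot_with_missing(labels):
    """One-hot encode labels; a missing label (None) becomes a row of NaN.

    Hypothetical helper for illustration, not part of boltzmannclean.
    """
    # Fix a stable column order from the labels that are actually present.
    categories = sorted({label for label in labels if label is not None})
    index = {category: i for i, category in enumerate(categories)}

    encoded = np.zeros((len(labels), len(categories)), dtype=float)
    for row, label in enumerate(labels):
        if label is None:
            encoded[row, :] = np.nan        # unknown category: leave every slot missing
        else:
            encoded[row, index[label]] = 1.0
    return encoded, categories


encoded, categories = one_hot_with_missing(['red', None, 'blue', 'red'])
```

After imputation, a row of this kind could be mapped back to a label by taking the category with the largest value.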

Installation

.. code-block:: bash

    pip install boltzmannclean

Usage

To fill in missing values from a DataFrame with the minimum of fuss, a cleaning function is provided.

.. code-block:: python

    import boltzmannclean

    my_clean_dataframe = boltzmannclean.clean(
        dataframe=my_dataframe,
        numerical_columns=['Height', 'Weight'],
        categorical_columns=['Colour', 'Shape'],
        tune_rbm=True  # tune RBM hyperparameters for my data
    )

To create and use the underlying scikit-learn transformer directly:

.. code-block:: python

    import boltzmannclean

    my_rbm = boltzmannclean.RestrictedBoltzmannMachine(
        n_hidden=100, learn_rate=0.01,
        batchsize=10, dropout_fraction=0.5, max_epochs=1,
        adagrad=True
    )

    # a_numpy_array: values in [0, 1], with np.nan/None marking missing entries
    my_rbm.fit_transform(a_numpy_array)

The values shown above are the default RBM hyperparameters. The NumPy array passed in is expected to contain only numbers in the range [0, 1] or missing values (``np.nan``/``None``). The hyperparameters are:

- ``n_hidden``: the size of the hidden layer
- ``learn_rate``: learning rate for stochastic gradient descent
- ``batchsize``: mini-batch size for stochastic gradient descent
- ``dropout_fraction``: fraction of hidden nodes to be dropped out on each backward pass during training
- ``max_epochs``: maximum number of passes over the training data
- ``adagrad``: whether to use the Adagrad update rules for stochastic gradient descent
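
When the transformer is used directly, the input has to be brought into the [0, 1] range first. A minimal sketch of one way to do that is per-column min-max scaling that leaves missing entries as NaN. The helper names ``scale_to_unit_interval`` and ``unscale`` are hypothetical and not part of boltzmannclean, and the library's own ``clean`` function may prepare data differently.

```python
import numpy as np


def scale_to_unit_interval(values):
    """Min-max scale each column to [0, 1], leaving NaN entries as NaN."""
    values = np.asarray(values, dtype=float)
    col_min = np.nanmin(values, axis=0)   # column minima, ignoring NaN
    col_max = np.nanmax(values, axis=0)   # column maxima, ignoring NaN
    # Guard against constant columns, which would otherwise divide by zero.
    col_range = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (values - col_min) / col_range, col_min, col_range


def unscale(scaled, col_min, col_range):
    """Map values from [0, 1] back to the original units."""
    return np.asarray(scaled, dtype=float) * col_range + col_min


raw = np.array([
    [5.1, np.nan],
    [4.9, 3.0],
    [np.nan, 3.2],
    [4.6, 3.1],
])
scaled, col_min, col_range = scale_to_unit_interval(raw)
```

Keeping ``col_min`` and ``col_range`` allows imputed values to be converted back to the original units with ``unscale`` afterwards.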

Example

.. code-block:: python

    import boltzmannclean
    import numpy as np
    import pandas as pd
    from sklearn import datasets

    iris = datasets.load_iris()

    df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
    df_iris['target'] = pd.Series(iris.target, dtype=str)

    df_iris.head()

=  =================  ================  =================  ================  ======
_  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
=  =================  ================  =================  ================  ======
0  5.1                3.5               1.4                0.2               0
1  4.9                3.0               1.4                0.2               0
2  4.7                3.2               1.3                0.2               0
3  4.6                3.1               1.5                0.2               0
4  5.0                3.6               1.4                0.2               0
=  =================  ================  =================  ================  ======

Add some noise:

.. code-block:: python

    # (row, column) positions whose values are blanked out
    noise = [(0, 1), (2, 0), (0, 4)]

    for noisy in noise:
        df_iris.iloc[noisy] = None

    df_iris.head()

=  =================  ================  =================  ================  ======
_  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
=  =================  ================  =================  ================  ======
0  5.1                NaN               1.4                0.2               None
1  4.9                3.0               1.4                0.2               0
2  NaN                3.2               1.3                0.2               0
3  4.6                3.1               1.5                0.2               0
4  5.0                3.6               1.4                0.2               0
=  =================  ================  =================  ================  ======

Clean the DataFrame:

.. code-block:: python

    df_iris_cleaned = boltzmannclean.clean(
        dataframe=df_iris,
        numerical_columns=[
            'sepal length (cm)', 'sepal width (cm)',
            'petal length (cm)', 'petal width (cm)'
        ],
        categorical_columns=['target'],
        tune_rbm=True
    )

    df_iris_cleaned.round(1).head()

=  =================  ================  =================  ================  ======
_  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
=  =================  ================  =================  ================  ======
0  5.1                3.3               1.4                0.2               0
1  4.9                3.0               1.4                0.2               0
2  6.3                3.2               1.3                0.2               0
3  4.6                3.1               1.5                0.2               0
4  5.0                3.6               1.4                0.2               0
=  =================  ================  =================  ================  ======

The larger and more correlated the dataset is, the better the imputed values will be.
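
One way to check imputation quality on a given dataset is to hide values that are actually known, impute them, and measure the error on just those cells. The sketch below is an assumption-laden illustration: ``masked_rmse`` and ``column_mean_impute`` are hypothetical helpers, the data is random, and the column-mean imputer is only a stand-in so that the example runs without boltzmannclean. Any imputer that returns a filled array could be scored the same way.

```python
import numpy as np


def masked_rmse(truth, imputed, mask):
    """Root-mean-square error over the cells selected by the boolean mask."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    diff = imputed[mask] - truth[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def column_mean_impute(values):
    """Stand-in imputer: replace each NaN with the mean of its column."""
    values = np.array(values, dtype=float)      # work on a copy
    col_means = np.nanmean(values, axis=0)
    rows, cols = np.nonzero(np.isnan(values))
    values[rows, cols] = col_means[cols]
    return values


rng = np.random.default_rng(0)
truth = rng.random((50, 3))                     # synthetic data already in [0, 1]

# Hide five known cells in column 1.
mask = np.zeros(truth.shape, dtype=bool)
mask[rng.choice(50, size=5, replace=False), 1] = True

corrupted = truth.copy()
corrupted[mask] = np.nan

error = masked_rmse(truth, column_mean_impute(corrupted), mask)
```

Repeating this with different dataset sizes or column subsets gives a rough picture of how the error changes with the amount and correlation of the data.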
