rhiever / Datacleaner

Licence: mit

A Python tool that automatically cleans data sets and readies them for analysis.

datacleaner

datacleaner is not magic

datacleaner works with data in pandas DataFrames.

datacleaner is not magic, and it won't take an unorganized blob of text and automagically parse it out for you.

What datacleaner will do is save you a ton of time encoding and cleaning your data once it's already in a format that pandas DataFrames can handle.

Currently, datacleaner does the following:

Optionally drops any row with a missing value
Replaces missing values with the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis
Encodes non-numerical variables (e.g., categorical variables with strings) with numerical equivalents
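
The three steps listed above can be approximated in plain pandas. The sketch below is an illustration only, not datacleaner's actual implementation: the function name `sketch_clean`, the sample columns, and the use of `pandas.factorize` for the encoding step are all assumptions.

```python
import pandas as pd


def sketch_clean(df, drop_nans=False):
    """Hypothetical helper approximating the steps listed above."""
    df = df.copy()
    if drop_nans:
        # Step 1 (optional): drop every row that has a missing value.
        df = df.dropna()
    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            # Step 2a: continuous column, fill gaps with the median.
            df[column] = df[column].fillna(df[column].median())
        else:
            # Step 2b: categorical column, fill gaps with the mode.
            # (Assumes the column has at least one non-missing value.)
            df[column] = df[column].fillna(df[column].mode().iloc[0])
            # Step 3: replace each distinct value with an integer code.
            codes, _ = pd.factorize(df[column], sort=True)
            df[column] = codes
    return df


# Made-up sample data with one missing value per column.
raw = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "color": ["red", "blue", None, "red"],
})
clean = sketch_clean(raw)
print(clean)
```

Here the missing age is filled with the column median and the missing color with the most frequent value before the strings are replaced by integer codes.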

We plan to add more cleaning features as the project grows.

License

Please see the repository license for the licensing and usage information for datacleaner.

Generally, we have licensed datacleaner to make it as widely usable as possible.

Installation

datacleaner is built to use pandas DataFrames and some scikit-learn modules for data preprocessing. As such, we recommend installing the Anaconda Python distribution prior to installing datacleaner.

Once the prerequisites are installed, datacleaner can be installed with a simple pip command:

pip install datacleaner

Usage

datacleaner on the command line

datacleaner can be used on the command line. Use --help to see its usage instructions.

usage: datacleaner [-h] [-cv CROSS_VAL_FILENAME] [-o OUTPUT_FILENAME]
                   [-cvo CV_OUTPUT_FILENAME] [-is INPUT_SEPARATOR]
                   [-os OUTPUT_SEPARATOR] [--drop-nans]
                   [--ignore-update-check] [--version]
                   INPUT_FILENAME

A Python tool that automatically cleans data sets and readies them for analysis

positional arguments:
  INPUT_FILENAME        File name of the data file to clean

optional arguments:
  -h, --help            show this help message and exit
  -cv CROSS_VAL_FILENAME
                        File name for the validation data set if performing
                        cross-validation
  -o OUTPUT_FILENAME    Data file to output the cleaned data set to
  -cvo CV_OUTPUT_FILENAME
                        Data file to output the cleaned cross-validation data
                        set to
  -is INPUT_SEPARATOR   Column separator for the input file(s) (default: \t)
  -os OUTPUT_SEPARATOR  Column separator for the output file(s) (default: \t)
  --drop-nans           Drop all rows that have a NaN in any column (default: False)
  --ignore-update-check
                        Do not check for the latest version of datacleaner
                        (default: False)
  --version             show program's version number and exit

An example command-line call to datacleaner may look like:

datacleaner my_data.csv -o my_clean.data.csv -is , -os ,

which will read the data from my_data.csv (assuming columns are separated by commas), clean the data set, then output the resulting data set to my_clean.data.csv.
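
To try the command above without real data, a small comma-separated file with a few gaps can be generated first. This is a minimal sketch using only the standard library; the column names and values are made up for illustration, and the empty fields are what pandas would typically read as missing values.

```python
import csv

# Hypothetical sample rows; the empty strings stand in for missing values.
rows = [
    ["age", "color"],
    ["25", "red"],
    ["", "blue"],
    ["40", ""],
]


def write_sample(path="my_data.csv"):
    """Write the sample rows to a comma-separated file and return its path."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter=",").writerows(rows)
    return path


write_sample()
```

Running the datacleaner command shown above against this file should then produce my_clean.data.csv in the same directory.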

datacleaner in scripts

datacleaner can also be used as part of a script. There are two primary functions implemented in datacleaner: autoclean and autoclean_cv.

autoclean(input_dataframe, drop_nans=False, copy=False, encoder=None, encoder_kwargs=None, ignore_update_check=False)
    Performs a series of automated data cleaning transformations on the provided data set
    
    Parameters
    ----------
    input_dataframe: pandas.DataFrame
        Data set to clean
    drop_nans: bool
        Drop all rows that have a NaN in any column (default: False)
    copy: bool
        Make a copy of the data set (default: False) 
    encoder: category_encoders transformer
        A valid category_encoders transformer, which is passed an inferred cols list (default: None, which uses LabelEncoder)
    encoder_kwargs: dict
        Keyword arguments passed to the encoder (default: None)
    ignore_update_check: bool
        Do not check for the latest version of datacleaner

    Returns
    ----------
    output_dataframe: pandas.DataFrame
        Cleaned data set

autoclean_cv(training_dataframe, testing_dataframe, drop_nans=False, copy=False, encoder=None, encoder_kwargs=None, ignore_update_check=False)
    Performs a series of automated data cleaning transformations on the provided training and testing data sets
    
    Unlike `autoclean()`, this function takes cross-validation into account by learning the data transformations
    from only the training set, then applying those transformations to both the training and testing set.
    By doing so, this function prevents information from the testing set from leaking into the learned transformations.
    
    Parameters
    ----------
    training_dataframe: pandas.DataFrame
        Training data set
    testing_dataframe: pandas.DataFrame
        Testing data set
    drop_nans: bool
        Drop all rows that have a NaN in any column (default: False)
    copy: bool
        Make a copy of the data set (default: False)  
    encoder: category_encoders transformer
        A valid category_encoders transformer, which is passed an inferred cols list (default: None, which uses LabelEncoder)
    encoder_kwargs: dict
        Keyword arguments passed to the encoder (default: None)
    ignore_update_check: bool
        Do not check for the latest version of datacleaner

    Returns
    ----------
    output_training_dataframe: pandas.DataFrame
        Cleaned training data set
    output_testing_dataframe: pandas.DataFrame
        Cleaned testing data set
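
The train-only idea described in the autoclean_cv docstring can be sketched in plain pandas. This is an illustration of the principle, not datacleaner's code: the helper name `sketch_fill_from_training` and the sample values are assumptions, and only median imputation is shown.

```python
import pandas as pd


def sketch_fill_from_training(train, test):
    """Hypothetical helper: learn medians on train, apply them to both sets."""
    # Learn the fill values from the training set only.
    medians = train.median(numeric_only=True)
    # Apply the same learned values to the training and testing sets.
    return train.fillna(medians), test.fillna(medians)


# Made-up data: the testing set contains an extreme value (100.0).
train = pd.DataFrame({"x": [1.0, 3.0, None, 5.0]})
test = pd.DataFrame({"x": [None, 100.0]})

clean_train, clean_test = sketch_fill_from_training(train, test)
print(clean_train)
print(clean_test)
```

The missing testing value is filled with the training median, so the extreme testing value never influences how either set is cleaned.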

Below is an example of datacleaner performing basic cleaning on a data set.

from datacleaner import autoclean
import pandas as pd

my_data = pd.read_csv('my_data.csv', sep=',')
my_clean_data = autoclean(my_data)
my_clean_data.to_csv('my_clean_data.csv', sep=',', index=False)

Note that because datacleaner works directly on pandas DataFrames, all DataFrame operations are still available to the resulting data sets.

Contributing to datacleaner

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to datacleaner, please file a new issue so we can discuss it.

Citing datacleaner

If you use datacleaner as part of your workflow in a scientific publication, please consider citing the datacleaner repository with the following DOI:
