
dirty-cat / Dirty_cat

Licence: bsd-3-clause
Encoding methods for dirty categorical variables

Programming Languages

python

Projects that are alternatives to or similar to dirty_cat

Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+280.69%)
Mutual labels:  data-science, data-cleaning
Boltzmannclean
Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines
Stars: ✭ 23 (-91.12%)
Mutual labels:  data-science, data-cleaning
Pandas Videos
Jupyter notebook and datasets from the pandas Q&A video series
Stars: ✭ 1,716 (+562.55%)
Mutual labels:  data-science, data-cleaning
Janitor
simple tools for data cleaning in R
Stars: ✭ 981 (+278.76%)
Mutual labels:  data-science, data-cleaning
My Journey In The Data Science World
📢 Ready to learn or review your knowledge!
Stars: ✭ 1,175 (+353.67%)
Mutual labels:  data-science, data-cleaning
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+485.33%)
Mutual labels:  data-science, data-cleaning
Klib
Easy to use Python library of customized functions for cleaning and analyzing data.
Stars: ✭ 192 (-25.87%)
Mutual labels:  data-science, data-cleaning
Cleaner.jl
A toolbox of simple solutions for common data cleaning problems.
Stars: ✭ 21 (-91.89%)
Mutual labels:  data-cleaning
allie
🤖 A machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers).
Stars: ✭ 93 (-64.09%)
Mutual labels:  data-cleaning
HoloClean-Legacy-deprecated
A Machine Learning System for Data Enrichment.
Stars: ✭ 75 (-71.04%)
Mutual labels:  data-cleaning
R-Learning-Journey
Some of the projects i made when starting to learn R for Data Science at the university
Stars: ✭ 19 (-92.66%)
Mutual labels:  data-cleaning
exemplary-ml-pipeline
Exemplary, annotated machine learning pipeline for any tabular data problem.
Stars: ✭ 23 (-91.12%)
Mutual labels:  data-cleaning
nepali-translator
Neural Machine Translation on the Nepali-English language pair
Stars: ✭ 29 (-88.8%)
Mutual labels:  data-cleaning
FIFA-2019-Analysis
This is a project based on the FIFA World Cup 2019 and Analyzes the Performance and Efficiency of Teams, Players, Countries and other related things using Data Analysis and Data Visualizations
Stars: ✭ 28 (-89.19%)
Mutual labels:  data-cleaning
Sk Dist
Distributed scikit-learn meta-estimators in PySpark
Stars: ✭ 260 (+0.39%)
Mutual labels:  data-science
optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Stars: ✭ 1,351 (+421.62%)
Mutual labels:  data-cleaning
Dowhy
DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
Stars: ✭ 3,480 (+1243.63%)
Mutual labels:  data-science
Atlas
An Open Source, Self-Hosted Platform For Applied Deep Learning Development
Stars: ✭ 259 (+0%)
Mutual labels:  data-science
OpenRefine-ecology-lesson
Data Cleaning with OpenRefine for Ecologists
Stars: ✭ 20 (-92.28%)
Mutual labels:  data-cleaning
bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Stars: ✭ 120 (-53.67%)
Mutual labels:  data-cleaning

dirty_cat
=========

dirty_cat is a Python module for machine learning on dirty categorical variables.

Website: https://dirty-cat.github.io/

For a detailed description of the problem of encoding dirty categorical data, see `Similarity encoding for learning with dirty categorical variables <https://hal.inria.fr/hal-01806175>`_ [1]_ and `Encoding high-cardinality string categorical variables <https://hal.inria.fr/hal-02171256v4>`_ [2]_.
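
To make the idea of similarity encoding concrete, here is a minimal pure-Python sketch. It is not dirty_cat's implementation: the helper names (``ngrams``, ``ngram_similarity``, ``similarity_encode``) are hypothetical, and the similarity used is a simplified set-based overlap of character 3-grams, chosen only for illustration. Each value is encoded as a vector of similarities to a list of reference categories, so misspelled variants land close to the category they resemble.

```python
def ngrams(s, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Shared n-grams divided by all n-grams of the two strings (set-based)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two strings too short to have n-grams are treated as identical
    return len(ga & gb) / len(ga | gb)


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarities to every prototype category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Illustrative data (made up for this sketch): two clean categories, dirty inputs.
prototypes = ["police officer", "fire fighter"]
values = ["police officer", "Police Oficer", "firefighter"]

encoded = similarity_encode(values, prototypes)
for value, row in zip(values, encoded):
    print(f"{value!r:>18} -> {[round(x, 2) for x in row]}")
```

Unlike one-hot encoding, which would give each misspelling its own unrelated column, the rows here vary smoothly with string overlap.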

Installation
------------

Dependencies
~~~~~~~~~~~~

dirty_cat requires:

- Python (>= 3.6)
- NumPy (>= 1.8.2)
- SciPy (>= 1.0.1)
- scikit-learn (>= 0.20.0)

Optional dependency:

- python-Levenshtein for faster edit distances (not used for the n-gram
  distance)
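
Because python-Levenshtein is optional, a script may want to check whether it is importable before relying on the faster edit distances. The sketch below assumes the package's import name is ``Levenshtein``; ``has_module`` is a hypothetical helper and not part of dirty_cat.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


# Report whether the optional dependency is present in this environment.
print("python-Levenshtein available:", has_module("Levenshtein"))
```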

User installation
~~~~~~~~~~~~~~~~~

If you already have a working installation of NumPy and SciPy, the easiest way to install dirty_cat is using ``pip``::

    pip install -U --user dirty_cat

Other implementations
~~~~~~~~~~~~~~~~~~~~~

-  Spark ML: https://github.com/rakutentech/spark-dirty-cat


References
~~~~~~~~~~

.. [1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.
.. [2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.