
dirty-cat / Dirty_cat

Licence: bsd-3-clause

Encoding methods for dirty categorical variables


Labels

machine-learning data-science data-cleaning


dirty_cat
=========

dirty_cat is a Python module for machine learning on dirty categorical variables.

Website: https://dirty-cat.github.io/

For a detailed description of the problem of encoding dirty categorical data, see `Similarity encoding for learning with dirty categorical variables <https://hal.inria.fr/hal-01806175>`_ [1]_ and `Encoding high-cardinality string categorical variables <https://hal.inria.fr/hal-02171256v4>`_ [2]_.
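To make the idea of similarity encoding concrete, here is a minimal, self-contained sketch. It is not dirty_cat's implementation: the function names, the space padding, and the use of Jaccard similarity over character 3-gram sets are assumptions chosen for illustration, and the job titles are made-up data. Each category value is represented by its similarity to a small set of reference values (prototypes), so that a misspelled entry lands close to its clean counterpart instead of becoming an unrelated one-hot column.

```python
from typing import List, Sequence, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of `text` (lowercased, space-padded)."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both strings are too short to produce any n-gram.
        return float(a.lower() == b.lower())
    return len(grams_a & grams_b) / len(union)


def similarity_encode(
    values: Sequence[str], prototypes: Sequence[str], n: int = 3
) -> List[List[float]]:
    """Encode each value as its vector of similarities to the prototypes."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


if __name__ == "__main__":
    # Hypothetical dirty column: the same job title entered three ways.
    prototypes = ["police officer", "firefighter"]
    dirty_values = ["Police Officer", "polce officer", "fire fighter"]
    for value, row in zip(dirty_values, similarity_encode(dirty_values, prototypes)):
        print(f"{value!r}: {[round(s, 2) for s in row]}")
```

The resulting rows can be fed to a downstream estimator as ordinary numeric features; the choice of prototypes and of n is left open here.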

Installation
------------

Dependencies
~~~~~~~~~~~~

dirty_cat requires:

- Python (>= 3.6)
- NumPy (>= 1.8.2)
- SciPy (>= 1.0.1)
- scikit-learn (>= 0.20.0)
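
The minimum versions listed above can be checked against the current environment before installing. The following is a rough sketch, not part of dirty_cat: the helper names are hypothetical, the version parsing only looks at leading numeric components, and it relies on ``importlib.metadata``, which assumes Python 3.8 or later (newer than the 3.6 minimum stated above).

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions copied from the dependency list above.
MINIMUM_VERSIONS = {
    "numpy": "1.8.2",
    "scipy": "1.0.1",
    "scikit-learn": "0.20.0",
}


def version_tuple(version: str) -> Tuple[int, ...]:
    """Keep the leading numeric components, e.g. '1.21.0rc1' -> (1, 21, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two version strings numerically, padding the shorter with zeros."""
    left, right = version_tuple(installed), version_tuple(minimum)
    width = max(len(left), len(right))
    left += (0,) * (width - len(left))
    right += (0,) * (width - len(right))
    return left >= right


def check_dependencies() -> Dict[str, Optional[str]]:
    """Return the installed version of each dependency, or None if missing."""
    report: Dict[str, Optional[str]] = {}
    for name, minimum in MINIMUM_VERSIONS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = installed
        if installed is None:
            print(f"{name}: not installed (need >= {minimum})")
        elif meets_minimum(installed, minimum):
            print(f"{name}: {installed} (ok)")
        else:
            print(f"{name}: {installed} is older than {minimum}")
    return report


if __name__ == "__main__":
    check_dependencies()
```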

Optional dependency:

- python-Levenshtein for faster edit distances (not used for the n-gram
  distance)
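
For readers unfamiliar with edit distances, the sketch below shows the kind of quantity the optional package accelerates. It is a plain-Python illustration of the standard Levenshtein dynamic program, written for this note; it is neither python-Levenshtein's code nor dirty_cat's, and the function name is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b`."""
    # Iterate over the longer string so the stored row is the shorter one.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

This runs in time proportional to the product of the two string lengths, which is why a compiled implementation helps when many pairs of category values are compared.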

User installation
~~~~~~~~~~~~~~~~~

If you already have a working installation of NumPy and SciPy, the easiest way to install dirty_cat is with ``pip``::

    pip install -U --user dirty_cat

Other implementations
~~~~~~~~~~~~~~~~~~~~~

-  Spark ML: https://github.com/rakutentech/spark-dirty-cat


References
~~~~~~~~~~

.. [1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.
.. [2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.

