joeddav / get_smarties

Licence: MIT license

Dummy variable generation with fit/transform capabilities

Programming Languages

python

Jupyter Notebook

Projects that are alternatives of or similar to get smarties

trt pose hand

Real-time hand pose estimation and gesture classification using TensorRT

Stars: ✭ 137 (+495.65%)

Mutual labels: sklearn

object-detection-with-svm-and-opencv

detect objects using svm and opencv

Stars: ✭ 24 (+4.35%)

Mutual labels: sklearn

papilo

DEPRECATED: Stream data processing micro-framework

Stars: ✭ 24 (+4.35%)

Mutual labels: data-engineering

qsv

CSVs sliced, diced & analyzed.

Stars: ✭ 438 (+1804.35%)

Mutual labels: data-engineering

tymon

An AI Assistant More Than a Toolkit

Stars: ✭ 46 (+100%)

Mutual labels: sklearn

ml course

"Learning Machine Learning" Course, Bogotá, Colombia 2019 #LML2019

Stars: ✭ 22 (-4.35%)

Mutual labels: sklearn

Word2VecAndTsne

Scripts demo-ing how to train a Word2Vec model and reduce its vector space

Stars: ✭ 45 (+95.65%)

Mutual labels: sklearn

contessa

Easy way to define, execute and store quality rules for your data.

Stars: ✭ 17 (-26.09%)

Mutual labels: data-engineering

soda-spark

Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

Stars: ✭ 58 (+152.17%)

Mutual labels: data-engineering

topic modelling financial news

Topic modelling on financial news with Natural Language Processing

Stars: ✭ 51 (+121.74%)

Mutual labels: sklearn

techloop-ml-plus

Archives and Tasks for ML+ sessions

Stars: ✭ 23 (+0%)

Mutual labels: sklearn

machine-learning-templates

Template codes and examples for Python machine learning concepts

Stars: ✭ 40 (+73.91%)

Mutual labels: sklearn

big-data-engineering-indonesia

A curated list of big data engineering tools, resources and communities.

Stars: ✭ 26 (+13.04%)

Mutual labels: data-engineering

Base machine learning image and environment.

Stars: ✭ 15 (-34.78%)

Mutual labels: sklearn

Everything-Tech

A collection of online resources to help you on your Tech journey.

Stars: ✭ 396 (+1621.74%)

Mutual labels: data-engineering

hive-metastore-client

A client for connecting and running DDLs on hive metastore.

Stars: ✭ 37 (+60.87%)

Mutual labels: data-engineering

lrmr

Less-Resilient MapReduce framework for Go

Stars: ✭ 32 (+39.13%)

Mutual labels: data-engineering

foreshadow

An automatic machine learning system

Stars: ✭ 29 (+26.09%)

Mutual labels: sklearn

KivyMLApp

The repository host the API for the ML model via FastAPI, Flask and contains android app development files using KivyMD.

Stars: ✭ 54 (+134.78%)

Mutual labels: sklearn

flask-angular-data-science

Repository for a data science starter app using Flask, Angular and Docker. https://medium.com/@dvelsner/deploying-a-simple-machine-learning-model-in-a-modern-web-application-flask-angular-docker-a657db075280

Stars: ✭ 84 (+265.22%)

Mutual labels: sklearn

get_smarties

Like pd.get_dummies... but smarter.

The problem

When working with a categorical dataset, most people use the pandas.get_dummies function for easy dummy variable generation. This works well until you have to compare two subsets of your dataset (as in prediction). If a subset doesn't contain a row for each possible value of some feature, the resulting datasets will have different shapes.

For example, say we have a dataset with a 'gender' column that has two possible values: Male and Female.

	...	gender
1	...	Male
2	...	Female
3	...	Male

The pd.get_dummies function would give you:

	...	gender_Male	gender_Female
1	...	1	0
2	...	0	1
3	...	1	0

But now, say we have another instance and do some machine learning voodoo to predict their gender. Say we predict a male. get_dummies would give:

	...	gender_Male
1	...	1

Since pandas never saw a Female in this subset, it only generates a column for Male. The result is that your new and original samples have different shapes, which causes trouble when, for example, computing a loss.
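The mismatch described above can be reproduced in a few lines. This is a minimal sketch using only pandas; the DataFrame names `train` and `new` are made up for illustration:

```python
import pandas as pd

# Original dataset: both categories are present.
train = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
# New sample: only one category is present.
new = pd.DataFrame({"gender": ["Male"]})

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)

print(list(train_dummies.columns))  # ['gender_Female', 'gender_Male']
print(list(new_dummies.columns))    # ['gender_Male']
print(train_dummies.shape, new_dummies.shape)  # (3, 2) (1, 1)
```

The two results have a different number of columns, so they cannot be compared or fed to the same model directly.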

See more discussion of this issue at this thread.

The solution

get_smarties allows you to easily generate dummy variables while persisting the possible values under each category for you. You can use conventional fit_transform and transform methods and solve this problem with virtually no additional effort, like so:

from get_smarties import Smarties
gs = Smarties()

# generate dummies on original dataset, store values for later
X = gs.fit_transform(data)

# generate more dummies on new sample using previously stored values
Y = gs.transform(prediction)
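To make the fit/transform idea concrete, here is a rough pure-Python sketch of the general technique: remember the values seen during fit, then always emit one column per remembered value. The class `TinyDummyEncoder` is hypothetical and is not how get_smarties is implemented; it only illustrates the concept.

```python
class TinyDummyEncoder:
    """Hypothetical illustration only; not the get_smarties implementation."""

    def fit(self, rows):
        # Remember every (column, value) pair seen, in a stable sorted order.
        self.columns_ = sorted(
            {(col, val) for row in rows for col, val in row.items()}
        )
        return self

    def transform(self, rows):
        # Emit one 0/1 entry per remembered pair, even if absent from `rows`.
        return [
            [1 if row.get(col) == val else 0 for col, val in self.columns_]
            for row in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


encoder = TinyDummyEncoder()
X = encoder.fit_transform(
    [{"gender": "Male"}, {"gender": "Female"}, {"gender": "Male"}]
)
Y = encoder.transform([{"gender": "Male"}])

print(X)  # [[0, 1], [1, 0], [0, 1]]
print(Y)  # [[0, 1]]
```

Because the column layout is fixed at fit time, the single new sample gets the same two columns (Female, Male) as the original data.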

Pipelines

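Because get_smarties has fit/transform capabilities, you can even inject your dummy variable creation directly into sklearn pipelines:

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from get_smarties import Smarties

# dummy generation runs as the first pipeline step, before the classifier
training_pipeline = Pipeline([
    ('smarties', Smarties()),
    ('clf', MultinomialNB()),
])

training_pipeline.fit(data, labels)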

See a working example with k-fold cross validation at kfold-pipeline-demo.ipynb.
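For readers who want to try the pipeline pattern without installing get_smarties, the following is a self-contained sketch. It assumes pandas and scikit-learn are available and uses a hypothetical stand-in transformer, `ColumnRememberingDummies`, in place of Smarties; the toy `color` data and labels are invented for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class ColumnRememberingDummies(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for Smarties; remembers dummy columns seen in fit."""

    def fit(self, X, y=None):
        # Store the dummy columns produced by the training data.
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        # Align to the stored columns; categories missing from X become zeros.
        dummies = pd.get_dummies(X, dtype=int)
        return dummies.reindex(columns=self.columns_, fill_value=0)


# Toy data, invented for this sketch.
data = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("dummies", ColumnRememberingDummies()),
    ("clf", MultinomialNB()),
])
pipeline.fit(data, labels)

# A single new sample still yields all three dummy columns.
new_sample = pd.DataFrame({"color": ["red"]})
prediction = pipeline.predict(new_sample)
print(prediction)
```

The design choice here is to fix the column layout at fit time and reindex at transform time, which is one common way to keep training and prediction inputs the same shape.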

Setup

With pip, simply run:

pip install -e git+https://github.com/joeddav/get_smarties.git#egg=get_smarties
