joeddav / get_smarties

Licence: MIT license
Dummy variable generation with fit/transform capabilities

get_smarties

Like pd.get_dummies... but smarter.

The problem

When working with a categorical dataset, most people use the pandas.get_dummies function for easy dummy variable generation. This is well and good, until you have to compare two subsets of your dataset (as in prediction). If a subset doesn't contain at least one row for each possible value of some feature, get_dummies will generate fewer columns for it, and your resulting datasets will have different shapes.

For example, say we have a dataset with a 'gender' column that has two possible values: Male and Female.

    ...  gender
1   ...  Male
2   ...  Female
3   ...  Male

The pd.get_dummies function would give you:

    ...  gender_Male  gender_Female
1   ...  1            0
2   ...  0            1
3   ...  1            0

But now, say we have another instance and do some machine learning voodoo to predict their gender. Say we predict a male. get_dummies would give:

    ...  gender_Male
1   ...  1

Since Pandas never saw a Female in this subset, it only generates a category for Male. The result is that your new and original samples have different shapes, making all kinds of trouble for computing loss, for example.
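
The mismatch is easy to reproduce with pandas alone. The snippet below is a minimal sketch; the two small DataFrames are made up for illustration and only mirror the 'gender' example above. Note that pandas orders the generated columns alphabetically, so gender_Female comes first here.

```python
import pandas as pd

# Illustrative data: the training set contains both values of 'gender'.
train = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
# A single new sample happens to contain only one of them.
new_sample = pd.DataFrame({"gender": ["Male"]})

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new_sample)

print(list(train_dummies.columns))  # ['gender_Female', 'gender_Male']
print(list(new_dummies.columns))    # ['gender_Male']
print(train_dummies.shape, new_dummies.shape)  # (3, 2) (1, 1)
```

The two results cannot be compared column by column, because the second one has no gender_Female column at all.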

See more discussion of this issue at this thread.

The solution

get_smarties allows you to easily generate dummy variables while persisting the possible values under each category for you. You can use conventional fit_transform and transform methods and solve this problem with virtually no additional effort, like so:

from get_smarties import Smarties
gs = Smarties()

# generate dummies on original dataset, store values for later
X = gs.fit_transform(data)

# generate more dummies on new sample using previously stored values
Y = gs.transform(prediction)
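
To make the idea of "storing values for later" concrete, here is a rough sketch of the same fit/transform pattern written with plain pandas. It is not how get_smarties is implemented; the class name RememberingDummies is hypothetical and exists only for this illustration. It simply remembers the dummy columns seen at fit time and re-creates that layout on new data.

```python
import pandas as pd


class RememberingDummies:
    """Hypothetical helper (not part of get_smarties) that remembers dummy columns."""

    def fit_transform(self, df):
        dummies = pd.get_dummies(df)
        # Store the column layout seen at fit time.
        self.columns_ = list(dummies.columns)
        return dummies

    def transform(self, df):
        # Re-create the stored layout: columns missing from the new data are
        # added and filled with 0, columns unseen at fit time are dropped.
        return pd.get_dummies(df).reindex(columns=self.columns_, fill_value=0)


enc = RememberingDummies()
X = enc.fit_transform(pd.DataFrame({"gender": ["Male", "Female", "Male"]}))
Y = enc.transform(pd.DataFrame({"gender": ["Male"]}))

print(list(X.columns) == list(Y.columns))  # True
print(Y.shape)  # (1, 2)
```

With the column layout stored, the single-row sample gets a gender_Female column filled with 0, so it lines up with the original dataset.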

Pipelines

Because get_smarties has fit/transform capabilities, you can even inject your dummy variable creation directly into sklearn pipelines:

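from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from get_smarties import Smarties

# dummy variable generation followed by a Naive Bayes classifier
training_pipeline = Pipeline([
    ('smarties', Smarties()),
    ('clf', MultinomialNB()),
])

training_pipeline.fit(data, labels)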

See a working example with k-fold cross validation at kfold-pipeline-demo.ipynb.

Setup

With pip, simply run

pip install -e git+https://github.com/joeddav/get_smarties.git#egg=get_smarties