CategoricalEncodingBenchmark

Benchmarking different approaches for categorical encoding

Reproducibility of results

Requirements

pip install -r requirements.txt

Benchmark your dataset

To benchmark encoders on your own dataset:

  1. Install the libraries from requirements.txt.

  2. Prepare the dataset as shown in notebooks/1-prepare-datasets.ipynb.

  3. Add the dataset's name to dataset_list in src/run_experiment.py (see the sketch below).

  4. Run python run_experiment.py.

  5. Inspect the results with notebooks/2-show-results.ipynb.
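
Step 3 usually amounts to a one-line change. The excerpt below is a hypothetical sketch; the actual contents of src/run_experiment.py may differ:

```python
# Hypothetical excerpt from src/run_experiment.py: the surrounding code may
# differ, but step 3 boils down to appending your dataset's name here.
dataset_list = [
    "telecom",
    "adult",
    # ...
    "my_new_dataset",  # the dataset prepared in step 2
]
```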

Used datasets and raw scores

All datasets except poverty_A(B,C) come from different domains; they differ in the number of observations and in the number of categorical and numerical features. The objective for all datasets is binary classification. Preprocessing was simple: I removed all time-based columns, so the remaining columns were either categorical or numerical. Details of the experiments can be found in my blog post: Benchmarking Categorical Encoders.
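
As an illustration of that preprocessing, here is a minimal sketch; the real steps live in notebooks/1-prepare-datasets.ipynb, and the file path and column names below are hypothetical:

```python
import pandas as pd

# Illustrative sketch of the preprocessing described above, not the
# repository's actual code. File path and column names are hypothetical.
df = pd.read_csv("data/my_dataset.csv")

# Drop time-based columns (hypothetical examples).
time_based_cols = ["signup_date", "last_activity"]
df = df.drop(columns=time_based_cols)

# Everything that remains is treated as either categorical or numerical.
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
numerical_cols = df.select_dtypes(include=["number"]).columns.tolist()
```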

Table 1.1 Used datasets

| Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
|------|--------------|--------------|-------------|--------------------|--------------------------------|-------------------|
| Telecom | 7.0k | 4.2k | 2.8k | 20 | 16 | Churn prediction for telecom data |
| Adult | 48.8k | 29.3k | 19.5k | 15 | 8 | Predict whether a person's income exceeds $50k |
| Employee | 32.7k | 19.6k | 13.1k | 10 | 9 | Predict an employee's access needs given their job role |
| Credit | 307.5k | 184.5k | 123k | 121 | 18 | Loan repayment |
| Mortgages | 45.6k | 27.4k | 18.2k | 20 | 9 | Predict whether a house mortgage is funded |
| Promotion | 54.8k | 32.8k | 21.9k | 13 | 5 | Predict whether an employee will get a promotion |
| Kick | 72.9k | 43.7k | 29.1k | 32 | 19 | Predict whether a car purchased at auction is a good or bad buy |
| Kdd_upselling | 50k | 30k | 20k | 230 | 40 | Predict up-selling for a customer |
| Taxi | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
| Poverty_A | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether a given household in a given country is poor |
| Poverty_B | 20.2k | 12.1k | 8.1k | 224 | 191 | Predict whether a given household in a given country is poor |
| Poverty_C | 29.9k | 17.9k | 11.9k | 41 | 35 | Predict whether a given household in a given country is poor |

The ROC AUC scores for each dataset are presented in the tables below. Note: some experiments required too much memory to run, so some values are missing (shown as "-").

Table 1.2 ROC AUC scores for None Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|---------|---------|-------|----------|--------|-----------|-----------|------|---------------|------|-----------|-----------|-----------|
| BackwardDifferenceEncoder | 0.6454 | 0.8555 | 0.5006 | 0.7442 | 0.5997 | 0.6482 | - | - | - | 0.5149 | 0.5484 | 0.4945 |
| CatBoostEncoder | 0.7666 | 0.868 | 0.5004 | 0.7478 | 0.6279 | 0.7811 | 0.6583 | 0.8549 | 0.5477 | 0.5179 | 0.5638 | 0.5427 |
| FrequencyEncoder | 0.8405 | 0.9291 | 0.807 | 0.7593 | 0.6949 | 0.9052 | 0.7907 | 0.8643 | 0.5656 | 0.7276 | 0.6164 | 0.7177 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.83 | 0.7601 | 0.7001 | 0.9079 | - | - | - | 0.7325 | 0.6343 | 0.7168 |
| JamesSteinEncoder | 0.7195 | 0.8688 | 0.5003 | 0.7485 | 0.6049 | 0.7984 | 0.6592 | 0.8516 | 0.5432 | 0.4918 | 0.5304 | 0.4836 |
| LeaveOneOutEncoder | 0.5 | 0.5214 | 0.6233 | 0.4957 | 0.5 | 0.5457 | 0.5027 | 0.5 | 0.5 | 0.5006 | 0.5002 | 0.4527 |
| MEstimateEncoder | 0.6944 | 0.8617 | 0.4998 | 0.7368 | 0.6086 | 0.8156 | 0.653 | 0.8448 | 0.5091 | 0.5254 | 0.434 | 0.4528 |
| OrdinalEncoder | 0.7409 | 0.8616 | 0.501 | 0.7445 | 0.6008 | 0.7124 | 0.6531 | 0.8448 | 0.5498 | 0.473 | 0.4683 | 0.5611 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | - | - | - | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.7195 | 0.8696 | 0.5003 | 0.7483 | 0.6064 | 0.7971 | 0.6594 | 0.8483 | 0.5428 | 0.4955 | 0.5401 | 0.4751 |
| WOEEncoder | 0.7056 | 0.8645 | 0.5012 | 0.7439 | 0.615 | 0.7345 | 0.6398 | 0.844 | 0.5485 | 0.478 | 0.5356 | 0.4671 |

Table 1.3 ROC AUC scores for Single Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|---------|---------|-------|----------|--------|-----------|-----------|------|---------------|------|-----------|-----------|-----------|
| BackwardDifferenceEncoder | 0.8382 | 0.9293 | 0.7569 | 0.7595 | 0.6894 | 0.9064 | - | - | - | 0.7323 | 0.6151 | 0.7108 |
| CatBoostEncoder | 0.8392 | 0.9292 | 0.8498 | 0.7594 | 0.6951 | 0.8918 | 0.7901 | 0.8654 | 0.5844 | 0.7429 | 0.6902 | 0.7333 |
| FrequencyEncoder | 0.8392 | 0.9293 | 0.8138 | 0.7592 | 0.6937 | 0.9055 | 0.7902 | 0.8634 | 0.582 | 0.7302 | 0.6128 | 0.7195 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.8344 | 0.7597 | 0.7027 | 0.9083 | - | - | - | 0.7297 | 0.6374 | 0.7196 |
| JamesSteinEncoder | 0.8388 | 0.9292 | 0.7817 | 0.7597 | 0.667 | 0.9053 | 0.5835 | 0.726 | 0.5898 | 0.7303 | 0.6764 | 0.7217 |
| LeaveOneOutEncoder | 0.5 | 0.5182 | 0.6121 | 0.4997 | 0.5 | 0.5403 | 0.4682 | 0.5 | 0.5 | 0.5103 | 0.5 | 0.4959 |
| MEstimateEncoder | 0.8394 | 0.929 | 0.7353 | 0.7593 | 0.6957 | 0.9054 | 0.5877 | 0.5953 | 0.5946 | 0.7302 | 0.6493 | 0.7076 |
| OrdinalEncoder | 0.8404 | 0.9299 | 0.8274 | 0.7585 | 0.6917 | 0.9078 | 0.7809 | 0.8465 | 0.6034 | 0.7337 | 0.6635 | 0.742 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | - | - | - | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.8388 | 0.9293 | 0.815 | 0.7599 | 0.6702 | 0.9057 | 0.7042 | 0.713 | 0.5894 | 0.7292 | 0.6742 | 0.7207 |
| WOEEncoder | 0.8393 | 0.9294 | 0.8325 | 0.7599 | 0.6801 | 0.9056 | 0.7172 | 0.8391 | 0.5903 | 0.7279 | 0.6737 | 0.7224 |

Table 1.4 ROC AUC scores for Double Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|---------|---------|-------|----------|--------|-----------|-----------|------|---------------|------|-----------|-----------|-----------|
| CatBoostEncoder | 0.8394 | 0.9293 | 0.8529 | 0.7592 | 0.6967 | 0.9056 | 0.7899 | 0.8633 | 0.6031 | 0.7418 | 0.6902 | 0.7343 |
| FrequencyEncoder | 0.8371 | 0.9221 | 0.5563 | 0.755 | 0.6582 | 0.8749 | 0.7655 | 0.8551 | 0.5657 | 0.6873 | 0.6037 | 0.6961 |
| JamesSteinEncoder | 0.8398 | 0.9296 | 0.8489 | 0.7598 | 0.6981 | 0.905 | 0.7901 | 0.8628 | 0.6033 | 0.7412 | 0.6895 | 0.7366 |
| LeaveOneOutEncoder | 0.8393 | 0.9295 | 0.8496 | 0.7595 | 0.6963 | 0.9055 | 0.7902 | 0.8635 | 0.602 | 0.7416 | 0.6931 | 0.7345 |
| MEstimateEncoder | 0.8405 | 0.9292 | 0.8125 | 0.7597 | 0.6939 | 0.9063 | 0.7881 | 0.863 | 0.5984 | 0.7375 | 0.6801 | 0.7204 |
| TargetEncoder | 0.8393 | 0.9294 | 0.8537 | 0.7596 | 0.6954 | 0.9057 | 0.7909 | 0.8643 | 0.6025 | 0.7415 | 0.6903 | 0.7352 |
| WOEEncoder | 0.8401 | 0.9294 | 0.824 | 0.7599 | 0.6977 | 0.9041 | 0.7905 | 0.8631 | 0.6011 | 0.7407 | 0.6911 | 0.7345 |

Results

To determine the best encoder, I min-max scaled the ROC AUC scores within each dataset and then averaged the scaled scores across datasets for each encoder. The resulting value is an average performance score per encoder (higher is better). Encoder performance scores for each type of validation are shown in Tables 2.1–2.3.
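
A minimal sketch of that computation, assuming `scores` is an encoders-by-datasets DataFrame of ROC AUC values (an illustration, not the repository's actual code):

```python
import pandas as pd

def encoder_performance(scores: pd.DataFrame) -> pd.Series:
    # Min-max scale each dataset's column so the best encoder on that
    # dataset gets 1.0 and the worst gets 0.0, then average across
    # datasets per encoder. Missing values are skipped by pandas.
    scaled = (scores - scores.min()) / (scores.max() - scores.min())
    return scaled.mean(axis=1).sort_values(ascending=False)
```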

To determine the best validation strategy, I compared the top score achieved on each dataset under each type of validation. The score improvements (the top score per dataset in Table 2.4, the average score per encoder in Table 2.5) are shown below.
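
A companion sketch for the validation comparison, using the same encoders-by-datasets layout (again an assumption-laden illustration, not the repository's code):

```python
import pandas as pd

def top_score_improvement(before: pd.DataFrame, after: pd.DataFrame) -> pd.Series:
    # `before` and `after` hold ROC AUC scores under two validation
    # strategies. For each dataset (column), take the best score across
    # encoders and report the percent change between strategies.
    return 100 * (after.max() - before.max()) / before.max()
```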

Table 2.1 Encoders performance scores - None Validation

| Encoder | Performance score |
|---------|-------------------|
| HelmertEncoder | 0.9517 |
| SumEncoder | 0.9434 |
| FrequencyEncoder | 0.9176 |
| CatBoostEncoder | 0.5728 |
| TargetEncoder | 0.5174 |
| JamesSteinEncoder | 0.5162 |
| OrdinalEncoder | 0.4964 |
| WOEEncoder | 0.4905 |
| MEstimateEncoder | 0.4501 |
| BackwardDifferenceEncoder | 0.4128 |
| LeaveOneOutEncoder | 0.0697 |

Table 2.2 Encoders performance scores - Single Validation

| Encoder | Performance score |
|---------|-------------------|
| CatBoostEncoder | 0.9726 |
| OrdinalEncoder | 0.9694 |
| HelmertEncoder | 0.9558 |
| SumEncoder | 0.9434 |
| WOEEncoder | 0.9326 |
| FrequencyEncoder | 0.9315 |
| BackwardDifferenceEncoder | 0.9108 |
| TargetEncoder | 0.8915 |
| JamesSteinEncoder | 0.8555 |
| MEstimateEncoder | 0.8189 |
| LeaveOneOutEncoder | 0.0729 |

Table 2.3 Encoders performance scores - Double Validation

| Encoder | Performance score |
|---------|-------------------|
| JamesSteinEncoder | 0.9918 |
| CatBoostEncoder | 0.9917 |
| TargetEncoder | 0.9916 |
| LeaveOneOutEncoder | 0.9909 |
| WOEEncoder | 0.9838 |
| MEstimateEncoder | 0.9686 |
| FrequencyEncoder | 0.8018 |

Table 2.4 Top score improvement (percent)

| Dataset | None -> Single | Single -> Double |
|---------|----------------|------------------|
| telecom | 0.00 | 0.01 |
| adult | 0.02 | -0.03 |
| employee | 1.98 | 0.39 |
| credit | -0.01 | -0.00 |
| mortgages | 0.26 | -0.47 |
| promotion | 0.04 | -0.20 |
| kick | -0.05 | 0.06 |
| kdd_upselling | 0.10 | -0.11 |
| taxi | 3.78 | -0.01 |
| poverty_A | 0.74 | -0.11 |
| poverty_B | 5.59 | 0.29 |
| poverty_C | 0.48 | -0.54 |

Table 2.5 Encoders performance scores improvement (percent)

| Encoder | None -> Single | Single -> Double |
|---------|----------------|------------------|
| BackwardDifferenceEncoder | 27.20 | - |
| CatBoostEncoder | 20.10 | 0.40 |
| FrequencyEncoder | 0.30 | -4.90 |
| HelmertEncoder | 0.20 | - |
| JamesSteinEncoder | 17.70 | 6.30 |
| LeaveOneOutEncoder | 0.20 | 53.20 |
| MEstimateEncoder | 18.90 | 8.10 |
| OrdinalEncoder | 24.10 | - |
| SumEncoder | 0.00 | - |
| TargetEncoder | 19.60 | 4.20 |
| WOEEncoder | 23.40 | 1.90 |