CategoricalEncodingBenchmark

Benchmarking different approaches for categorical encoding

Reproducibility of results

Requirements

pip install -r requirements.txt

Benchmark your dataset

To benchmark encoders on your own dataset:

  1. Install the libraries from requirements.txt.

  2. Prepare the dataset as shown in notebooks/1-prepare-datasets.ipynb.

  3. Add the dataset's name to dataset_list in src/run_experiment.py (see the sketch below).

  4. Run python run_experiment.py.

  5. Inspect the results with notebooks/2-show-results.ipynb.
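
Step 3 usually amounts to a one-line change. The excerpt below is a hypothetical sketch; the actual contents of src/run_experiment.py may differ:

```python
# Hypothetical excerpt from src/run_experiment.py: the surrounding code may
# differ, but step 3 boils down to appending your dataset's name here.
dataset_list = [
    "telecom",
    "adult",
    # ...
    "my_new_dataset",  # the dataset prepared in step 2
]
```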

Used datasets and raw scores

All datasets except poverty_A(B,C) come from different domains; they differ in the number of observations and in the number of categorical and numerical features. The objective for all datasets is binary classification. Preprocessing was simple: I removed all time-based columns, so the remaining columns were either categorical or numerical. Details of the experiments can be found in my blog post: Benchmarking Categorical Encoders.
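
As an illustration of that preprocessing, here is a minimal sketch; the real steps live in notebooks/1-prepare-datasets.ipynb, and the file path and column names below are hypothetical:

```python
import pandas as pd

# Illustrative sketch of the preprocessing described above, not the
# repository's actual code. File path and column names are hypothetical.
df = pd.read_csv("data/my_dataset.csv")

# Drop time-based columns (hypothetical examples).
time_based_cols = ["signup_date", "last_activity"]
df = df.drop(columns=time_based_cols)

# Everything that remains is treated as either categorical or numerical.
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
numerical_cols = df.select_dtypes(include=["number"]).columns.tolist()
```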

Table 1.1 Used datasets

| Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
|------|--------------|--------------|-------------|--------------------|--------------------------------|-------------------|
| Telecom | 7.0k | 4.2k | 2.8k | 20 | 16 | Churn prediction for telecom data |
| Adult | 48.8k | 29.3k | 19.5k | 15 | 8 | Predict whether a person's income exceeds $50k |
| Employee | 32.7k | 19.6k | 13.1k | 10 | 9 | Predict an employee's access needs given their job role |
| Credit | 307.5k | 184.5k | 123k | 121 | 18 | Loan repayment |
| Mortgages | 45.6k | 27.4k | 18.2k | 20 | 9 | Predict whether a house mortgage is funded |
| Promotion | 54.8k | 32.8k | 21.9k | 13 | 5 | Predict whether an employee will get a promotion |
| Kick | 72.9k | 43.7k | 29.1k | 32 | 19 | Predict whether a car purchased at auction is a good or bad buy |
| Kdd_upselling | 50k | 30k | 20k | 230 | 40 | Predict up-selling for a customer |
| Taxi | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
| Poverty_A | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether a given household in a given country is poor |
| Poverty_B | 20.2k | 12.1k | 8.1k | 224 | 191 | Predict whether a given household in a given country is poor |
| Poverty_C | 29.9k | 17.9k | 11.9k | 41 | 35 | Predict whether a given household in a given country is poor |

The ROC AUC scores for each dataset are presented in the tables below. Note: some experiments required too much memory to run, so some values are missing (shown as "-").

Table 1.2 ROC AUC scores for None Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|---------|---------|-------|----------|--------|-----------|-----------|------|---------------|------|-----------|-----------|-----------|
| BackwardDifferenceEncoder | 0.6454 | 0.8555 | 0.5006 | 0.7442 | 0.5997 | 0.6482 | - | - | - | 0.5149 | 0.5484 | 0.4945 |
| CatBoostEncoder | 0.7666 | 0.868 | 0.5004 | 0.7478 | 0.6279 | 0.7811 | 0.6583 | 0.8549 | 0.5477 | 0.5179 | 0.5638 | 0.5427 |
| FrequencyEncoder | 0.8405 | 0.9291 | 0.807 | 0.7593 | 0.6949 | 0.9052 | 0.7907 | 0.8643 | 0.5656 | 0.7276 | 0.6164 | 0.7177 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.83 | 0.7601 | 0.7001 | 0.9079 | - | - | - | 0.7325 | 0.6343 | 0.7168 |
| JamesSteinEncoder | 0.7195 | 0.8688 | 0.5003 | 0.7485 | 0.6049 | 0.7984 | 0.6592 | 0.8516 | 0.5432 | 0.4918 | 0.5304 | 0.4836 |
| LeaveOneOutEncoder | 0.5 | 0.5214 | 0.6233 | 0.4957 | 0.5 | 0.5457 | 0.5027 | 0.5 | 0.5 | 0.5006 | 0.5002 | 0.4527 |
| MEstimateEncoder | 0.6944 | 0.8617 | 0.4998 | 0.7368 | 0.6086 | 0.8156 | 0.653 | 0.8448 | 0.5091 | 0.5254 | 0.434 | 0.4528 |
| OrdinalEncoder | 0.7409 | 0.8616 | 0.501 | 0.7445 | 0.6008 | 0.7124 | 0.6531 | 0.8448 | 0.5498 | 0.473 | 0.4683 | 0.5611 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | - | - | - | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.7195 | 0.8696 | 0.5003 | 0.7483 | 0.6064 | 0.7971 | 0.6594 | 0.8483 | 0.5428 | 0.4955 | 0.5401 | 0.4751 |
| WOEEncoder | 0.7056 | 0.8645 | 0.5012 | 0.7439 | 0.615 | 0.7345 | 0.6398 | 0.844 | 0.5485 | 0.478 | 0.5356 | 0.4671 |

Table 1.3 ROC AUC scores for Single Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|---------|---------|-------|----------|--------|-----------|-----------|------|---------------|------|-----------|-----------|-----------|
| BackwardDifferenceEncoder | 0.8382 | 0.9293 | 0.7569 | 0.7595 | 0.6894 | 0.9064 | - | - | - | 0.7323 | 0.6151 | 0.7108 |
| CatBoostEncoder | 0.8392 | 0.9292 | 0.8498 | 0.7594 | 0.6951 | 0.8918 | 0.7901 | 0.8654 | 0.5844 | 0.7429 | 0.6902 | 0.7333 |
| FrequencyEncoder | 0.8392 | 0.9293 | 0.8138 | 0.7592 | 0.6937 | 0.9055 | 0.7902 | 0.8634 | 0.582 | 0.7302 | 0.6128 | 0.7195 |
| HelmertEncoder | 0.8404 | 0.9297 | 0.8344 | 0.7597 | 0.7027 | 0.9083 | - | - | - | 0.7297 | 0.6374 | 0.7196 |
| JamesSteinEncoder | 0.8388 | 0.9292 | 0.7817 | 0.7597 | 0.667 | 0.9053 | 0.5835 | 0.726 | 0.5898 | 0.7303 | 0.6764 | 0.7217 |
| LeaveOneOutEncoder | 0.5 | 0.5182 | 0.6121 | 0.4997 | 0.5 | 0.5403 | 0.4682 | 0.5 | 0.5 | 0.5103 | 0.5 | 0.4959 |
| MEstimateEncoder | 0.8394 | 0.929 | 0.7353 | 0.7593 | 0.6957 | 0.9054 | 0.5877 | 0.5953 | 0.5946 | 0.7302 | 0.6493 | 0.7076 |
| OrdinalEncoder | 0.8404 | 0.9299 | 0.8274 | 0.7585 | 0.6917 | 0.9078 | 0.7809 | 0.8465 | 0.6034 | 0.7337 | 0.6635 | 0.742 |
| SumEncoder | 0.8404 | 0.929 | 0.8053 | 0.7593 | 0.6944 | 0.9073 | - | - | - | 0.7355 | 0.6206 | 0.7372 |
| TargetEncoder | 0.8388 | 0.9293 | 0.815 | 0.7599 | 0.6702 | 0.9057 | 0.7042 | 0.713 | 0.5894 | 0.7292 | 0.6742 | 0.7207 |
| WOEEncoder | 0.8393 | 0.9294 | 0.8325 | 0.7599 | 0.6801 | 0.9056 | 0.7172 | 0.8391 | 0.5903 | 0.7279 | 0.6737 | 0.7224 |

Table 1.4 ROC AUC scores for Double Validation

| Encoder | telecom | adult | employee | credit | mortgages | promotion | kick | kdd_upselling | taxi | poverty_A | poverty_B | poverty_C |
|---------|---------|-------|----------|--------|-----------|-----------|------|---------------|------|-----------|-----------|-----------|
| CatBoostEncoder | 0.8394 | 0.9293 | 0.8529 | 0.7592 | 0.6967 | 0.9056 | 0.7899 | 0.8633 | 0.6031 | 0.7418 | 0.6902 | 0.7343 |
| FrequencyEncoder | 0.8371 | 0.9221 | 0.5563 | 0.755 | 0.6582 | 0.8749 | 0.7655 | 0.8551 | 0.5657 | 0.6873 | 0.6037 | 0.6961 |
| JamesSteinEncoder | 0.8398 | 0.9296 | 0.8489 | 0.7598 | 0.6981 | 0.905 | 0.7901 | 0.8628 | 0.6033 | 0.7412 | 0.6895 | 0.7366 |
| LeaveOneOutEncoder | 0.8393 | 0.9295 | 0.8496 | 0.7595 | 0.6963 | 0.9055 | 0.7902 | 0.8635 | 0.602 | 0.7416 | 0.6931 | 0.7345 |
| MEstimateEncoder | 0.8405 | 0.9292 | 0.8125 | 0.7597 | 0.6939 | 0.9063 | 0.7881 | 0.863 | 0.5984 | 0.7375 | 0.6801 | 0.7204 |
| TargetEncoder | 0.8393 | 0.9294 | 0.8537 | 0.7596 | 0.6954 | 0.9057 | 0.7909 | 0.8643 | 0.6025 | 0.7415 | 0.6903 | 0.7352 |
| WOEEncoder | 0.8401 | 0.9294 | 0.824 | 0.7599 | 0.6977 | 0.9041 | 0.7905 | 0.8631 | 0.6011 | 0.7407 | 0.6911 | 0.7345 |

Results

To determine the best encoder, I min-max scaled the ROC AUC scores within each dataset and then averaged the scaled scores across datasets for each encoder. The resulting value is an average performance score per encoder (higher is better). Encoder performance scores for each type of validation are shown in Tables 2.1–2.3.
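
A minimal sketch of that computation, assuming `scores` is an encoders-by-datasets DataFrame of ROC AUC values (an illustration, not the repository's actual code):

```python
import pandas as pd

def encoder_performance(scores: pd.DataFrame) -> pd.Series:
    # Min-max scale each dataset's column so the best encoder on that
    # dataset gets 1.0 and the worst gets 0.0, then average across
    # datasets per encoder. Missing values are skipped by pandas.
    scaled = (scores - scores.min()) / (scores.max() - scores.min())
    return scaled.mean(axis=1).sort_values(ascending=False)
```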

To determine the best validation strategy, I compared the top score achieved on each dataset under each type of validation. The score improvements (the top score per dataset in Table 2.4, the average score per encoder in Table 2.5) are shown below.
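
A companion sketch for the validation comparison, using the same encoders-by-datasets layout (again an assumption-laden illustration, not the repository's code):

```python
import pandas as pd

def top_score_improvement(before: pd.DataFrame, after: pd.DataFrame) -> pd.Series:
    # `before` and `after` hold ROC AUC scores under two validation
    # strategies. For each dataset (column), take the best score across
    # encoders and report the percent change between strategies.
    return 100 * (after.max() - before.max()) / before.max()
```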

Table 2.1 Encoders performance scores - None Validation

| Encoder | Performance score |
|---------|-------------------|
| HelmertEncoder | 0.9517 |
| SumEncoder | 0.9434 |
| FrequencyEncoder | 0.9176 |
| CatBoostEncoder | 0.5728 |
| TargetEncoder | 0.5174 |
| JamesSteinEncoder | 0.5162 |
| OrdinalEncoder | 0.4964 |
| WOEEncoder | 0.4905 |
| MEstimateEncoder | 0.4501 |
| BackwardDifferenceEncoder | 0.4128 |
| LeaveOneOutEncoder | 0.0697 |

Table 2.2 Encoders performance scores - Single Validation

| Encoder | Performance score |
|---------|-------------------|
| CatBoostEncoder | 0.9726 |
| OrdinalEncoder | 0.9694 |
| HelmertEncoder | 0.9558 |
| SumEncoder | 0.9434 |
| WOEEncoder | 0.9326 |
| FrequencyEncoder | 0.9315 |
| BackwardDifferenceEncoder | 0.9108 |
| TargetEncoder | 0.8915 |
| JamesSteinEncoder | 0.8555 |
| MEstimateEncoder | 0.8189 |
| LeaveOneOutEncoder | 0.0729 |

Table 2.3 Encoders performance scores - Double Validation

| Encoder | Performance score |
|---------|-------------------|
| JamesSteinEncoder | 0.9918 |
| CatBoostEncoder | 0.9917 |
| TargetEncoder | 0.9916 |
| LeaveOneOutEncoder | 0.9909 |
| WOEEncoder | 0.9838 |
| MEstimateEncoder | 0.9686 |
| FrequencyEncoder | 0.8018 |

Table 2.4 Top score improvement (percent)

| Dataset | None -> Single | Single -> Double |
|---------|----------------|------------------|
| telecom | 0.00 | 0.01 |
| adult | 0.02 | -0.03 |
| employee | 1.98 | 0.39 |
| credit | -0.01 | -0.00 |
| mortgages | 0.26 | -0.47 |
| promotion | 0.04 | -0.20 |
| kick | -0.05 | 0.06 |
| kdd_upselling | 0.10 | -0.11 |
| taxi | 3.78 | -0.01 |
| poverty_A | 0.74 | -0.11 |
| poverty_B | 5.59 | 0.29 |
| poverty_C | 0.48 | -0.54 |

Table 2.5 Encoders performance scores improvement (percent)

| Encoder | None -> Single | Single -> Double |
|---------|----------------|------------------|
| BackwardDifferenceEncoder | 27.20 | - |
| CatBoostEncoder | 20.10 | 0.40 |
| FrequencyEncoder | 0.30 | -4.90 |
| HelmertEncoder | 0.20 | - |
| JamesSteinEncoder | 17.70 | 6.30 |
| LeaveOneOutEncoder | 0.20 | 53.20 |
| MEstimateEncoder | 18.90 | 8.10 |
| OrdinalEncoder | 24.10 | - |
| SumEncoder | 0.00 | - |
| TargetEncoder | 19.60 | 4.20 |
| WOEEncoder | 23.40 | 1.90 |