mattmotoki / Toxic Comment Classification

Code and write-up for the Kaggle Toxic Comment Classification Challenge

Kaggle - Toxic Comment Classification Challenge

  • 33rd Place Solution
  • Private LB: 0.9872, 33/4551
  • Public LB: 0.9876, 45/4551

This is the write-up and code for the Toxic Comment Classification Challenge, where I placed 33rd out of 4,551 teams. For more information about my approach, see my discussion post.

We were tasked with a multi-label classification problem; in particular, the task was to classify online comments into six categories: toxic, severe_toxic, obscene, threat, insult, identity_hate. The competition metric was the average of the individual AUCs across the six predicted classes.
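
In other words, the metric is just the mean of the column-wise ROC AUCs. A minimal sketch with scikit-learn (the names y_true and y_pred are placeholders for the true label matrix and the predicted probabilities):

import numpy as np
from sklearn.metrics import roc_auc_score

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def mean_column_auc(y_true, y_pred):
    """Average the ROC AUC of each label column (the competition metric)."""
    return np.mean([roc_auc_score(y_true[:, i], y_pred[:, i])
                    for i in range(len(LABELS))])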

Summary of approach:

Embeddings:

  • fastText (with built-in OOV prediction)
  • GloVe
  • LexVec
  • Toxic (300d fastText vectors trained on the competition data)

Models (best private score shown):

  • CapsuleNet (0.9860 private, 0.9859 public)
  • RNN Version 1 (0.9858 private, 0.9863 public)
  • RNN Version 2 (0.9856 private, 0.9861 public)
  • Two Layer CNN (0.9826 private, 0.9835 public)
  • NB-SVM (0.9813 private, 0.9813 public)
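
For context, the NB-SVM baseline is the standard recipe of naive-Bayes log-count ratios on top of TF-IDF features, fed to a linear classifier. A hedged, single-label sketch (X_train, X_test, and y are placeholder TF-IDF matrices and binary labels; the actual implementation and hyperparameters may differ):

import numpy as np
from sklearn.linear_model import LogisticRegression

def nb_log_count_ratio(X, y, alpha=1.0):
    """Naive-Bayes log-count ratio of each feature between the two classes."""
    p = np.asarray(X[y == 1].sum(axis=0)) + alpha   # shape (1, n_features)
    q = np.asarray(X[y == 0].sum(axis=0)) + alpha
    return np.log((p / p.sum()) / (q / q.sum()))

# X_train, X_test: TF-IDF matrices; y: binary labels for one of the six classes.
r = nb_log_count_ratio(X_train, y)
clf = LogisticRegression(C=4.0, max_iter=1000)   # C=4 is a typical choice, not the tuned value
clf.fit(X_train.multiply(r).tocsr(), y)
pred = clf.predict_proba(X_test.multiply(r).tocsr())[:, 1]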

Ensembling (best private score shown):

  • Level 1a: Average 10 out-of-fold predictions (as high as 0.9860 private, 0.9859 public)
  • Level 1b: Average models with different embeddings (as high as 0.9866 private, 0.9871 public)
  • Level 2a: LightGBM Stacking (0.9870 private, 0.9874 public)
  • Level 2b: Average multiple seeds (0.9872 private, 0.9876 public)
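
For the Level 2a step, here is a rough sketch of stacking out-of-fold predictions with LightGBM. The feature matrices oof_preds / test_preds (each base model contributing one column of predicted probabilities) and all hyperparameters are illustrative, not the exact configuration used:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def stack_one_label(oof_preds, test_preds, y, n_splits=10):
    """Fit a per-label LightGBM stacker on the base models' out-of-fold predictions."""
    test_stack = np.zeros(test_preds.shape[0])
    scores = []
    for train_idx, valid_idx in KFold(n_splits, shuffle=True, random_state=0).split(oof_preds):
        model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, num_leaves=15)
        model.fit(oof_preds[train_idx], y[train_idx])
        scores.append(roc_auc_score(y[valid_idx],
                                    model.predict_proba(oof_preds[valid_idx])[:, 1]))
        test_stack += model.predict_proba(test_preds)[:, 1] / n_splits
    return test_stack, float(np.mean(scores))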

Embedding Imputation Details:

My main insight in this competition was how to handle out-of-vocabulary (OOV) words. Replacing missing vectors with zeros or random numbers is suboptimal. Using fastText's built-in OOV prediction instead of naive replacement increases the AUC by ~0.002. For GloVe and LexVec embeddings, I replaced the missing embeddings with similar vectors. To do this, I first trained a fastText model on the data for this competition:

  fasttext skipgram -input "${INPUT_FILE}" -output "${OUTPUT_FILE}" \
  -minCount 1 -neg 25 -thread 8 -dim 300

The -minCount 1 flag ensures that we get perfect recall; i.e., we get a vector for every word in our vocabulary. We can now map each missing word to the most similar word in the intersection of the local vocabulary (from this competition) and the external vocabulary (from the pretrained embeddings). Here's the code to do that¹, written as a small NumPy function:

import numpy as np

def impute_embeddings(local, external):
    """Fill in missing external vectors by copying the external vector of the
    most similar shared word (dot product = cosine similarity; see footnote 1).
    Both arguments are {word: np.ndarray} dicts."""
    shared_words = [w for w in local if w in external]
    missing_words = [w for w in local if w not in external]

    # Columns are the local vectors of the words present in both vocabularies.
    reference_matrix = np.array([local[w] for w in shared_words]).T

    for w in missing_words:
        similarity = local[w] @ reference_matrix
        most_similar_word = shared_words[int(np.argmax(similarity))]
        external[w] = external[most_similar_word]

    return {w: external[w] for w in local}
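
Before calling it, both vocabularies are loaded as {word: vector} dicts and L2-normalized so the dot products in the loop are true cosine similarities (see the footnote below):

# local: the fastText vectors trained above; external: pretrained GloVe or LexVec.
local = {w: v / np.linalg.norm(v) for w, v in local.items()}
external = {w: v / np.linalg.norm(v) for w, v in external.items()}
imputed = impute_embeddings(local, external)   # one external-style vector per local word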

With this technique, GloVe performed as well as, if not better than, fastText with OOV prediction; LexVec performed slightly worse but added valuable diversity to the ensembles.

Timing:

The bulk of the calculation boils down to a vector-matrix multiplication. The naive implementation takes about 20 minutes. Processing the missing words in batches reduces this to about 4 minutes, and using PyTorch (on a 1080 Ti) brings it down to about 1 minute.
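
The batched version just stacks the missing-word vectors and lets one matrix-matrix product per batch (optionally on the GPU via PyTorch) replace the per-word loop. A sketch under those assumptions (batch size and device handling are illustrative):

import numpy as np
import torch

def most_similar_indices(missing_matrix, reference_matrix, batch_size=1024):
    """For each missing-word vector (rows of missing_matrix), return the index of
    the most similar shared word, computing similarities one batch at a time."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    ref = torch.as_tensor(reference_matrix, dtype=torch.float32, device=device)    # (dim, n_shared)
    missing = torch.as_tensor(missing_matrix, dtype=torch.float32, device=device)  # (n_missing, dim)
    best = []
    for start in range(0, missing.shape[0], batch_size):
        sims = missing[start:start + batch_size] @ ref    # (batch, n_shared)
        best.append(sims.argmax(dim=1).cpu().numpy())
    return np.concatenate(best)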

Results:

Here is a table of the scores for a single seed; "Toxic" refers to the 300d vectors trained locally with fastText.

Model                  Embeddings   Private   Public   Local
CapsuleNet             fastText     0.9855    0.9867   0.9896
CapsuleNet             GloVe        0.9860    0.9859   0.9899
CapsuleNet             LexVec       0.9855    0.9858   0.9898
CapsuleNet             Toxic        0.9859    0.9863   0.9901
RNN Version 2          fastText     0.9856    0.9864   0.9904
RNN Version 2          GloVe        0.9858    0.9863   0.9902
RNN Version 2          LexVec       0.9857    0.9859   0.9902
RNN Version 2          Toxic        0.9851    0.9855   0.9906
RNN Version 1          fastText     0.9853    0.9859   0.9898
RNN Version 1          GloVe        0.9855    0.9861   0.9901
RNN Version 1          LexVec       0.9854    0.9857   0.9897
RNN Version 1          Toxic        0.9856    0.9861   0.9903
2 Layer CNN            fastText     0.9826    0.9835   0.9886
2 Layer CNN            GloVe        0.9827    0.9828   0.9883
2 Layer CNN            LexVec       0.9824    0.9831   0.9880
2 Layer CNN            Toxic        0.9806    0.9789   0.9880
SVM with NB features   NA           0.9813    0.9813   0.9863

¹ This assumes all word vectors are normalized so that the inner product equals the cosine similarity.
