
zake7749 / DeepToxic

License: MIT
Top 1% solution to the Toxic Comment Classification Challenge on Kaggle.

DeepToxic

This is part of the 27th place solution for the Toxic Comment Classification Challenge. For easy understanding, I only uploaded what I used in the final stage and did not attach any experimental or deprecated code.

Dataset and external pre-trained embeddings

You can fetch the dataset here. I used 3 kinds of word embeddings: FastText, GloVe, and Twitter (as listed in the results tables below).

Overview

Preprocessing

We trained our models on 3 datasets with different preprocessing:

  • Original dataset with spelling correction: corrected by comparing Levenshtein distances and applying many regular expressions.
  • Original dataset with POS taggings: we generated the part-of-speech (POS) tags for every comment with TextBlob and concatenated the word embedding and the POS embedding into a single one. Since TextBlob drops some tokens and punctuation when generating the POS sequences, this gives our models another view of the data.
  • Riad's dataset: with very heavy data cleaning, spelling correction, and translation.
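The Levenshtein-style spelling correction in the first bullet can be sketched with Python's standard library; `difflib.get_close_matches` uses a similar edit-based similarity. The `VOCAB` below is a tiny hypothetical stand-in for a real embedding vocabulary, so this is only an illustration of the idea, not the competition code:

```python
import difflib

# Hypothetical vocabulary of known-good tokens; in practice this would
# come from the pre-trained embedding's vocabulary.
VOCAB = ["toxic", "comment", "stupid", "classification"]

def correct_token(token, vocab=VOCAB, cutoff=0.8):
    """Return the closest in-vocabulary word, or the token unchanged."""
    if token in vocab:
        return token
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(correct_token("stupd"))   # close to "stupid", so it is corrected
print(correct_token("hello"))   # no close match: returned unchanged
```

The actual solution combined this kind of distance-based matching with many hand-written regular expressions for patterns that pure edit distance misses.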

Models

In our case, the simpler, the better. I tried some complicated structures (RHN, DPCNN, HAN). Most of them performed very well locally but got a lower AUC on the leaderboard. The two models I kept using during the final stage are:

Pooled RNN (public: 0.9862, private: 0.9858) [architecture figure: pooledRNN]

Kmax text CNN (public: 0.9856, private: 0.9849) [architecture figure: kmaxCNN]
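K-max pooling, as used in the CNN above, keeps the k largest activations per feature map while preserving their original temporal order. A minimal NumPy sketch (shapes are illustrative):

```python
import numpy as np

def kmax_pooling(x, k):
    """K-max pooling over the time axis.

    x: array of shape (timesteps, channels).
    Returns the k largest values per channel, kept in their original
    temporal order, shape (k, channels).
    """
    # Indices of the top-k values in each channel (unordered)...
    top_idx = np.argpartition(x, -k, axis=0)[-k:]
    # ...sorted back into temporal order per channel.
    top_idx = np.sort(top_idx, axis=0)
    return np.take_along_axis(x, top_idx, axis=0)

feats = np.array([[0.1, 0.9],
                  [0.8, 0.2],
                  [0.3, 0.7],
                  [0.5, 0.4]])
print(kmax_pooling(feats, 2))
# channel 0 keeps 0.8 then 0.5; channel 1 keeps 0.9 then 0.7
```

Unlike plain max pooling, this preserves a little positional information, which is why it pairs well with text CNNs.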

As many competitors pointed out, dropout and batch normalization are key to preventing overfitting. Applying dropout directly to the word embeddings and after the pooling layers provides strong regularization on both the train and test sets. Although a model with many dropout layers takes about 5 more epochs to converge, it boosts our scores significantly. For instance, my RNN improved from 0.9853 (private: 0.9850) to 0.9862 (private: 0.9858) after adding dropout layers.
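Dropout applied directly on the word embeddings is usually done as "spatial" dropout: entire embedding dimensions are zeroed across all timesteps, rather than individual values. A NumPy sketch of the idea (the real models would use the framework's built-in layer, e.g. a spatial-dropout layer in Keras):

```python
import numpy as np

def embedding_dropout(emb, rate, seed=0):
    """Spatial dropout: drop whole embedding dimensions for every timestep.

    emb: (timesteps, embed_dim). Each embedding dimension is kept with
    probability (1 - rate); kept values are rescaled (inverted dropout)
    so the expected activation is unchanged.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(emb.shape[1]) >= rate   # one mask entry per channel
    return emb * keep / (1.0 - rate)

emb = np.ones((4, 8))                         # 4 tokens, 8-dim embeddings
dropped = embedding_dropout(emb, rate=0.5)
# every column is either all zeros or all 2.0
```

Dropping whole channels forces the recurrent or convolutional layers not to rely on any single embedding dimension, which is the regularization effect described above.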

To maximize the utility of these datasets, besides training on the original labels, we also added a meta-label, "bad_comment". If a comment carries any of the original labels, it is considered a bad comment. The hypotheses learned from the two label sets are slightly different but yield almost the same LB score, which leaves us room for the ensemble.
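Deriving the "bad_comment" meta-label from the six per-class labels can be sketched as follows (the label names follow the competition's label set; the rows are made-up examples):

```python
import numpy as np

# The six competition labels for three example comments (rows).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = np.array([[0, 0, 0, 0, 0, 0],   # clean comment
              [1, 0, 1, 0, 1, 0],   # toxic + obscene + insult
              [0, 0, 0, 1, 0, 0]])  # threat only

# A comment with any positive label is a "bad comment".
bad_comment = (y.sum(axis=1) > 0).astype(int)
print(bad_comment)  # [0 1 1]
```

Training a second set of models on this binary target gives predictions that are correlated with, but not identical to, the per-class models, which is what makes them useful ensemble members.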

In order to increase diversity and to deal with some toxic typos, we trained the models at both the char level and the word level. The char-level results perform a bit worse (charRNN: 0.983 on LB, 0.982 on PB; charCNN: 0.9808 on LB, 0.9801 on PB), but they have a pretty low correlation with the word-level models. Simply bagging my char-level and word-level results was enough to push me over 0.9869 on the private test set. By the way, hyperparameters influence performance hugely in the char-based models. A large batch size (256) and a very long sequence length (1000) ordinarily give a considerable result, even though the K-fold validation takes much more time (my char-based models usually converge after 60~70 epochs, about 5 times more than my word-based models).
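The bagging step above is a plain average of the predicted probabilities from the two model families; a minimal sketch with illustrative numbers:

```python
import numpy as np

# Hypothetical per-comment toxicity probabilities from the two families.
word_preds = np.array([0.92, 0.10, 0.55])
char_preds = np.array([0.88, 0.20, 0.45])

# Simple bagging: average the probabilities. Because the two families
# have low correlation, the average tends to score better than either
# family alone.
ensemble = (word_preds + char_preds) / 2.0
# ensemble is approximately [0.9, 0.15, 0.5]
```

Since the metric is AUC, only the ranking of the averaged scores matters, so no further calibration of the averaged probabilities is needed.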

Performance of Single models

Scored by AUC on the private test set.

Word level

| Model       | FastText | GloVe   | Twitter |
|-------------|----------|---------|---------|
| AVRNN       | 0.9858   | 0.9855  | 0.9843  |
| Meta-AVRNN  | 0.9850   | 0.9849  | No data |
| Pos-AVRNN   | 0.9850   | No data | 0.9841  |
| AVCNN       | 0.9846   | 0.9845  | 0.9841  |
| Meta-AVCNN  | 0.9844   | 0.9844  | No data |
| Pos-AVCNN   | 0.9850   | No data | No data |
| KmaxTextCNN | 0.9849   | 0.9845  | 0.9835  |
| TextCNN     | 0.9837   | No data | No data |
| RCNN        | 0.9847   | 0.9842  | 0.9832  |
| RHN         | 0.9842   | No data | No data |

Char level

| Model   | AUC    |
|---------|--------|
| AVRNN   | 0.9821 |
| KmaxCNN | 0.9801 |
| AVCNN   | 0.9797 |