
hate-alert / DE-LIMIT

License: MIT
DeEpLearning models for MultIlingual haTespeech (DELIMIT): Benchmarking multilingual models across 9 languages and 16 datasets.

Programming Languages

Jupyter Notebook
Python

Projects that are alternatives of or similar to DE-LIMIT

kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-63.33%)
Mutual labels:  multilingual, bert
JointIDSF
BERT-based joint intent detection and slot filling with intent-slot attention mechanism (INTERSPEECH 2021)
Stars: ✭ 55 (-38.89%)
Mutual labels:  bert
Transformer-QG-on-SQuAD
Implement Question Generator with SOTA pre-trained Language Models (RoBERTa, BERT, GPT, BART, T5, etc.)
Stars: ✭ 28 (-68.89%)
Mutual labels:  bert
FasterTransformer
Transformer related optimization, including BERT, GPT
Stars: ✭ 1,571 (+1645.56%)
Mutual labels:  bert
bert for corrector
Chinese text error correction based on BERT
Stars: ✭ 199 (+121.11%)
Mutual labels:  bert
BertSimilarity
Computing the similarity of two sentences with Google's BERT algorithm (semantic and text similarity)
Stars: ✭ 348 (+286.67%)
Mutual labels:  bert
hugo-notice
A Hugo theme component to display nice notices
Stars: ✭ 138 (+53.33%)
Mutual labels:  multilingual
KitanaQA
KitanaQA: Adversarial training and data augmentation for neural question-answering models
Stars: ✭ 58 (-35.56%)
Mutual labels:  bert
TraduXio
A participative platform for cultural texts translators
Stars: ✭ 19 (-78.89%)
Mutual labels:  multilingual
backprop
Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
Stars: ✭ 229 (+154.44%)
Mutual labels:  bert
ganbert
Enhancing the BERT training with Semi-supervised Generative Adversarial Networks
Stars: ✭ 205 (+127.78%)
Mutual labels:  bert
oreilly-bert-nlp
This repository contains code for the O'Reilly Live Online Training for BERT
Stars: ✭ 19 (-78.89%)
Mutual labels:  bert
tensorflow-ml-nlp-tf2
Hands-on materials for "Natural Language Processing with TensorFlow 2 and Machine Learning (from logistic regression to BERT and GPT-3)"
Stars: ✭ 245 (+172.22%)
Mutual labels:  bert
question generator
An NLP system for generating reading comprehension questions
Stars: ✭ 188 (+108.89%)
Mutual labels:  bert
i18n-language.js
i18n-language.js is a simple i18n library written in vanilla JavaScript
Stars: ✭ 21 (-76.67%)
Mutual labels:  multilingual
neural-ranking-kd
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
Stars: ✭ 74 (-17.78%)
Mutual labels:  bert
Sohu2019
Entry for the 2019 Sohu Campus Algorithm Competition
Stars: ✭ 26 (-71.11%)
Mutual labels:  bert
CAIL
Competition model for the CAIL 2019 (Challenge of AI in Law) reading comprehension task
Stars: ✭ 34 (-62.22%)
Mutual labels:  bert
SA-BERT
CIKM 2020: Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots
Stars: ✭ 71 (-21.11%)
Mutual labels:  bert
Romanian-Transformers
This repo is the home of Romanian Transformers.
Stars: ✭ 60 (-33.33%)
Mutual labels:  bert


Deep Learning Models for Multilingual Hate Speech Detection

🇵🇹 🇸🇦 🇵🇱 🇮🇩 🇮🇹 Solving the problem of hate speech detection in 9 languages across 16 datasets. 🇫🇷 🇺🇸 🇪🇸 🇩🇪

New update -- 🎉 🎉 all our BERT models are available here. Be sure to check them out 🎉 🎉.

Demo

Please look here to see how to load the models and run inference.
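A minimal inference sketch using the Hugging Face transformers library is shown below. The checkpoint name is a hypothetical placeholder (substitute the actual model id from the link above), and the label order is an assumption:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Hate-speech-CNERG/dehatebert-mono-english"  # hypothetical id; use the linked checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("text to classify", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

print(torch.softmax(logits, dim=-1))  # class probabilities; label order is an assumption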

Please cite our paper in any published work that uses any of these resources.

@inproceedings{aluru2021deep,
  title={A Deep Dive into Multilingual Hate Speech Classification},
  author={Aluru, Sai Saketh and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
  booktitle={Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14--18, 2020, Proceedings, Part V},
  pages={423--439},
  year={2021},
  organization={Springer International Publishing}
}

Folder Description 👈


./Dataset             --> Contains the dataset-related files.
./BERT_Classifier     --> Contains the code for the BERT classifiers performing binary classification on the dataset
./CNN_GRU             --> Contains the code for the CNN-GRU model
./LASER+LR            --> Contains the code for the logistic regression classifier used on top of LASER embeddings

Requirements

Make sure to use Python 3 when running the scripts. The required packages can be installed by running pip install -r requirements.txt.


Dataset

Check out the Dataset folder to learn more about how we curated the datasets for the different languages. ⚠️ A few datasets require crawling, so we cannot guarantee retrieval of all the data points: tweets may get deleted. ⚠️
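For the Twitter-based datasets, re-hydrating texts from tweet IDs looks roughly like the sketch below. This is a hypothetical example using tweepy (Twitter API v2), not part of this repo; the bearer token and the IDs are placeholders, and deleted tweets simply come back empty:

import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

tweet_ids = [1234567890123456789]  # placeholder; use the IDs shipped with the dataset
resp = client.get_tweets(ids=tweet_ids, tweet_fields=["lang"])

recovered = {t.id: t.text for t in (resp.data or [])}  # deleted tweets are absent
print(f"recovered {len(recovered)} of {len(tweet_ids)} tweets")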


Models used for this task

We release the code for training/fine-tuning the following models, along with their hyperparameters.

🥇 best for high-resource languages, 🏅 best for low-resource languages

✈️ fastest to train, 🛩️ slowest to train

  1. mBERT Baseline: This setting uses the multilingual BERT model with the same language's dataset for training and testing. Refer to the BERT_Classifier folder for the code and usage instructions; a condensed fine-tuning sketch also appears after this list.

  2. mBERT All_but_one: 🥇🛩️ This setting uses the multilingual BERT model with training data from multiple languages, and validation and test data from a single target language. Refer to the BERT_Classifier folder for the code and usage instructions.

  3. Translation + BERT Baseline: This setting translates the other-language datasets to English and fine-tunes the bert-base model on these translated datasets. Refer to the BERT_Classifier folder for the code and usage instructions.

  4. CNN+GRU Baseline: This setting uses MUSE word embeddings with a CNN-GRU based model, training and testing on the same language (see the structural sketch after this list). Refer to the CNN_GRU folder for the code and usage instructions.

  5. LASER+LR baseline: ✈️ This setting trains a logistic regression model on the LASER embeddings of the dataset, with training and test data from the same language (see the sketch after this list). Refer to the LASER+LR folder for the code and usage instructions.

  6. LASER+LR all_but_one: 🏅 This setting trains a logistic regression model on the LASER embeddings of the dataset, with datasets from the other languages also used to train the LR model. Refer to the LASER+LR folder for the code and usage instructions.
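For settings 1-3, the fine-tuning loop is conceptually the one below. This is a condensed sketch, not the repo's actual training code (which, with the real hyperparameters, lives in the BERT_Classifier folder); the toy data, learning rate, and epoch count are assumptions:

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

texts = ["an innocuous sentence", "a hateful sentence"]  # toy placeholders
labels = torch.tensor([0, 1])
enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr is an assumption
model.train()
for epoch in range(3):  # epoch count is an assumption
    optimizer.zero_grad()
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"], labels=labels)
    out.loss.backward()  # cross-entropy loss computed internally by the model
    optimizer.step()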
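For setting 4, the model structure is roughly the following. This is a structural sketch only: the layer sizes and kernel width are assumptions, and in the actual setting the embedding layer is initialized from MUSE vectors (see the CNN_GRU folder for the real model):

import torch
import torch.nn as nn

class CNNGRU(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, n_filters=100, hidden=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # init from MUSE in practice
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.gru = nn.GRU(n_filters, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)  # binary hate / non-hate

    def forward(self, token_ids):
        x = self.embedding(token_ids)                 # (batch, seq, emb)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, filters, seq)
        _, h = self.gru(x.transpose(1, 2))            # h: (1, batch, hidden)
        return self.fc(h.squeeze(0))                  # (batch, 2) logits

logits = CNNGRU(vocab_size=5000)(torch.randint(0, 5000, (4, 32)))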
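For settings 5-6, the pipeline reduces to embedding sentences and fitting a logistic regression. A minimal sketch assuming the laserembeddings package (run python -m laserembeddings download-models first); the repo's own code in the LASER+LR folder may use the original LASER toolkit instead:

from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression

laser = Laser()  # requires the downloaded LASER model files

train_texts = ["an innocuous sentence", "a hateful sentence"]  # toy placeholders
train_labels = [0, 1]

X_train = laser.embed_sentences(train_texts, lang="en")  # 1024-dim embeddings

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

X_test = laser.embed_sentences(["a new sentence"], lang="en")
print(clf.predict(X_test))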

Blogs and GitHub repos we used for reference 👼

  1. MUSE embeddings are downloaded and extracted using the code from the MUSE GitHub repository
  2. For fine-tuning BERT, we used this blog by Chris McCormick, and we also referred to the Transformers GitHub repo
  3. For the CNN-GRU model, we used the original repo for reference
  4. For generating the LASER embeddings of the dataset, we used the code from the LASER GitHub repository

For more details about our paper

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, and Animesh Mukherjee. 2020. "Deep Learning Models for Multilingual Hate Speech Detection". ECML-PKDD.

Todos

  • Upload our models to the transformers community to make them public
  • Add arxiv paper link and description
  • Create an interface for social scientists where they can use our models easily with their data
  • Create a pull request to add the models to the official transformers repo
👍 The repo is still in active development. Feel free to create an issue!! 👍