License: MIT
Code and data of "Methods for Detoxification of Texts for the Russian Language" paper


Methods for Detoxification of Texts for the Russian Language (ruDetoxifier)

This repository contains models and an evaluation methodology for the detoxification of Russian texts. The original paper, "Methods for Detoxification of Texts for the Russian Language", was presented at the Dialogue-2021 conference.

Inference Example

In this repository, we release the two best models, detoxGPT and condBERT (see Methodology for more details). You can try a detoxification inference example in this notebook or open it in Colab.

Interactive Demo

You can also test our models via the web demo, or pour out your anger on our Telegram bot.


Methodology

In our research, we tested several approaches:

Baselines

  • Duplicate: simple duplication of the input;
  • Delete: removal of rude and toxic words from a pre-defined vocabulary;
  • Retrieve: retrieval based on cosine similarity between word embeddings from the non-toxic part of the RuToxic dataset.
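The Delete baseline above can be sketched as a simple vocabulary filter. The word list below is a toy stand-in for the actual rude-words list shipped in data/train:

```python
# Toy sketch of the Delete baseline: drop any token that appears in a
# pre-defined vocabulary of rude/toxic words (toy list, not the real one).
TOXIC_VOCAB = {"идиот", "дурак"}

def delete_baseline(sentence: str) -> str:
    """Remove tokens found in the toxic vocabulary, keep the rest."""
    kept = [tok for tok in sentence.split()
            if tok.lower().strip(".,!?") not in TOXIC_VOCAB]
    return " ".join(kept)
```

Note that such token-level deletion may leave ungrammatical output, which is exactly what the language-quality metric below penalizes.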

detoxGPT

Based on ruGPT models. This method requires a parallel dataset for training. We tested ruGPT-small, ruGPT-medium, and ruGPT-large models in several setups:

  • zero-shot: the model is taken as is (with no fine-tuning). The input is the toxic sentence we would like to detoxify, prepended with the prefix “Перефразируй” (rus. Paraphrase) and followed by the suffix “>>>” to indicate the paraphrasing task;
  • few-shot: the model is taken as is. Unlike the previous scenario, we prepend a prefix consisting of parallel pairs of toxic and neutral sentences;
  • fine-tuned: the model is fine-tuned for the paraphrasing task on a parallel dataset.
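The zero-shot and few-shot prompt formats described above can be sketched as plain string construction. The exact separators and spacing are assumptions; feeding the prompt to an actual ruGPT checkpoint (e.g. via the transformers library) is omitted here:

```python
# Sketch of the detoxGPT prompt formats (separators are assumed, not
# taken from the released code). Generation with a real ruGPT model
# is out of scope for this snippet.
def zero_shot_prompt(toxic: str) -> str:
    """Prefix with "Перефразируй", suffix with ">>>" for paraphrasing."""
    return f"Перефразируй {toxic} >>>"

def few_shot_prompt(pairs, toxic):
    """Prepend parallel toxic/neutral examples before the query sentence."""
    demos = "\n".join(f"Перефразируй {t} >>> {n}" for t, n in pairs)
    return f"{demos}\nПерефразируй {toxic} >>>"
```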

condBERT

Based on the BERT model. This method does not require a parallel dataset for training. One of the tasks on which the original BERT was pretrained -- predicting a word that was replaced with a [MASK] token -- suits the delete-retrieve-generate style transfer method. We tested RuBERT and Geotrend pre-trained models in several setups:

  • zero-shot, where BERT is taken as is (with no extra fine-tuning);
  • fine-tuned, where BERT is fine-tuned on a dataset of toxic and safe sentences to acquire a style-dependent distribution, as described above.
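A minimal sketch of the condBERT idea: mask vocabulary-flagged tokens, then let a masked language model propose substitutes. The prediction step is stubbed with a toy replacement instead of a real RuBERT call:

```python
# Sketch of condBERT-style masking. The toxic vocabulary and the
# "prediction" are toy stand-ins for the real vocab and RuBERT model.
TOXIC_VOCAB = {"идиот"}

def mask_toxic(sentence: str) -> str:
    """Replace toxic tokens with BERT's [MASK] token."""
    return " ".join("[MASK]" if tok in TOXIC_VOCAB else tok
                    for tok in sentence.split())

def fill_masks(masked: str, toy_prediction: str = "человек") -> str:
    """Stand-in for the masked-LM prediction of each [MASK] token."""
    return masked.replace("[MASK]", toy_prediction)
```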

Automatic Evaluation

The evaluation consists of three types of metrics:

  • style transfer accuracy (STA): accuracy based on a toxic/non-toxic classifier (we expect the resulting text to be non-toxic);
  • content preservation:
    • word overlap (WO);
    • BLEU: accuracy based on n-grams (1-4);
    • cosine similarity (CS): between vectors of texts’ embeddings.
  • language quality: perplexity (PPL) based on language model.
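Two of the content-preservation metrics can be sketched as follows: word overlap as a Jaccard-style ratio of token sets, and cosine similarity over embedding vectors. The released ru_metric.py script may define them differently:

```python
import math

# Sketches of word overlap (WO) and cosine similarity (CS); the exact
# definitions in ru_metric.py may differ.
def word_overlap(source: str, output: str) -> float:
    """Jaccard-style word overlap between the two token sets."""
    a, b = set(source.lower().split()), set(output.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine_similarity(u, v):
    """Cosine similarity between two text-embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0
```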

Finally, the aggregated metric (GM) is the geometric mean of STA, CS, and PPL.
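One plausible form of this aggregation is a geometric mean in which perplexity enters inverted, so that lower PPL raises the score. Whether the released script inverts and clamps PPL exactly this way is an assumption:

```python
# Assumed aggregation: geometric mean of STA, CS and inverted PPL.
# The clamping via eps is also an assumption to keep the product defined.
def aggregate_gm(sta: float, cs: float, ppl: float, eps: float = 1e-9) -> float:
    terms = (max(sta, eps), max(cs, eps), max(1.0 / ppl, eps))
    return (terms[0] * terms[1] * terms[2]) ** (1.0 / 3.0)
```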

Launching

You can run the ru_metric.py script for evaluation. The fine-tuned weights for the toxicity classifier can be found here.


Results

Method                   STA↑   CS↑    WO↑    BLEU↑  PPL↓     GM↑
Baselines
  Duplicate              0.00   1.00   1.00   1.00   146.00   0.05 ± 0.0012
  Delete                 0.27   0.96   0.85   0.81   263.55   0.10 ± 0.0007
  Retrieve               0.91   0.85   0.07   0.09   65.74    0.22 ± 0.0010
detoxGPT-small
  zero-shot              0.93   0.20   0.00   0.00   159.11   0.10 ± 0.0005
  few-shot               0.17   0.70   0.05   0.06   83.38    0.11 ± 0.0009
  fine-tuned             0.51   0.70   0.05   0.05   39.48    0.20 ± 0.0011
detoxGPT-medium
  fine-tuned             0.49   0.77   0.18   0.21   86.75    0.16 ± 0.0009
detoxGPT-large
  fine-tuned             0.61   0.77   0.22   0.21   36.92    0.23 ± 0.0010
condBERT
  DeepPavlov zero-shot   0.53   0.80   0.42   0.61   668.58   0.08 ± 0.0006
  DeepPavlov fine-tuned  0.52   0.86   0.51   0.53   246.68   0.12 ± 0.0007
  Geotrend zero-shot     0.62   0.85   0.54   0.64   237.46   0.13 ± 0.0009
  Geotrend fine-tuned    0.66   0.86   0.54   0.64   209.95   0.14 ± 0.0009

Data

The data folder contains all training datasets, the test data, and a naive example of a style transfer result:

  • data/train: the RuToxic dataset, a list of Russian rude words, and 200 parallel sentence pairs used for ruGPT fine-tuning;
  • data/test: 10,000 samples used for evaluation of the approaches;
  • data/results: an example of the style transfer output format, illustrated with naive duplication.

Citation

If you find this repository helpful, feel free to cite our publication:

@article{DBLP:journals/corr/abs-2105-09052,
  author    = {Daryna Dementieva and
               Daniil Moskovskiy and
               Varvara Logacheva and
               David Dale and
               Olga Kozlova and
               Nikita Semenov and
               Alexander Panchenko},
  title     = {Methods for Detoxification of Texts for the Russian Language},
  journal   = {CoRR},
  volume    = {abs/2105.09052},
  year      = {2021},
  url       = {https://arxiv.org/abs/2105.09052},
  archivePrefix = {arXiv},
  eprint    = {2105.09052},
  timestamp = {Mon, 31 May 2021 16:16:57 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2105-09052.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contacts

For any questions please contact Daryna Dementieva via email or Telegram.
