pmichel31415 / Mtnt

License: MIT
Code for the collection and analysis of the MTNT dataset

Programming Languages

python

Projects that are alternatives of or similar to Mtnt

Text2sql Data
A collection of datasets that pair questions with SQL queries.
Stars: ✭ 287 (+497.92%)
Mutual labels:  dataset, natural-language-processing
Nlp Progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
Stars: ✭ 19,518 (+40562.5%)
Mutual labels:  natural-language-processing, machine-translation
Zhihu
This repo contains the source code for my personal column (https://zhuanlan.zhihu.com/zhaoyeyu), implemented in Python 3.6. It includes Natural Language Processing and Computer Vision projects, such as text generation, machine translation, and deep convolutional GANs, with hands-on example code.
Stars: ✭ 3,307 (+6789.58%)
Mutual labels:  natural-language-processing, machine-translation
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (+431.25%)
Mutual labels:  dataset, natural-language-processing
Insuranceqa Corpus Zh
🚁 An insurance-domain corpus and chatbot (in Chinese)
Stars: ✭ 821 (+1610.42%)
Mutual labels:  dataset, natural-language-processing
Oie Resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (+489.58%)
Mutual labels:  dataset, natural-language-processing
Data Science
Collection of useful data science topics along with code and articles
Stars: ✭ 315 (+556.25%)
Mutual labels:  scraping, natural-language-processing
Pytorch Nlp
Basic Utilities for PyTorch Natural Language Processing (NLP)
Stars: ✭ 1,996 (+4058.33%)
Mutual labels:  dataset, natural-language-processing
Texar Pytorch
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 636 (+1225%)
Mutual labels:  natural-language-processing, machine-translation
Hate Speech And Offensive Language
Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017
Stars: ✭ 543 (+1031.25%)
Mutual labels:  dataset, natural-language-processing
Chazutsu
The tool to make NLP datasets ready to use
Stars: ✭ 238 (+395.83%)
Mutual labels:  dataset, natural-language-processing
String To Tree Nmt
Source code and data for the paper "Towards String-to-Tree Neural Machine Translation"
Stars: ✭ 16 (-66.67%)
Mutual labels:  natural-language-processing, machine-translation
Korean Hate Speech
Korean HateSpeech Dataset
Stars: ✭ 192 (+300%)
Mutual labels:  dataset, natural-language-processing
Clean Text
🧹 Python package for text cleaning
Stars: ✭ 284 (+491.67%)
Mutual labels:  scraping, natural-language-processing
Nlp bahasa resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (+229.17%)
Mutual labels:  dataset, natural-language-processing
Bytenet Tensorflow
ByteNet for character-level language modelling
Stars: ✭ 319 (+564.58%)
Mutual labels:  natural-language-processing, machine-translation
Mams For Absa
A Multi-Aspect Multi-Sentiment Dataset for aspect-based sentiment analysis.
Stars: ✭ 135 (+181.25%)
Mutual labels:  dataset, natural-language-processing
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (+189.58%)
Mutual labels:  dataset, natural-language-processing
Doccano
Open source annotation tool for machine learning practitioners.
Stars: ✭ 5,600 (+11566.67%)
Mutual labels:  dataset, natural-language-processing
Nlg Eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
Stars: ✭ 822 (+1612.5%)
Mutual labels:  natural-language-processing, machine-translation
MTNT

MTNT: A Testbed for Machine Translation of Noisy Text

This repo contains the code for the EMNLP 2018 paper MTNT: A Testbed for Machine Translation of Noisy Text. It will allow you to reproduce the collection process as well as the MT experiments. You can access the data here.

Prerequisites

For preprocessing, you will need Moses (for tokenization, clean-up, etc.), sentencepiece (for subwords), and KenLM (for n-gram language modeling). If you want to work with Japanese data, you should also install Kytea (for word segmentation).

To run the collection code, you will need the following python modules:

kenlm
langid
numpy
pickle
praw
sentencepiece>=0.1.6
yaml
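As a quick sanity check before running the collection code, a small standard-library-only sketch like the following can report which of these modules are missing (note that pickle ships with Python itself):

```python
import importlib.util

# Modules listed above; pickle is part of the standard library.
REQUIRED = ["kenlm", "langid", "numpy", "pickle", "praw", "sentencepiece", "yaml"]

def missing_modules(names):
    """Return the names that cannot be found by the import machinery."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All collection dependencies found.")
```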

Finally, for the MT experiments, refer to the README in the recipes folder.

Downloading and Preparing the Data

From this folder, run

# Monolingual en data from WMT17
bash scripts/download_en.sh config/data.en.config
bash scripts/prepare_model config/data.en.config

# Monolingual fr data from WMT15
bash scripts/download_fr.sh config/data.fr.config
bash scripts/prepare_model config/data.fr.config

# Prepare en<->fr parallel data
bash scripts/prepare-en-fr.sh config/data.fr.config path/to/moses/scripts

# Download and prepare the en<->ja monolingual and parallel data
bash scripts/download_ja.sh config/data.ja.config path/to/moses/scripts

# Download and extract MTNT
wget http://www.cs.cmu.edu/~pmichel1/hosting/MTNT.1.0.tar.gz && tar xvzf MTNT.1.0.tar.gz && rm MTNT.1.0.tar.gz
# Split the tsv files
bash MTNT/split_tsv.sh

You can edit the config/data.{en,fr,ja}.config files to change filenames, subword parameters, etc.

Running the Scraper

Edit the config/{en,fr,ja}_reddit.yaml files to include the appropriate credentials for your bot. You can also change some of the parameters (subreddits, etc.).
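The exact keys depend on the scraper code, so keep whatever field names the shipped YAML files already use; a hypothetical credentials block, assuming the standard praw fields, might look like:

```yaml
# Hypothetical field names -- match the keys already present in the file.
client_id: YOUR_CLIENT_ID          # from your Reddit app settings
client_secret: YOUR_CLIENT_SECRET
username: YOUR_BOT_ACCOUNT
password: YOUR_BOT_PASSWORD
user_agent: "MTNT scraper by /u/YOUR_USERNAME"
subreddits:
  - askreddit
  - news
```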

Then run

bash scripts/start_scraper.sh [config_file]

When running the scraper, be mindful of the Reddit API terms.

Analysing the Data

You can analyse the collected data using the various scripts in the analysis folder, for example:

# Count the number of profanities (should return 38)
cat MTNT/test/test.en-fr.en | python3 analysis/count_keywords.py resources/profanities.en
# Count the number of emojis (should return 46)
cat MTNT/test/test.en-fr.en | python3 analysis/count_emojis.py
# Check the ratio of US/UK spellings (using -ise/-ize endings, which are a good indicator) (should return 35.7% 64.3%)
cat MTNT/test/test.en-fr.en | python3 analysis/uk_us_ratio.py
# Count the number of informal pronouns (in Japanese) (should return 35)
kytea -model /path/to/kytea/data/model.bin -out tok MTNT/test/test.ja-en.ja | python3 analysis/count_keywords.py resources/informal_pronouns.ja
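The keyword-counting scripts are conceptually simple; a minimal re-implementation sketch is shown below (the real analysis/count_keywords.py may normalize case or tokenize differently):

```python
def count_keywords(lines, keywords):
    """Count whitespace-separated tokens in `lines` that appear in `keywords`.

    A simplified sketch of keyword counting over tokenized text.
    """
    keyword_set = set(keywords)
    return sum(token in keyword_set
               for line in lines
               for token in line.split())

# Example: counting a two-word keyword list over a few lines of text.
sample = ["this damn thing", "oh hell no", "nothing here"]
print(count_keywords(sample, ["damn", "hell"]))  # prints 2
```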

Citing

If you use this code or the MTNT dataset, please cite the following publication:

@InProceedings{michel2018mtnt,
  author    = {Michel, Paul  and  Neubig, Graham},
  title     = {{MTNT}: A Testbed for {M}achine {T}ranslation of {N}oisy {T}ext},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2018}
}

License

The code is released under the MIT License. The data is released under the terms of the Reddit API.
