pmichel31415 / Mtnt

License: MIT
Code for the collection and analysis of the MTNT dataset

Programming Languages

python

Projects that are alternatives of or similar to Mtnt

Text2sql Data
A collection of datasets that pair questions with SQL queries.
Stars: ✭ 287 (+497.92%)
Mutual labels:  dataset, natural-language-processing
Nlp Progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
Stars: ✭ 19,518 (+40562.5%)
Mutual labels:  natural-language-processing, machine-translation
Zhihu
This repo contains the source code for my personal column (https://zhuanlan.zhihu.com/zhaoyeyu), implemented in Python 3.6. It includes Natural Language Processing and Computer Vision projects, such as text generation, machine translation, and deep convolutional GANs, with hands-on example code.
Stars: ✭ 3,307 (+6789.58%)
Mutual labels:  natural-language-processing, machine-translation
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (+431.25%)
Mutual labels:  dataset, natural-language-processing
Insuranceqa Corpus Zh
🚁 An insurance-domain corpus and chatbot (in Chinese)
Stars: ✭ 821 (+1610.42%)
Mutual labels:  dataset, natural-language-processing
Oie Resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (+489.58%)
Mutual labels:  dataset, natural-language-processing
Data Science
Collection of useful data science topics along with code and articles
Stars: ✭ 315 (+556.25%)
Mutual labels:  scraping, natural-language-processing
Pytorch Nlp
Basic Utilities for PyTorch Natural Language Processing (NLP)
Stars: ✭ 1,996 (+4058.33%)
Mutual labels:  dataset, natural-language-processing
Texar Pytorch
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 636 (+1225%)
Mutual labels:  natural-language-processing, machine-translation
Hate Speech And Offensive Language
Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017
Stars: ✭ 543 (+1031.25%)
Mutual labels:  dataset, natural-language-processing
Chazutsu
The tool to make NLP datasets ready to use
Stars: ✭ 238 (+395.83%)
Mutual labels:  dataset, natural-language-processing
String To Tree Nmt
Source code and data for the paper "Towards String-to-Tree Neural Machine Translation"
Stars: ✭ 16 (-66.67%)
Mutual labels:  natural-language-processing, machine-translation
Korean Hate Speech
Korean HateSpeech Dataset
Stars: ✭ 192 (+300%)
Mutual labels:  dataset, natural-language-processing
Clean Text
🧹 Python package for text cleaning
Stars: ✭ 284 (+491.67%)
Mutual labels:  scraping, natural-language-processing
Nlp bahasa resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (+229.17%)
Mutual labels:  dataset, natural-language-processing
Bytenet Tensorflow
ByteNet for character-level language modelling
Stars: ✭ 319 (+564.58%)
Mutual labels:  natural-language-processing, machine-translation
Mams For Absa
A Multi-Aspect Multi-Sentiment Dataset for aspect-based sentiment analysis.
Stars: ✭ 135 (+181.25%)
Mutual labels:  dataset, natural-language-processing
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (+189.58%)
Mutual labels:  dataset, natural-language-processing
Doccano
Open source annotation tool for machine learning practitioners.
Stars: ✭ 5,600 (+11566.67%)
Mutual labels:  dataset, natural-language-processing
Nlg Eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
Stars: ✭ 822 (+1612.5%)
Mutual labels:  natural-language-processing, machine-translation
MTNT

MTNT: A Testbed for Machine Translation of Noisy Text

This repo contains the code for the EMNLP 2018 paper MTNT: A Testbed for Machine Translation of Noisy Text. It will allow you to reproduce the collection process as well as the MT experiments. You can access the data here.

Prerequisites

For preprocessing, you will need Moses (for tokenization, clean-up, etc.), sentencepiece (for subwords), and KenLM (for n-gram language modeling). If you want to work with Japanese data, you should also install Kytea (for word segmentation).

To run the collection code, you will need the following python modules:

kenlm
langid
numpy
pickle
praw
sentencepiece>=0.1.6
yaml
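As a quick sanity check before running the collection code, a small standard-library-only sketch like the following can report which of these modules are missing (note that pickle ships with Python itself):

```python
import importlib.util

# Modules listed above; pickle is part of the standard library.
REQUIRED = ["kenlm", "langid", "numpy", "pickle", "praw", "sentencepiece", "yaml"]

def missing_modules(names):
    """Return the names that cannot be found by the import machinery."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All collection dependencies found.")
```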

Finally, for the MT experiments, refer to the README in the recipes folder.

Downloading and Preparing the Data

From this folder, run

# Monolingual en data from WMT17
bash scripts/download_en.sh config/data.en.config
bash scripts/prepare_model config/data.en.config

# Monolingual fr data from WMT15
bash scripts/download_fr.sh config/data.fr.config
bash scripts/prepare_model config/data.fr.config

# Prepare en<->fr parallel data
bash scripts/prepare-en-fr.sh config/data.fr.config path/to/moses/scripts

# Download and prepare the en<->ja monolingual and parallel data
bash scripts/download_ja.sh config/data.ja.config path/to/moses/scripts

# Download and extract MTNT
wget http://www.cs.cmu.edu/~pmichel1/hosting/MTNT.1.0.tar.gz && tar xvzf MTNT.1.0.tar.gz && rm MTNT.1.0.tar.gz
# Split the tsv files
bash MTNT/split_tsv.sh

You can edit the config/data.{en,fr,ja}.config files to change filenames, subword parameters, etc.

Running the Scraper

Edit the config/{en,fr,ja}_reddit.yaml files to include the appropriate credentials for your bot. You can also change some of the parameters (subreddits, etc.).
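The exact keys depend on the scraper code, so keep whatever field names the shipped YAML files already use; a hypothetical credentials block, assuming the standard praw fields, might look like:

```yaml
# Hypothetical field names -- match the keys already present in the file.
client_id: YOUR_CLIENT_ID          # from your Reddit app settings
client_secret: YOUR_CLIENT_SECRET
username: YOUR_BOT_ACCOUNT
password: YOUR_BOT_PASSWORD
user_agent: "MTNT scraper by /u/YOUR_USERNAME"
subreddits:
  - askreddit
  - news
```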

Then run

bash scripts/start_scraper.sh [config_file]

When running the scraper, be mindful of the Reddit API terms.

Analysing the Data

You can analyse the collected data using the various scripts in the analysis folder, for example:

# Count the number of profanities (should return 38)
cat MTNT/test/test.en-fr.en | python3 analysis/count_keywords.py resources/profanities.en
# Count the number of emojis (should return 46)
cat MTNT/test/test.en-fr.en | python3 analysis/count_emojis.py
# Check the ratio of US/UK spellings (using -ise/-ize endings, which are a good indicator) (should return 35.7% 64.3%)
cat MTNT/test/test.en-fr.en | python3 analysis/uk_us_ratio.py
# Count the number of informal pronouns (in Japanese) (should return 35)
kytea -model /path/to/kytea/data/model.bin -out tok MTNT/test/test.ja-en.ja | python3 analysis/count_keywords.py resources/informal_pronouns.ja
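The keyword-counting scripts are conceptually simple; a minimal re-implementation sketch is shown below (the real analysis/count_keywords.py may normalize case or tokenize differently):

```python
def count_keywords(lines, keywords):
    """Count whitespace-separated tokens in `lines` that appear in `keywords`.

    A simplified sketch of keyword counting over tokenized text.
    """
    keyword_set = set(keywords)
    return sum(token in keyword_set
               for line in lines
               for token in line.split())

# Example: counting a two-word keyword list over a few lines of text.
sample = ["this damn thing", "oh hell no", "nothing here"]
print(count_keywords(sample, ["damn", "hell"]))  # prints 2
```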

Citing

If you use this code or the MTNT dataset, please cite the following publication:

@InProceedings{michel2018mtnt,
  author    = {Michel, Paul  and  Neubig, Graham},
  title     = {{MTNT}: A Testbed for {M}achine {T}ranslation of {N}oisy {T}ext},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2018}
}

License

The code is released under the MIT License. The data is released under the terms of the Reddit API.
