
butsugiri / Gec Pseudodata

License: MIT
Repository of "An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction" (EMNLP-IJCNLP 2019)

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Gec Pseudodata

Nlp base
Fundamental models for natural language processing
Stars: ✭ 524 (+969.39%)
Mutual labels:  nlp-machine-learning
Click2analyze Androiddevchallenge
An app to analyze text and fix anomalies in messages that deviate from what is standard, normal, or expected. #AndroidDevChallenge
Stars: ✭ 20 (-59.18%)
Mutual labels:  nlp-machine-learning
Coursera Natural Language Processing Specialization
Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.
Stars: ✭ 39 (-20.41%)
Mutual labels:  nlp-machine-learning
Text Summarization Tensorflow
TensorFlow seq2seq implementation of text summarization.
Stars: ✭ 527 (+975.51%)
Mutual labels:  encoder-decoder
Rasa Ui
Rasa UI is a frontend for the Rasa Framework
Stars: ✭ 796 (+1524.49%)
Mutual labels:  nlp-machine-learning
Sdtm mapper
AI SDTM mapping (R for ML, Python, TensorFlow for DL)
Stars: ✭ 27 (-44.9%)
Mutual labels:  nlp-machine-learning
Deeplearning.ai Natural Language Processing Specialization
This repository contains my full work and notes for Coursera's Natural Language Processing Specialization, taught by Younes Bensouda Mourri and Łukasz Kaiser and offered by deeplearning.ai.
Stars: ✭ 473 (+865.31%)
Mutual labels:  encoder-decoder
Mitie chinese wikipedia corpus
Pre-trained Wikipedia corpus by MITIE
Stars: ✭ 43 (-12.24%)
Mutual labels:  nlp-machine-learning
Deep Visual Attention Prediction
Keras implementation of paper 'Deep Visual Attention Prediction' which predicts human eye fixation on view-free scenes.
Stars: ✭ 19 (-61.22%)
Mutual labels:  encoder-decoder
Talismane
NLP framework: sentence detector, tokeniser, pos-tagger and dependency parser
Stars: ✭ 38 (-22.45%)
Mutual labels:  nlp-machine-learning
Chinese models for spacy
SpaCy Chinese models | Models for SpaCy that support Chinese
Stars: ✭ 543 (+1008.16%)
Mutual labels:  nlp-machine-learning
Deeppavlov
An open source library for deep learning end-to-end dialog systems and chatbots.
Stars: ✭ 5,525 (+11175.51%)
Mutual labels:  nlp-machine-learning
Letslearnai.github.io
Lets Learn AI
Stars: ✭ 33 (-32.65%)
Mutual labels:  nlp-machine-learning
Ffmpegandroid
The latest ffmpeg 3.3 for Android, ported via CMake, implementing encoding/decoding, transcoding, stream push/pull, filters, and other features
Stars: ✭ 526 (+973.47%)
Mutual labels:  encoder-decoder
Tika Python
Tika-Python is a Python binding to the Apache Tika™ REST services, allowing Tika to be called natively from Python.
Stars: ✭ 997 (+1934.69%)
Mutual labels:  nlp-machine-learning
Babyai
BabyAI platform. A testbed for training agents to understand and execute language commands.
Stars: ✭ 490 (+900%)
Mutual labels:  nlp-machine-learning
Banglatranslator
Bangla Machine Translator
Stars: ✭ 21 (-57.14%)
Mutual labels:  encoder-decoder
News push project
Real Time News Scraping and Recommendation System - React | Tensorflow | NLP | News Scrapers
Stars: ✭ 44 (-10.2%)
Mutual labels:  nlp-machine-learning
Predicting Myers Briggs Type Indicator With Recurrent Neural Networks
Stars: ✭ 43 (-12.24%)
Mutual labels:  nlp-machine-learning
Sockeye
Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet
Stars: ✭ 990 (+1920.41%)
Mutual labels:  encoder-decoder

pseudodata-for-gec

This is the official repository of the following paper:

An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction
Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, Kentaro Inui
2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), 2019

Requirements

  • Python 3.6 or higher
  • PyTorch (version 1.0.1.post2 is recommended)
  • blingfire (for preprocessing - sentence splitting)
  • spaCy (for preprocessing - tokenization)
  • subword-nmt (for splitting the data into subwords)
  • fairseq (I used commit ID: 3658fa3 for all experiments. I strongly recommend sticking with the same commit ID.)
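
As a minimal setup sketch (the PyPI package names and the fairseq repository URL are my assumptions; the commit ID is the one recommended above):

#! /bin/sh
# Hypothetical environment setup for the requirements listed above.
pip install torch==1.0.1.post2 blingfire spacy subword-nmt

# fairseq, pinned to the commit used for all experiments
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout 3658fa3
pip install --editable .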

Resources

Reproducing the CoNLL2014/JFLEG/BEA-test Result

  • Download the test sets from the appropriate sources.
  • Split the source sentences into subwords using this BPE code file (a subword-nmt sketch follows the decoding script below).
  • Run the following command; output.txt is the decoded result.
#! /bin/sh
set -xe

cd /path/to/cloned/fairseq

# PATHs
CHECKPOINT="/path/to/downloaded/model.pt"  # available at https://github.com/butsugiri/gec-pseudodata#resources
SRC_BPE="/path/to/src_file"  # this file must already be split into subwords
DATA_DIR="/path/to/vocab_dir"  # i.e., `vocab` dir in this repository

# Decoding
cat $SRC_BPE | python -u interactive.py ${DATA_DIR} \
    --path ${CHECKPOINT} \
    --source-lang src_bpe8000 \
    --target-lang trg_bpe8000 \
    --buffer-size 1024 \
    --batch-size 12 \
    --log-format simple \
    --beam 5 \
    --remove-bpe \
    | tee temp.txt

cat temp.txt | grep -e "^H" | cut -f1,3 | sed 's/^..//' | sort -n -k1  | cut -f2 > output.txt
rm temp.txt
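
For the subword-splitting step above, one option is subword-nmt with the BPE code file from this repository; this is a sketch, and the paths are placeholders:

subword-nmt apply-bpe -c /path/to/bpe_code < test_src.txt > test_src.bpe  # use as SRC_BPE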

The pretlarge+SSE (fine-tuned) model should achieve F0.5 = 62.03.

Generating Pseudo Data from Monolingual Corpus

Preprocessing

  • ssplit_and_tokenize.py applies sentence splitting and tokenization.
  • remove_dirty_examples.py removes noisy examples (details are described in the script); a hypothetical invocation is sketched below.
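
The command-line interfaces of these two scripts are not documented here; as a rough sketch, assuming both read from stdin and write to stdout:

#! /bin/sh
# Hypothetical invocation; the actual flags of both scripts may differ.
cat raw_monolingual.txt \
    | python ssplit_and_tokenize.py \
    | python remove_dirty_examples.py \
    > monolingual_corpus.tok
# Apply the BPE code file (e.g. with subword-nmt, as sketched earlier)
# to obtain monolingual_corpus.bpe for the DirectNoise step below.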

DirectNoise

  • cat monolingual_corpus.bpe | python count_unigram_freq.py > freq_file
  • python normalize_unigram_freq.py --norm 100 < freq_file > norm_freq_file
  • python generate_pseudo_samples.py -uf norm_freq_file -po 0.2 -pm 0.7 --single_mistake 0 --seed 2020 > proc_file
  • Feed proc_file to fairseq-preprocess (the whole pipeline is consolidated in the sketch below).
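
Strung together, with placeholder file names and the flags copied verbatim from the steps above:

#! /bin/sh
set -xe

# 1) Count unigram frequencies over the BPE-split monolingual corpus.
cat monolingual_corpus.bpe | python count_unigram_freq.py > freq_file

# 2) Normalize the counts.
python normalize_unigram_freq.py --norm 100 < freq_file > norm_freq_file

# 3) Generate noised pseudo examples (see the script itself for how
#    the input corpus is supplied).
python generate_pseudo_samples.py -uf norm_freq_file -po 0.2 -pm 0.7 \
    --single_mistake 0 --seed 2020 > proc_file

# 4) Binarize proc_file with fairseq-preprocess before training.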

Citing

If you use the resources in this repository, please cite our paper.

@InProceedings{kiyono-etal-2019-empirical,
    title = "An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction",
    author = "Kiyono, Shun  and
      Suzuki, Jun  and
      Mita, Masato  and
      Mizumoto, Tomoya  and
      Inui, Kentaro",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1119",
    pages = "1236--1242",
    abstract = "The incorporation of pseudo data in the training of grammatical error correction models has been one of the main factors in improving the performance of such models. However, consensus is lacking on experimental configurations, namely, choosing how the pseudo data should be generated or used. In this study, these choices are investigated through extensive experiments, and state-of-the-art performance is achieved on the CoNLL-2014 test set (F0.5=65.0) and the official test set of the BEA-2019 shared task (F0.5=70.2) without making any modifications to the model architecture."
}