awasthiabhijeet / PIE

Licence: MIT License
Fast + Non-Autoregressive Grammatical Error Correction using BERT. Code and Pre-trained models for paper "Parallel Iterative Edit Models for Local Sequence Transduction": www.aclweb.org/anthology/D19-1435.pdf (EMNLP-IJCNLP 2019)

Programming Languages

  • Macaulay2: 6 projects
  • python: 139335 projects (#7 most used programming language)
  • shell: 77523 projects

Projects that are alternatives of or similar to PIE

Kashgari
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.
Stars: ✭ 2,235 (+1262.8%)
Mutual labels:  bert, sequence-labeling, bert-model
SIGIR2021 Conure
One Person, One Model, One World: Learning Continual User Representation without Forgetting
Stars: ✭ 23 (-85.98%)
Mutual labels:  bert, bert-model
vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
Stars: ✭ 22 (-86.59%)
Mutual labels:  bert, bert-embeddings
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (-57.32%)
Mutual labels:  bert, bert-model
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. Supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (-7.93%)
Mutual labels:  bert, sequence-labeling
gender-unbiased BERT-based pronoun resolution
Source code for the ACL workshop paper and Kaggle competition by Google AI team
Stars: ✭ 42 (-74.39%)
Mutual labels:  bert, bert-model
BERT-BiLSTM-CRF
A Keras implementation of BERT-BiLSTM-CRF.
Stars: ✭ 40 (-75.61%)
Mutual labels:  bert, sequence-labeling
bert-tensorflow-pytorch-spacy-conversion
Instructions for how to convert a BERT TensorFlow model to work with HuggingFace's pytorch-transformers and spaCy. This walk-through uses DeepPavlov's RuBERT as an example.
Stars: ✭ 26 (-84.15%)
Mutual labels:  bert, bert-model
bert-movie-reviews-sentiment-classifier
Build a Movie Reviews Sentiment Classifier with Google's BERT Language Model
Stars: ✭ 12 (-92.68%)
Mutual labels:  bert, bert-model
siamese-BERT-fake-news-detection-LIAR
Triple Branch BERT Siamese Network for fake news classification on LIAR-PLUS dataset in PyTorch
Stars: ✭ 96 (-41.46%)
Mutual labels:  bert-model, bert-models
fairseq-tagging
a Fairseq fork for sequence tagging/labeling tasks
Stars: ✭ 26 (-84.15%)
Mutual labels:  sequence-labeling
syntaxdot
Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing.
Stars: ✭ 32 (-80.49%)
Mutual labels:  bert
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-86.59%)
Mutual labels:  bert
bert-as-a-service TFX
End-to-end pipeline with TFX to train and deploy a BERT model for sentiment analysis.
Stars: ✭ 32 (-80.49%)
Mutual labels:  bert
Mengzi
Mengzi Pretrained Models
Stars: ✭ 238 (+45.12%)
Mutual labels:  bert
KoBERT-nsmc
Naver movie review sentiment classification with KoBERT
Stars: ✭ 57 (-65.24%)
Mutual labels:  bert
MobileQA
An offline, on-device reading comprehension application; QA for mobile, Android & iPhone.
Stars: ✭ 49 (-70.12%)
Mutual labels:  bert
semantic-document-relations
Implementation, trained models and result data for the paper "Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles"
Stars: ✭ 21 (-87.2%)
Mutual labels:  bert
bangla-bert
Bangla-Bert is a pretrained BERT model for the Bengali language
Stars: ✭ 41 (-75%)
Mutual labels:  bert
iamQA
A Chinese Wikipedia QA reading comprehension system. Uses an NER model trained on CCKS2016 data and a reading comprehension model trained on CMRC2018, along with W2V word-vector search; deployed with TorchServe.
Stars: ✭ 46 (-71.95%)
Mutual labels:  bert

PIE: Parallel Iterative Edit Models for Local Sequence Transduction

Fast Grammatical Error Correction using BERT

Code and Pre-trained models accompanying our paper "Parallel Iterative Edit Models for Local Sequence Transduction" (EMNLP-IJCNLP 2019)

PIE is a BERT-based architecture for local sequence transduction tasks such as Grammatical Error Correction (GEC). Unlike the standard approach of modeling GEC as translation from an "incorrect" to a "correct" language, we pose GEC as a local sequence editing task. We further reduce the local sequence editing problem to a sequence labeling setup, in which BERT non-autoregressively labels each input token with an edit. We rewire the BERT architecture (without retraining) specifically for the task of sequence editing. We find that PIE models for GEC are 5 to 15 times faster than existing state-of-the-art architectures while maintaining competitive accuracy. For more details, please check out our EMNLP-IJCNLP 2019 paper:

@inproceedings{awasthi-etal-2019-parallel,
    title = "Parallel Iterative Edit Models for Local Sequence Transduction",
    author = "Awasthi, Abhijeet  and
      Sarawagi, Sunita  and
      Goyal, Rasna  and
      Ghosh, Sabyasachi  and
      Piratla, Vihari",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1435",
    doi = "10.18653/v1/D19-1435",
    pages = "4259--4269",
}
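
To make the edit-as-labels formulation above concrete, here is a toy, hand-written illustration of how per-token edits reconstruct a corrected sentence. It is a minimal sketch: the label names COPY, DELETE, APPEND_* and REPLACE_* are simplified stand-ins for the opcodes actually defined in opcodes.py, and the real edit space also includes suffix transformations.

    # Minimal sketch: one edit is predicted per input token, and all edits are
    # applied independently in a single pass (no left-to-right decoding).
    def apply_edits(tokens, edits):
        out = []
        for tok, edit in zip(tokens, edits):
            if edit == "COPY":
                out.append(tok)
            elif edit == "DELETE":
                continue
            elif edit.startswith("APPEND_"):    # keep the token, insert a word after it
                out.extend([tok, edit[len("APPEND_"):]])
            elif edit.startswith("REPLACE_"):   # substitute the token
                out.append(edit[len("REPLACE_"):])
        return out

    tokens = ["he", "go", "to", "the", "school"]
    edits  = ["COPY", "REPLACE_goes", "COPY", "DELETE", "COPY"]
    print(" ".join(apply_edits(tokens, edits)))   # -> "he goes to school"

At inference time this labeling step is repeated for a few rounds (multi_round_infer.sh performs 4), since some corrections only become available after earlier edits have been applied.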

Datasets

  • All the public GEC datasets used in the paper can be obtained from here
  • Synthetically created datasets (perturbed versions of the One Billion Word corpus), divided into 5 parts to independently train 5 ensemble models. (All ensemble models are further finetuned using the public GEC datasets mentioned above.)

Pretrained Models

  • PIE as reported in the paper
    • trained on a synthetically created GEC dataset, starting from BERT's initialization
    • further finetuned on the Lang8, NUCLE and FCE datasets
  • Inference using the pretrained PIE ckpt
    • Copy the pretrained checkpoint files provided above to the PIE_ckpt directory
    • Your PIE_ckpt directory should contain:
      • bert_config.json
      • multi_round_infer.sh
      • pie_infer.sh
      • pie_model.ckpt.data-00000-of-00001
      • pie_model.ckpt.index
      • pie_model.ckpt.meta
      • vocab.txt
    • Run: $ ./multi_round_infer.sh from the PIE_ckpt directory
    • NOTE: If you are using Cloud TPUs for inference, move the PIE_ckpt directory to your cloud bucket and change the paths in pie_infer.sh and multi_round_infer.sh accordingly

Code Description

An example usage of the code is described in the directory "example_scripts".

  • preprocess.sh
    • Extracts common insertions from sample training data in the "scratch" directory
    • Converts the training data into the form of incorrect tokens and aligned edits
  • pie_train.sh
    • trains a PIE model using the converted training data
  • multi_round_infer.sh
    • uses a trained PIE model to obtain edits for incorrect sentences
    • does 4 rounds of iterative editing
    • uses CoNLL-14 test sentences
  • m2_eval.sh
    • evaluates the final output using m2scorer
  • end_to_end.sh
    • describes the use of pre-processing, training, inference and evaluation scripts end to end.
  • More information can be found in the README.md inside "example_scripts"

Pre-processing and edit-related code

  • seq2edits_utils.py
    • contains an implementation of the edit-distance algorithm
    • the substitution cost is modified as per Section A.1 of the paper
    • adapted from belambert's implementation
  • get_edit_vocab.py: Extracts common insertions (the \Sigma_a set described in the paper) from a parallel corpus; a rough sketch of the idea follows this list
  • get_seq2edits.py: Extracts edits aligned to input tokens
  • tokenize_input.py: Tokenizes a file containing sentences; the resulting token_ids are the input to the model
  • opcodes.py: A class whose members are all possible edit operations
  • transform_suffixes.py: Contains the logic for suffix transformations
  • tokenization.py: Similar to BERT's implementation, with some GEC-specific changes
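
As a rough illustration of the idea behind get_edit_vocab.py, the sketch below counts the most frequently inserted tokens in a parallel corpus to form an append vocabulary. It is an assumption-laden toy (it uses difflib rather than the repository's modified edit-distance alignment), so the extracted set will differ in detail from the actual \Sigma_a.

    from collections import Counter
    import difflib

    def common_insertions(pairs, top_k=1000):
        """Count tokens that appear only on the corrected side of an alignment."""
        counts = Counter()
        for incorrect, correct in pairs:
            src, tgt = incorrect.split(), correct.split()
            matcher = difflib.SequenceMatcher(None, src, tgt, autojunk=False)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes():
                if tag == "insert":          # present in the correction, absent in the source
                    counts.update(tgt[j1:j2])
        return [tok for tok, _ in counts.most_common(top_k)]

    pairs = [("he go to school", "he goes to the school")]
    print(common_insertions(pairs, top_k=5))   # -> ['the']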

PIE model (uses the TensorFlow implementation of BERT)

  • word_edit_model.py: Implementation of PIE for learning from a parallel corpus of incorrect tokens and aligned edits
    • logit factorization logic (keep the flag use_bert_more=True to enable logit factorization; a rough sketch of the idea follows this list)
    • parallel sequence labeling
  • modeling.py: Same as in BERT's implementation
  • modified_modeling.py
    • Rewires the attention mask to obtain representations of candidate appends and replacements
    • Used for logit factorization
  • optimization.py: Same as in BERT's implementation
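
The following is a very rough, illustrative sketch of the logit factorization idea, not the exact parameterization used in word_edit_model.py or the paper's equations: the score of a composite edit such as "append word w after token i" is assembled from a score for retaining the token plus a score coupling the token's hidden state with a representation of the candidate word (in the real model, those candidate representations come from the rewired attention in modified_modeling.py).

    import numpy as np

    def factorized_edit_logits(h_i, M_append, M_replace, w_copy, w_delete):
        """h_i: [hidden] hidden state of input token i.
        M_append / M_replace: [V, hidden] representations of candidate words.
        Returns one logit per edit in {copy, delete, append(w), replace(w)}."""
        copy_logit = h_i @ w_copy
        delete_logit = h_i @ w_delete
        append_logits = copy_logit + M_append @ h_i    # keep token i, then insert word w
        replace_logits = M_replace @ h_i               # substitute token i with word w
        return np.concatenate(([copy_logit, delete_logit], append_logits, replace_logits))

    rng = np.random.default_rng(0)
    hidden, vocab = 8, 5
    logits = factorized_edit_logits(rng.normal(size=hidden),
                                    rng.normal(size=(vocab, hidden)),
                                    rng.normal(size=(vocab, hidden)),
                                    rng.normal(size=hidden),
                                    rng.normal(size=hidden))
    print(logits.shape)   # (2 + 2 * vocab,): all edits for token i scored in parallel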

Post-processing

  • apply_opcode.py
    • Applies inferred edits from the PIE model to the incorrect sentences.
    • Handles punctuation and spacing as per the requirements of a standard dataset (INFER_MODE).
    • Contains some obvious rules for capitalization etc. (a toy sketch of this kind of cleanup follows this list).
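
For intuition, a toy version of this kind of cleanup might look like the sketch below. The rules are hypothetical and much simpler than what apply_opcode.py does for any particular INFER_MODE.

    import re

    def postprocess(text):
        """Toy cleanup: remove spaces before punctuation, normalize whitespace,
        and capitalize the first character of the sentence."""
        text = re.sub(r"\s+([,.!?;:])", r"\1", text)   # "school ." -> "school."
        text = re.sub(r"\s+", " ", text).strip()
        return text[:1].upper() + text[1:] if text else text

    print(postprocess("he goes to school ."))   # -> "He goes to school."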

Creating synthetic GEC dataset

  • The errorify directory contains the scripts we used for perturbing the one-billion-word corpus; a toy sketch of the idea follows
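
As a toy illustration of that kind of corruption (made-up rules and probabilities, not the ones implemented in the errorify scripts):

    import random

    ARTICLES = {"a", "an", "the"}

    def errorify(sentence, rng=None):
        """Toy corruption: occasionally drop an article, duplicate a token,
        or swap two adjacent tokens, yielding an (incorrect, correct) pair."""
        rng = rng or random.Random(0)
        out = []
        for tok in sentence.split():
            r = rng.random()
            if tok.lower() in ARTICLES and r < 0.3:
                continue                              # drop an article
            if r > 0.95:
                out.append(tok)                       # duplicate a token
            out.append(tok)
        if len(out) > 1 and rng.random() < 0.2:
            i = rng.randrange(len(out) - 1)
            out[i], out[i + 1] = out[i + 1], out[i]   # swap two adjacent tokens
        return " ".join(out)

    clean = "the cat sat on the mat"
    print((errorify(clean), clean))   # (perturbed sentence, original) = training pair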

Acknowledgements

This research was partly sponsored by a Google India AI/ML Research Award and a Google PhD Fellowship in Machine Learning. We gratefully acknowledge Google's TFRC program for providing us with Cloud TPUs. Thanks to Varun Patil for helping us improve the speed of the pre-processing and synthetic-data generation pipelines.
