
grammarly / Gector

License: Apache-2.0
Official implementation of the paper “GECToR – Grammatical Error Correction: Tag, Not Rewrite”, published at the 15th BEA Workshop (co-located with ACL 2020): https://www.aclweb.org/anthology/2020.bea-1.16.pdf

Programming Languages

Python

Projects that are alternatives to or similar to Gector

Neuronblocks
NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego
Stars: ✭ 1,356 (+372.47%)
Mutual labels:  natural-language-processing, sequence-labeling
Neuronlp2
Deep neural models for core NLP tasks (Pytorch version)
Stars: ✭ 397 (+38.33%)
Mutual labels:  natural-language-processing, sequence-labeling
Ncrfpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy to use for any sequence labeling task (e.g. NER, POS tagging, segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
Stars: ✭ 1,767 (+515.68%)
Mutual labels:  natural-language-processing, sequence-labeling
Seqeval
A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.)
Stars: ✭ 508 (+77%)
Mutual labels:  natural-language-processing, sequence-labeling
Anago
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
Stars: ✭ 1,392 (+385.02%)
Mutual labels:  natural-language-processing, sequence-labeling
Flair
A very simple framework for state-of-the-art Natural Language Processing (NLP)
Stars: ✭ 11,065 (+3755.4%)
Mutual labels:  natural-language-processing, sequence-labeling
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (-51.57%)
Mutual labels:  natural-language-processing, sequence-labeling
Bluebert
BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).
Stars: ✭ 273 (-4.88%)
Mutual labels:  natural-language-processing
Hscrf Pytorch
ACL 2018: Hybrid semi-Markov CRF for Neural Sequence Labeling (http://aclweb.org/anthology/P18-2038)
Stars: ✭ 284 (-1.05%)
Mutual labels:  sequence-labeling
Nlp tasks
Natural Language Processing Tasks and References
Stars: ✭ 2,968 (+934.15%)
Mutual labels:  natural-language-processing
Recurrent Entity Networks
TensorFlow implementation of "Tracking the World State with Recurrent Entity Networks".
Stars: ✭ 276 (-3.83%)
Mutual labels:  natural-language-processing
Rnnsharp
RNNSharp is a toolkit of deep recurrent neural networks widely used for many different kinds of tasks, such as sequence labeling and sequence-to-sequence learning. It is written in C# and based on .NET Framework 4.6 or above. RNNSharp supports many different types of networks, such as forward and bidirectional networks and sequence-to-sequence networks, and different types of layers, such as LSTM, softmax, sampled softmax and others.
Stars: ✭ 277 (-3.48%)
Mutual labels:  sequence-labeling
Link Grammar
The CMU Link Grammar natural language parser
Stars: ✭ 286 (-0.35%)
Mutual labels:  natural-language-processing
Pyswip
PySwip is a Python–SWI-Prolog bridge that enables querying SWI-Prolog from your Python programs. It features an (incomplete) SWI-Prolog foreign language interface, a utility class that makes querying with Prolog easy, and a Pythonic interface.
Stars: ✭ 276 (-3.83%)
Mutual labels:  natural-language-processing
Trade Dst
Source code for transferable dialogue state generator (TRADE, Wu et al., 2019). https://arxiv.org/abs/1905.08743
Stars: ✭ 287 (+0%)
Mutual labels:  natural-language-processing
Autonlp
🤗 AutoNLP: train state-of-the-art natural language processing models and deploy them in a scalable environment automatically
Stars: ✭ 263 (-8.36%)
Mutual labels:  natural-language-processing
Ner
Named Entity Recognition
Stars: ✭ 288 (+0.35%)
Mutual labels:  natural-language-processing
Text2sql Data
A collection of datasets that pair questions with SQL queries.
Stars: ✭ 287 (+0%)
Mutual labels:  natural-language-processing
Rnn For Joint Nlu
Tensorflow implementation of "Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling" (https://arxiv.org/abs/1609.01454)
Stars: ✭ 281 (-2.09%)
Mutual labels:  sequence-labeling
Languagecrunch
LanguageCrunch NLP server docker image
Stars: ✭ 281 (-2.09%)
Mutual labels:  natural-language-processing

GECToR – Grammatical Error Correction: Tag, Not Rewrite

This repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:

GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)

It is mainly based on AllenNLP and transformers.

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.
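
The repository itself does not prescribe an environment manager; as a minimal sketch, assuming a python3.7 interpreter is available on your system, the dependencies can be installed into an isolated virtual environment (the environment name is illustrative):

python3.7 -m venv gector-env
source gector-env/bin/activate
pip install -r requirements.txt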

Datasets

All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the following command:

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
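
As a concrete illustration with hypothetical file paths, where SOURCE is assumed to hold the original (errorful) sentences and TARGET the corrected sentences, aligned line by line:

python utils/preprocess_data.py -s data/train_source.txt \
                                -t data/train_target.txt \
                                -o data/train_preprocessed.txt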

Pretrained models

Pretrained encoder       Confidence bias   Min error prob   CoNLL-2014 (test)   BEA-2019 (test)
BERT [link]              0.10              0.41             63.0                67.6
RoBERTa [link]           0.20              0.50             64.0                71.5
XLNet [link]             0.35              0.66             65.3                72.4
RoBERTa + XLNet          0.24              0.45             66.0                73.7
BERT + RoBERTa + XLNet   0.16              0.40             66.5                73.6

Train model

To train the model, simply run:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR

There are many parameters you can specify; among them (see the example training command after this list):

  • cold_steps_count - the number of epochs during which only the last linear layer is trained
  • transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert} - the encoder model to use
  • tn_prob - the probability of sampling sentences with no errors; helps to balance precision/recall
  • pieces_per_token - the maximum number of subword pieces per token; helps to avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
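
As a hedged example, a run that sets the parameters listed above; the file paths and hyperparameter values are illustrative and are not the ones used in the paper:

python train.py --train_set data/train_preprocessed.txt \
                --dev_set data/dev_preprocessed.txt \
                --model_dir models/xlnet_gector \
                --transformer_model xlnet \
                --cold_steps_count 2 \
                --tn_prob 0 \
                --pieces_per_token 5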

Training parameters

All the parameters that we used for training and evaluation are described here.

Model inference

To run your model on an input file, use the following command:

python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \
                  --output_file OUTPUT_FILE

Among the parameters (an example invocation follows the list):

  • min_error_probability - minimum error probability (as in the paper)
  • additional_confidence - confidence bias (as in the paper)
  • special_tokens_fix - set this to reproduce some of the reported results of the pretrained models
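
For example, a hedged invocation that reuses the XLNet thresholds from the table of pretrained models above; the model, vocabulary, and data paths are illustrative rather than prescribed by this README:

python predict.py --model_path models/xlnet_gector/best.th \
                  --vocab_path data/output_vocabulary \
                  --input_file data/input_sentences.txt \
                  --output_file data/corrected_sentences.txt \
                  --min_error_probability 0.66 \
                  --additional_confidence 0.35

Whether special_tokens_fix also needs to be set depends on which pretrained encoder is used.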

For evaluation, use M^2Scorer and ERRANT.
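
For reference, a rough sketch of how these scorers are typically invoked; the file paths are illustrative, and the flags shown follow the tools' own documentation at the time of writing, so check the versions you have installed:

# M^2 scorer: score the system output against a gold .m2 reference file
./m2scorer data/corrected_sentences.txt data/conll14-gold.m2

# ERRANT: align original and corrected sentences into an .m2 file, then compare with the reference
errant_parallel -orig data/input_sentences.txt -cor data/corrected_sentences.txt -out data/hyp.m2
errant_compare -hyp data/hyp.m2 -ref data/reference.m2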

Text Simplification

The code and README for the Text Simplification version of GECToR will be added soon.

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn  and
      Atrasevych, Vitaliy  and
      Chernodub, Artem  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.bea-1.16",
    pages = "163--170",
    abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{\_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{\_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}