
naver-ai / MetricMT

License: MIT
The official code repository for MetricMT, a reward optimization method for NMT with learned metrics.

Projects that are alternatives of or similar to MetricMT

Texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 2,236 (+9621.74%)
Mutual labels:  machine-translation
transformer
Build English-Vietnamese machine translation with ProtonX Transformer. :D
Stars: ✭ 41 (+78.26%)
Mutual labels:  machine-translation
OPUS-MT-train
Training open neural machine translation models
Stars: ✭ 166 (+621.74%)
Mutual labels:  machine-translation
Bleualign
Machine-Translation-based sentence alignment tool for parallel text
Stars: ✭ 199 (+765.22%)
Mutual labels:  machine-translation
Modernmt
Neural Adaptive Machine Translation that adapts to context and learns from corrections.
Stars: ✭ 231 (+904.35%)
Mutual labels:  machine-translation
sb-nmt
Code for Synchronous Bidirectional Neural Machine Translation (SB-NMT)
Stars: ✭ 66 (+186.96%)
Mutual labels:  machine-translation
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+10847.83%)
Mutual labels:  machine-translation
tai5-uan5 gian5-gi2 kang1-ku7
Taiwanese language tools (臺灣言語工具)
Stars: ✭ 79 (+243.48%)
Mutual labels:  machine-translation
ibleu
A visual and interactive scoring environment for machine translation systems.
Stars: ✭ 27 (+17.39%)
Mutual labels:  machine-translation
tvsub
TVsub: DCU-Tencent Chinese-English Dialogue Corpus
Stars: ✭ 40 (+73.91%)
Mutual labels:  machine-translation
Attention Mechanisms
Implementations for a family of attention mechanisms, suitable for all kinds of natural language processing tasks and compatible with TensorFlow 2.0 and Keras.
Stars: ✭ 203 (+782.61%)
Mutual labels:  machine-translation
Opennmt
Open Source Neural Machine Translation in Torch (deprecated)
Stars: ✭ 2,339 (+10069.57%)
Mutual labels:  machine-translation
osdg-tool
OSDG is an open-source tool that maps and connects activities to the UN Sustainable Development Goals (SDGs) by identifying SDG-relevant content in any text. The tool is available online at www.osdg.ai. API access available for research purposes.
Stars: ✭ 22 (-4.35%)
Mutual labels:  machine-translation
Lingvo
A framework for building neural networks in TensorFlow, particularly sequence models.
Stars: ✭ 2,361 (+10165.22%)
Mutual labels:  machine-translation
extreme-adaptation-for-personalized-translation
Code for the paper "Extreme Adaptation for Personalized Neural Machine Translation"
Stars: ✭ 42 (+82.61%)
Mutual labels:  machine-translation
Npmt
Towards Neural Phrase-based Machine Translation
Stars: ✭ 175 (+660.87%)
Mutual labels:  machine-translation
apertium-apy
📦 Apertium HTTP Server in Python
Stars: ✭ 29 (+26.09%)
Mutual labels:  machine-translation
skt
Sanskrit compound segmentation using seq2seq model
Stars: ✭ 21 (-8.7%)
Mutual labels:  machine-translation
Distill-BERT-Textgen
Research code for ACL 2020 paper: "Distilling Knowledge Learned in BERT for Text Generation".
Stars: ✭ 121 (+426.09%)
Mutual labels:  machine-translation
bergamot-translator
Cross-platform C++ library focused on optimized machine translation on consumer-grade devices.
Stars: ✭ 181 (+686.96%)
Mutual labels:  machine-translation

MetricMT - Reward Optimization for Neural Machine Translation with Learned Metrics

This is our official code repository. The paper is available on arXiv: https://arxiv.org/abs/2104.07541.

Authors: Raphael Shu, Kang Min Yoo and Jung-Woo Ha (NAVER AI Lab)

What is it about?

In short, we optimize NMT models with the state-of-the-art learned metric BLEURT and find that the resulting translations have higher adequacy and coverage than both the baseline and models trained with BLEU.

In machine translation, BLEU has been the dominant evaluation metric for years. However, criticism of BLEU dates back to at least 2006 (Callison-Burch et al., 2006). The best overall paper of ACL 2020 (Mathur et al., 2020) again shows that BLEU's correlation with human judgments drops to zero or even negative when comparing only a few top-tier systems, and its authors call for an end to using BLEU.

Recently, several model-based metrics have been proposed (ESIM, YiSi-1, BERTScore, BLEURT), all of which use or build on BERT. These metrics typically achieve much higher correlation with human judgments because they are fine-tuned on human evaluation data.

In our paper, we attempt to directly optimize NMT models with the state-of-the-art learned metric, BLEURT. The benefit is clear: because BLEURT is tuned with human scores, it can potentially reflect human preferences on translation quality. We want to know whether such training merely shifts the NMT parameters to game the metric, or whether it yields a meaningful improvement in quality.

For reward optimization, we found that a stable ranking-based sequence-level loss performs well and scales to large NMT and metric models.

How it works

We propose to use the following contrastive-margin loss, a pairwise ranking loss that separates the two candidates with the best and worst reward in the candidate space. Given a candidate set C produced by beam search, the loss has the following form:

    L(\theta) = \max\big(0,\; \Delta - \log p_\theta(\hat{y}^{+} \mid x) + \log p_\theta(\hat{y}^{-} \mid x)\big)

Here, r(\cdot) is the reward function. After we obtain the candidate set C with beam search, \hat{y}^{+} = \arg\max_{y \in C} r(y) denotes the candidate with the best reward, \hat{y}^{-} = \arg\min_{y \in C} r(y) the candidate with the worst reward, and \Delta is the margin.
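Below is a minimal PyTorch sketch of this loss, assuming the candidate log-probabilities and rewards are already collected into tensors. The function name, the default margin, and the toy numbers are illustrative assumptions, not our released implementation.

import torch

def contrastive_margin_loss(log_probs, rewards, delta=1.0):
    # log_probs: (num_candidates,) sequence log-probabilities under the NMT model
    # rewards:   (num_candidates,) metric scores, e.g. BLEURT, one per candidate
    best = rewards.argmax()    # candidate with the highest reward
    worst = rewards.argmin()   # candidate with the lowest reward
    # Hinge loss: push log p(y_best | x) above log p(y_worst | x) by the margin.
    return torch.clamp(delta - log_probs[best] + log_probs[worst], min=0.0)

# Toy example with five beam candidates.
log_probs = torch.tensor([-12.3, -10.1, -11.7, -13.0, -10.9], requires_grad=True)
rewards = torch.tensor([0.42, 0.55, 0.31, 0.60, 0.28])
loss = contrastive_margin_loss(log_probs, rewards)
loss.backward()  # gradients touch only the best- and worst-reward candidates

Only two candidates enter the backward pass, in line with the lower memory footprint discussed below.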

This reward-optimizing loss has a lower memory footprint than the risk minimization loss, and it is more stable than REINFORCE and the max-margin loss. In the paper, we show that it can effectively optimize both smoothed BLEU and BLEURT as rewards.
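As a concrete, hedged example of the smoothed-BLEU reward, each beam candidate can be scored against the reference with exponentially smoothed sentence-level BLEU via the sacrebleu library; the exact smoothing variant used in the paper may differ.

import sacrebleu

def smoothed_bleu_rewards(candidates, reference):
    # One reward per candidate: sentence BLEU with exponential smoothing.
    return [
        sacrebleu.sentence_bleu(cand, [reference], smooth_method="exp").score
        for cand in candidates
    ]

rewards = smoothed_bleu_rewards(
    ["the cat sat on the mat", "a cat is on the mat"],
    "the cat is on the mat",
)  # e.g. feed these rewards into the contrastive-margin loss above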

Results

We perform automatic and human evaluations to compare the optimized models with the baselines. The experiments are conducted on German-English, Romanian-English, Russian-English, and Japanese-English datasets. All are to-English datasets, as the pretrained BLEURT model covers only English.

The results are interesting. In three out of four language pairs, we found that the BLEURT score increases significantly after optimization; however, the optimization hurts BLEU. Here are the automatic scores:

Automatic Evaluation

Then we performed a pairwise human evaluation on three criteria: adequacy, fluency, and coverage. Here are the results:

Human Evaluation

We can see that the BLEURT-optimized model tends to have better adequacy and coverage, and it outperforms models trained with smoothed BLEU. For fluency, annotators found little difference overall, which may indicate that the NLL loss is already good at ensuring fluency. Please see our paper for more details.

Getting Started

Our method can be applied to any MT metric (including non-differentiable ones) to improve human-perceived quality. We invite others to try our method with various metrics!
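For instance, a BLEURT reward hook could look like the following sketch, using the public bleurt package from google-research. The checkpoint path is a placeholder, and this is not our released training code.

from bleurt import score as bleurt_score

# Path to a downloaded BLEURT checkpoint (placeholder; any checkpoint works).
scorer = bleurt_score.BleurtScorer("path/to/BLEURT-20")

def bleurt_rewards(candidates, reference):
    # One BLEURT score per candidate. No gradients flow through the metric,
    # so any black-box scorer can be swapped in behind the same interface.
    return scorer.score(
        references=[reference] * len(candidates),
        candidates=candidates,
    )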

We will release the source code to reproduce our method very soon. Stay tuned!

Citing our Work

@article{shu2021reward,
    title={Reward Optimization for Neural Machine Translation with Learned Metrics},
    author={Shu, Raphael and Yoo, Kang Min and Ha, Jung-Woo},
    year={2021},
    journal={arXiv preprint arXiv:2104.07541},
}