Tiiiger / Bert_score

Licence: mit
BERT score for text generation

BERTScore

Automatic Evaluation Metric described in the paper BERTScore: Evaluating Text Generation with BERT (ICLR 2020).

News:

  • Features to appear in the next version (currently in the master branch):

    • Fix bugs for mBART
    • Support 4 mT5 models as requested (#93)
  • Updated to version 0.3.8

    • Support 53 new pretrained models including BART, mBART, BORT, DeBERTa, T5, BERTweet, MPNet, ConvBERT, SqueezeBERT, SpanBERT, PEGASUS, Longformer, LED, Blenderbot, etc. Among them, DeBERTa achieves higher correlation with human scores than RoBERTa (our default) on the WMT16 dataset. The correlations are presented in this Google sheet.
    • Please consider using --model_type microsoft/deberta-xlarge-mnli or --model_type microsoft/deberta-large-mnli (faster) if you want the scores to correlate better with human scores.
    • Add baseline files for DeBERTa models.
    • Add example code to generate baseline files (please see the details).
  • Updated to version 0.3.7

    • Compatible with Huggingface's transformers version >=4.0.0. Thanks to public contributors (#84, #85, #86).
  • See #22 if you want to replicate our experiments on the COCO Captioning dataset.

  • For people in China, downloading pre-trained weights can be very slow. We provide copies of a few models on Baidu Pan.

  • Huggingface's datasets library includes BERTScore in their metric collection.

Previous updates

  • Updated to version 0.3.6
    • Support custom baseline files #74
    • The option --rescale-with-baseline is changed to --rescale_with_baseline so that it is consistent with other options.
  • Updated to version 0.3.5
    • Compatible with Huggingface's transformers >=v3.0.0 and minor fixes (#58, #66, #68)
    • Several improvements related to efficiency (#67, #69)
  • Updated to version 0.3.4
    • Compatible with transformers v2.11.0 now (#58)
  • Updated to version 0.3.3
    • Fixing the bug with empty strings (issue #47).
    • Supporting 6 ELECTRA models and 24 smaller BERT models.
    • A new Google sheet for tracking the performance (i.e., Pearson correlation with human judgment) of different models on WMT16 to-English.
    • Including the script for tuning the best number of layers of an English pre-trained model on WMT16 to-English data (See the details).
  • Updated to version 0.3.2
    • Bug fixed: fixing the bug in v0.3.1 that occurred with multiple reference sentences.
    • Supporting multiple reference sentences with our command line tool.
  • Updated to version 0.3.1
    • A new BERTScorer object that caches the model to avoid re-loading it multiple times. Please see our jupyter notebook example for the usage.
    • Supporting multiple reference sentences for each example. The score function now can take a list of lists of strings as the references and return the score between the candidate sentence and its closest reference sentence.

Please see release logs for older updates.

Authors:

  • Tianyi Zhang*
  • Varsha Kishore*
  • Felix Wu*
  • Kilian Q. Weinberger
  • Yoav Artzi

*: Equal Contribution

Overview

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.

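For an illustration, BERTScore recall can be computed by matching each token in the reference sentence to its most similar token in the candidate sentence and averaging the resulting cosine similarities.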
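A minimal sketch of the greedy matching behind recall, precision, and F1, assuming each token already has an embedding vector. The 2-d vectors and the function names below are made up for illustration; the real implementation lives in bert_score and works on contextual embeddings from the pre-trained model (optionally with idf weights).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(ref_vecs, cand_vecs):
    """Return (precision, recall, f1) from greedy token matching."""
    # Recall: match every reference token to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    # Precision: match every candidate token to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "embeddings", one vector per token (illustrative values only).
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0]]
precision, recall, f1 = greedy_bertscore(ref_vecs, cand_vecs)
print(f"P: {precision:.4f} R: {recall:.4f} F1: {f1:.4f}")
```

Each token is matched independently to its best counterpart, so one candidate token may serve as the best match for several reference tokens.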

If you find this repo useful, please cite:

@inproceedings{bert-score,
  title={BERTScore: Evaluating Text Generation with BERT},
  author={Tianyi Zhang* and Varsha Kishore* and Felix Wu* and Kilian Q. Weinberger and Yoav Artzi},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=SkeHuCVFDr}
}

Installation

  • Python version >= 3.6
  • PyTorch version >= 1.0.0

Install from PyPI with pip:

pip install bert-score

Install the latest unstable version from the master branch on GitHub:

pip install git+https://github.com/Tiiiger/bert_score

Install from source:

git clone https://github.com/Tiiiger/bert_score
cd bert_score
pip install .

You can then test your installation with:

python -m unittest discover
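To confirm which release pip actually installed, something like the following may help. It is only a sketch: installed_version is a hypothetical helper, not part of bert_score, and importlib.metadata requires Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(dist_name="bert-score"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print("bert-score version:", version if version else "not installed")
```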

Usage

Python Function

At a high level, we provide a Python function bert_score.score and a Python object bert_score.BERTScorer. The function provides all the supported features, while the scorer object caches the BERT model to facilitate multiple evaluations. Check our demo to see how to use these two interfaces. Please refer to bert_score/score.py for implementation details.
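A rough sketch of how the two interfaces might be called. The wrapper names (check_parallel, score_once, make_scorer) are hypothetical and only there to keep the example importable without the package; the bert_score imports happen lazily inside the functions, and the first real call downloads the model.

```python
def check_parallel(cands, refs):
    """Raise ValueError unless there is one reference entry per candidate."""
    if len(cands) != len(refs):
        raise ValueError(f"{len(cands)} candidates but {len(refs)} references")


def score_once(cands, refs, lang="en"):
    """One-off evaluation with the function interface."""
    from bert_score import score  # lazy import; needs `pip install bert-score`

    check_parallel(cands, refs)
    # Returns precision, recall, and F1, one value per candidate sentence.
    return score(cands, refs, lang=lang)


def make_scorer(lang="en"):
    """Build a scorer object that keeps the model loaded between calls."""
    from bert_score import BERTScorer  # lazy import, as above

    return BERTScorer(lang=lang)


# Intended usage (not executed here because it downloads the model):
#   P, R, F1 = score_once(["On the table are two apples."],
#                         ["There are two bananas on the table."])
#   scorer = make_scorer()
#   P, R, F1 = scorer.score(cands, refs)   # reuse for further batches
```

The scorer object is the better fit when several candidate sets are evaluated in one process, since the model is loaded only once.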

Running BERTScore can be computationally intensive (because it uses BERT :p), so a GPU is usually necessary. If you don't have access to a GPU, you can try our demo on Google Colab.

Command Line Interface (CLI)

We provide a command line interface (CLI) for BERTScore in addition to the Python module. The CLI can be used as follows:

  1. To evaluate English text files:

We provide example inputs under ./example.

bert-score -r example/refs.txt -c example/hyps.txt --lang en

You will get the following output at the end:

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0) P: 0.957378 R: 0.961325 F1: 0.959333

where "roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)" is the hash code.

Starting from version 0.3.0, we support rescaling the scores with baseline scores:

bert-score -r example/refs.txt -c example/hyps.txt --lang en --rescale_with_baseline

You will get:

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)-rescaled P: 0.747044 R: 0.770484 F1: 0.759045

This makes the range of the scores larger and more human-readable. Please see this post for details.
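As a sketch of what the rescaling does, assume the linear form (score - baseline) / (1 - baseline). Under that assumption, the two example outputs above imply a precision baseline of roughly 0.83. The helper names are hypothetical, and the baseline is back-solved from the printed numbers rather than read from a baseline file.

```python
def rescale(score, baseline):
    """Linearly rescale a raw score so that the baseline maps to 0 and 1 stays 1."""
    return (score - baseline) / (1.0 - baseline)


def implied_baseline(raw, rescaled):
    """Invert rescale(): recover the baseline from a raw/rescaled pair."""
    return (raw - rescaled) / (1.0 - rescaled)


# Precision values copied from the two example outputs above.
raw_p, rescaled_p = 0.957378, 0.747044
baseline_p = implied_baseline(raw_p, rescaled_p)
print(f"implied precision baseline: {baseline_p:.4f}")
print(f"rescaled precision: {rescale(raw_p, baseline_p):.6f}")
```

Because the transformation is linear, it does not change the ranking of systems; it only spreads the scores over a wider range.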

When you have multiple reference sentences, use:

bert-score -r example/refs.txt example/refs2.txt -c example/hyps.txt --lang en

where the -r argument supports an arbitrary number of reference files. Each reference file should have the same number of lines as your candidate/hypothesis file. The i-th line in each reference file corresponds to the i-th line in the candidate file.
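For the Python interface, the same line-by-line alignment can be reproduced by grouping the i-th line of every reference file into one list. This is a sketch only: read_lines and load_multi_refs are hypothetical helpers, not part of bert_score.

```python
from pathlib import Path


def read_lines(path):
    """Read a text file and return its lines without trailing newlines."""
    return [line.strip() for line in Path(path).read_text(encoding="utf-8").splitlines()]


def load_multi_refs(cand_path, ref_paths):
    """Return (cands, refs) where refs[i] lists all references for cands[i]."""
    cands = read_lines(cand_path)
    ref_files = [read_lines(p) for p in ref_paths]
    for path, lines in zip(ref_paths, ref_files):
        if len(lines) != len(cands):
            raise ValueError(
                f"{path} has {len(lines)} lines, expected {len(cands)}"
            )
    # Transpose: one list of references per candidate line.
    refs = [list(group) for group in zip(*ref_files)]
    return cands, refs


# Intended usage with the example files mentioned above:
#   cands, refs = load_multi_refs("example/hyps.txt",
#                                 ["example/refs.txt", "example/refs2.txt"])
#   The resulting list of lists can be passed as the references to bert_score.score.
```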

  2. To evaluate text files in other languages:

We currently support the 104 languages in multilingual BERT (full list).

Please specify the two-letter abbreviation of the language. For instance, use --lang zh for Chinese text.

See more options with bert-score -h.

  3. To load your own custom model: Please specify the path to the model and the number of layers to use with --model and --num_layers.
bert-score -r example/refs.txt -c example/hyps.txt --model path_to_my_bert --num_layers 9
  4. To visualize matching scores:
bert-score-show --lang en -r "There are two bananas on the table." -c "On the table are two apples." -f out.png

The figure will be saved to out.png.

Practical Tips

  • Report the hash code (e.g., roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)-rescaled) in your paper so that people know what setting you use. This is inspired by sacreBLEU. Changes in huggingface's transformers version may also affect the score (See issue #46).
  • Unlike BERT, RoBERTa uses a GPT-2-style tokenizer, which creates additional " " tokens when multiple spaces appear together. It is recommended to remove the additional spaces with sent = re.sub(r' +', ' ', sent) or sent = re.sub(r'\s+', ' ', sent).
  • Using inverse document frequency (idf) on the reference sentences to weigh word importance may correlate better with human judgment. However, when the set of reference sentences becomes too small, the idf scores become inaccurate or invalid, so idf weighting is optional. To use idf, set --idf when using the CLI tool or idf=True when calling the bert_score.score function.
  • When you are low on GPU memory, consider setting batch_size when calling the bert_score.score function.
  • To use a particular model, set -m MODEL_TYPE when using the CLI tool or model_type=MODEL_TYPE when calling the bert_score.score function.
  • We tune the layer to use based on the WMT16 metric evaluation dataset. You may use a different layer by setting -l LAYER or num_layers=LAYER. To tune the best layer for your custom model, please follow the instructions in the tune_layers folder.
  • Limitation: Because BERT, RoBERTa, and XLM with learned positional embeddings are pre-trained on sentences with a maximum length of 512, BERTScore is undefined for sentences longer than 510 tokens (512 after adding the [CLS] and [SEP] tokens). Longer sentences will be truncated. Please consider using XLNet, which can support much longer inputs.
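Several of these tips can be combined in one call. The sketch below normalizes whitespace first and then passes idf and batch_size to bert_score.score. Both function names are hypothetical, the batch size of 16 is an arbitrary illustrative value, and the bert_score import is lazy so the snippet loads without the package.

```python
import re


def normalize_spaces(sent):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", sent)


def score_with_tips(cands, refs, batch_size=16):
    """Score single-reference data with idf weighting and a smaller batch size."""
    from bert_score import score  # lazy import; needs `pip install bert-score`

    cands = [normalize_spaces(c) for c in cands]
    refs = [normalize_spaces(r) for r in refs]
    # idf=True weighs tokens by inverse document frequency over the references;
    # a smaller batch_size lowers peak GPU memory at the cost of speed.
    return score(cands, refs, lang="en", idf=True, batch_size=batch_size)


print(normalize_spaces("There are  two   bananas on the table."))
```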

Default Behavior

Default Model

Language    Model
en          roberta-large
en-sci      scibert-scivocab-uncased
zh          bert-base-chinese
others      bert-base-multilingual-cased
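The lookup in the table above can be pictured as a dictionary with a fallback. This is an illustrative stand-in that mirrors the table, not the library's own code; DEFAULT_MODELS, FALLBACK_MODEL, and default_model are hypothetical names.

```python
# Mirrors the table above; any language not listed falls back to multilingual BERT.
DEFAULT_MODELS = {
    "en": "roberta-large",
    "en-sci": "scibert-scivocab-uncased",
    "zh": "bert-base-chinese",
}
FALLBACK_MODEL = "bert-base-multilingual-cased"


def default_model(lang):
    """Return the default model name for a language code."""
    return DEFAULT_MODELS.get(lang.lower(), FALLBACK_MODEL)


for code in ("en", "zh", "de"):
    print(code, "->", default_model(code))
```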

Default Layers

Please see this Google sheet for the supported models and their performance.

Acknowledgement

This repo wouldn't be possible without the awesome bert, fairseq, and transformers.
