facebookresearch / GENRE

Autoregressive Entity Retrieval

The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in pytorch.

@inproceedings{de2020autoregressive,
  title={Autoregressive Entity Retrieval},
  author={Nicola De Cao and Gautier Izacard and Sebastian Riedel and Fabio Petroni},
  booktitle={International Conference on Learning Representations},
  url={https://openreview.net/forum?id=5k8F6UU39V},
  year={2021}
}

Please consider citing our work if you use code from this repository.

In a nutshell, GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned BART architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to only generate valid identifiers. For example, for Wikipedia page retrieval in open-domain question answering, GENRE maps the input question directly to a Wikipedia page title.
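The constraint is implemented with a prefix tree (trie) built over the token sequences of all valid entity names: at each decoding step, only tokens that extend some valid name are allowed. A minimal, self-contained sketch of that lookup (the helper names and token ids are invented for illustration):

def build_trie(sequences):
    """Nested-dict trie over token id sequences."""
    trie = {}
    for seq in sequences:
        node = trie
        for token in seq:
            node = node.setdefault(token, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Tokens that keep the generated prefix a valid entity name."""
    node = trie
    for token in prefix:
        if token not in node:
            return []  # prefix does not match any valid name
        node = node[token]
    return list(node.keys())

# toy token ids for three entity names sharing a common prefix
trie = build_trie([[101, 7, 42], [101, 7, 13], [55, 7, 42]])
print(allowed_next_tokens(trie, [101, 7]))  # -> [42, 13]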

For end-to-end entity linking, GENRE re-generates the input text annotated with a markup that marks each mention and the entity it resolves to.
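For illustration, a hedged sketch of what the annotated output looks like (the input sentence is invented; the {mention} [entity] markup format follows the paper):

model.sample(
    sentences=["In 1921, Einstein received the Nobel Prize in Physics."]
)
# -> 'In 1921, {Einstein} [Albert Einstein] received the Nobel Prize in Physics.'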

GENRE achieves state-of-the-art results on multiple datasets.

Main dependencies

  • python>=3.7
  • pytorch>=1.6
  • fairseq>=0.10 (for training; optional for inference) NOTE: fairseq is undergoing changes without backward compatibility. Install fairseq from source and use this commit for reproducibility. See here for the current PR that should fix fairseq/master.
  • transformers>=4.2 (optional for inference)
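A hedged sketch of a typical installation for inference (version pins follow the list above; the exact fairseq commit referenced above is not reproduced here):

pip install "torch>=1.6" "transformers>=4.2"
# for training, install fairseq from source, pinned to the commit mentioned above
git clone https://github.com/pytorch/fairseq
cd fairseq && pip install --editable .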

Usage

See the examples on how to use GENRE with both pytorch/fairseq and huggingface/transformers.

Generally, after importing and loading the model, you would generate predictions (in this example for Entity Disambiguation) with a simple call like:

# model loading follows this repository's examples; the path assumes the
# Entity Disambiguation archive below was unpacked under models/
from genre.fairseq_model import GENRE
model = GENRE.from_pretrained("models/fairseq_entity_disambiguation_aidayago").eval()

model.sample(
    sentences=[
        "[START_ENT] Armstrong [END_ENT] was the first man on the Moon."
    ]
)
[[{'text': 'Neil Armstrong', 'logprob': tensor(-0.1443)},
  {'text': 'William Armstrong', 'logprob': tensor(-1.4650)},
  {'text': 'Scott Armstrong', 'logprob': tensor(-1.7311)},
  {'text': 'Arthur Armstrong', 'logprob': tensor(-1.7356)},
  {'text': 'Rob Armstrong', 'logprob': tensor(-1.7426)}]]

NOTE: we used fairseq for all experiments in the paper. The huggingface/transformers models were obtained with a conversion script similar to this. Therefore, results might differ.

Models

Use the link above to download the models in .tar.gz format, then run tar -zxvf <FILENAME> to uncompress them. Alternatively, use this script to download all of them.

Entity Disambiguation

Training Dataset | pytorch / fairseq | huggingface / transformers
BLINK | fairseq_entity_disambiguation_blink | hf_entity_disambiguation_blink
BLINK + AidaYago2 | fairseq_entity_disambiguation_aidayago | hf_entity_disambiguation_aidayago

End-to-End Entity Linking

Training Dataset | pytorch / fairseq | huggingface / transformers
WIKIPEDIA | fairseq_e2e_entity_linking_wiki_abs | hf_e2e_entity_linking_wiki_abs
WIKIPEDIA + AidaYago2 | fairseq_e2e_entity_linking_aidayago | hf_e2e_entity_linking_aidayago

Document Retrieval

Training Dataset | pytorch / fairseq | huggingface / transformers
KILT | fairseq_wikipage_retrieval | hf_wikipage_retrieval

See here for examples of how to load the models and run inference.
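For instance, a minimal sketch of loading the Document Retrieval checkpoints in both flavours (module and class names as used in this repository's examples; the paths assume the archives above were unpacked under models/):

from genre.fairseq_model import GENRE as FairseqGENRE
from genre.hf_model import GENRE as HFGENRE

fairseq_model = FairseqGENRE.from_pretrained("models/fairseq_wikipage_retrieval").eval()
hf_model = HFGENRE.from_pretrained("models/hf_wikipage_retrieval").eval()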

Dataset

Use the link above to download the datasets. Alternatively, use this script to download all of them. These datasets (except the BLINK data) are a pre-processed version of the Phong Le and Ivan Titov (2018) data available here. The BLINK data was taken from here.

Entity Disambiguation (train / dev)

Entity Disambiguation (test)

Document Retrieval

  • KILT: for these datasets, please follow the download instructions on the KILT repository.

Pre-processing

To pre-process a KILT-formatted dataset into the source and target files expected by fairseq, use

python scripts/convert_kilt_to_fairseq.py $INPUT_FILENAME $OUTPUT_FOLDER

Then, to tokenize and binarize them as expected by fairseq, use

./preprocess_fairseq.sh $DATASET_PATH $MODEL_PATH

Note that this requires the fairseq source code to be downloaded in the same folder as the GENRE repository (see here).
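Putting the two steps together, a hypothetical run could look like this (the dataset file and BART model paths are illustrative):

python scripts/convert_kilt_to_fairseq.py data/nq-train-kilt.jsonl data/nq
./preprocess_fairseq.sh data/nq models/bart.large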

Trie from KILT Wikipedia titles

We also release the BPE prefix tree (trie) from KILT Wikipedia titles (kilt_titles_trie_dict.pkl), which is based on the 2019/08/01 Wikipedia dump, downloadable in its raw format here. The trie contains ~5M titles and is used to generate entities for all the KILT experiments.
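A minimal sketch of loading the released trie (the Trie helper is from this repository; the path assumes the pickle was downloaded to the working directory):

import pickle

from genre.trie import Trie

with open("kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# the trie can then restrict beam search to valid titles, e.g. with a model
# loaded as in the Usage section above:
#   model.sample(sentences, prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()))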

Troubleshooting

If the module cannot be found, preface the python command with PYTHONPATH=.
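For example:

PYTHONPATH=. python scripts/convert_kilt_to_fairseq.py $INPUT_FILENAME $OUTPUT_FOLDER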

Licence

GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.
