facebookresearch / GENRE
The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in PyTorch.
@inproceedings{de2020autoregressive,
title={Autoregressive Entity Retrieval},
author={Nicola De Cao and Gautier Izacard and Sebastian Riedel and Fabio Petroni},
booktitle={International Conference on Learning Representations},
url={https://openreview.net/forum?id=5k8F6UU39V},
year={2021}
}
Please consider citing our work if you use code from this repository.
In a nutshell, GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking) based on a fine-tuned BART architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. Here is an example of generation for Wikipedia page retrieval for open-domain question answering:
For end-to-end entity linking, GENRE re-generates the input text annotated with a markup:
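In this markup, each mention is wrapped in braces and immediately followed by its entity name in brackets, e.g. `{ Armstrong } [ Neil Armstrong ]`. A small parser for annotated outputs might look like the sketch below; the exact whitespace conventions are an assumption here.

```python
import re

# Sketch of a parser for GENRE-style end-to-end entity-linking markup,
# where each mention is annotated as "{ mention } [ entity ]".
# The exact spacing conventions are an assumption.
MARKUP = re.compile(r"\{ (.+?) \} \[ (.+?) \]")

def parse_annotations(text):
    """Return the (mention, entity) pairs found in the annotated text."""
    return MARKUP.findall(text)

annotated = "{ Armstrong } [ Neil Armstrong ] was the first man on the Moon."
print(parse_annotations(annotated))  # -> [('Armstrong', 'Neil Armstrong')]
```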
GENRE achieves state-of-the-art results on multiple datasets.
Main dependencies
- python>=3.7
- pytorch>=1.6
- fairseq>=0.10 (for training; optional for inference). NOTE: fairseq is undergoing changes without backward compatibility. Install fairseq from source and use this commit for reproducibility. See here for the current PR that should fix fairseq/master.
- transformers>=4.2 (optional for inference)
Usage
See examples on how to use GENRE for both pytorch fairseq and huggingface transformers:
Generally, after importing and loading the model, you would generate predictions (in this example for Entity Disambiguation) with a simple call like:
```python
model.sample(
    sentences=["[START_ENT] Armstrong [END_ENT] was the first man on the Moon."]
)
```

which returns:

```python
[[{'text': 'Neil Armstrong', 'logprob': tensor(-0.1443)},
  {'text': 'William Armstrong', 'logprob': tensor(-1.4650)},
  {'text': 'Scott Armstrong', 'logprob': tensor(-1.7311)},
  {'text': 'Arthur Armstrong', 'logprob': tensor(-1.7356)},
  {'text': 'Rob Armstrong', 'logprob': tensor(-1.7426)}]]
```
NOTE: we used fairseq for all experiments in the paper. The huggingface/transformers models were obtained with a conversion script similar to this, so results might differ.
Models
Use the link above to download models in .tar.gz format, and then run tar -zxvf <FILENAME> to uncompress them. Alternatively, use this script to download all of them.
Entity Disambiguation
End-to-End Entity Linking
| Training Dataset | pytorch / fairseq | huggingface / transformers |
|---|---|---|
| WIKIPEDIA | fairseq_e2e_entity_linking_wiki_abs | hf_e2e_entity_linking_wiki_abs |
| WIKIPEDIA + AidaYago2 | fairseq_e2e_entity_linking_aidayago | hf_e2e_entity_linking_aidayago |
Document Retrieval
| Training Dataset | pytorch / fairseq | huggingface / transformers |
|---|---|---|
| KILT | fairseq_wikipage_retrieval | hf_wikipage_retrieval |
See here for examples of how to load the models and run inference.
Dataset
Use the link above to download datasets. Alternatively, use this script to download all of them. These datasets (except the BLINK data) are a pre-processed version of the Phong Le and Ivan Titov (2018) data available here. The BLINK data is taken from here.
Entity Disambiguation (train / dev)
- BLINK train (9,000,000 lines, 11GiB)
- BLINK dev (10,000 lines, 13MiB)
- AIDA-YAGO2 train (18,448 lines, 56MiB)
- AIDA-YAGO2 dev (4,791 lines, 15MiB)
Entity Disambiguation (test)
- ACE2004 (257 lines, 850KiB)
- AQUAINT (727 lines, 2.0MiB)
- AIDA-YAGO2 (4,485 lines, 14MiB)
- MSNBC (656 lines, 1.9MiB)
- WNED-CWEB (11,154 lines, 38MiB)
- WNED-WIKI (6,821 lines, 19MiB)
Document Retrieval
- KILT: for these datasets, please follow the download instructions in the KILT repository.
Pre-processing
To pre-process a KILT-formatted dataset into the source and target files expected by fairseq, use

```shell
python scripts/convert_kilt_to_fairseq.py $INPUT_FILENAME $OUTPUT_FOLDER
```

Then, to tokenize and binarize them as expected by fairseq, use

```shell
./preprocess_fairseq.sh $DATASET_PATH $MODEL_PATH
```

Note that this requires the fairseq source code to be downloaded in the same folder as the genre repository (see here).
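As a rough sketch of what the conversion step produces: each KILT record pairs an input string with one or more outputs, and these are split into parallel source and target lists (one file each for fairseq). The function below assumes the standard KILT jsonl schema with an `input` field and `output` entries carrying an `answer`; the repository's convert_kilt_to_fairseq.py handles more cases and is the authoritative implementation.

```python
import json

def kilt_to_source_target(jsonl_lines):
    """Split KILT-formatted jsonl records into parallel source/target lists.

    Assumes the standard KILT schema: each record has an "input" string and
    an "output" list whose entries may carry an "answer" string. This is a
    simplified sketch, not the repository's actual conversion script.
    """
    sources, targets = [], []
    for line in jsonl_lines:
        record = json.loads(line)
        for output in record.get("output", []):
            if "answer" in output:
                sources.append(record["input"])
                targets.append(output["answer"])
    return sources, targets

lines = [json.dumps({
    "input": "who was the first man on the moon",
    "output": [{"answer": "Neil Armstrong"}],
})]
src, tgt = kilt_to_source_target(lines)
print(src, tgt)  # -> ['who was the first man on the moon'] ['Neil Armstrong']
```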
Trie from KILT Wikipedia titles
We also release the BPE prefix tree (trie) built from KILT Wikipedia titles (kilt_titles_trie_dict.pkl), based on the 2019/08/01 Wikipedia dump, downloadable in its raw format here. The trie contains ~5M titles and is used to generate entities for all the KILT experiments.
Troubleshooting
If the module cannot be found, preface the python command with PYTHONPATH=.
License
GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.