
memray / OpenNMT-kpg-release

License: MIT
Keyphrase Generation

Programming Languages

Jupyter Notebook, Python, Shell, Perl, Emacs Lisp, Smalltalk

Projects that are alternatives of or similar to OpenNMT-kpg-release

perke
A keyphrase extractor for Persian
Stars: ✭ 60 (-62.96%)
Mutual labels:  keyword, keyphrase
keyphrase-generation-rl
Code for the ACL 19 paper "Neural Keyphrase Generation via Reinforcement Learning with Adaptive Rewards"
Stars: ✭ 95 (-41.36%)
Mutual labels:  keyphrase-generation, pytorch-implementation
kg one2set
Code for our ACL 2021 paper "One2Set: Generating Diverse Keyphrases as a Set"
Stars: ✭ 58 (-64.2%)
Mutual labels:  keyphrase-generation, pytorch-implementation
nvae
An unofficial toy implementation of NVAE: "A Deep Hierarchical Variational Autoencoder"
Stars: ✭ 83 (-48.77%)
Mutual labels:  pytorch-implementation
Magic-VNet
VNet for 3d volume segmentation
Stars: ✭ 45 (-72.22%)
Mutual labels:  pytorch-implementation
SPAN
Semantics-guided Part Attention Network (ECCV 2020 Oral)
Stars: ✭ 19 (-88.27%)
Mutual labels:  pytorch-implementation
SpinNet
[CVPR 2021] SpinNet: Learning a General Surface Descriptor for 3D Point Cloud Registration
Stars: ✭ 181 (+11.73%)
Mutual labels:  pytorch-implementation
WinForms-KWAssistant
A tool that searches keywords on Baidu and farms clicks
Stars: ✭ 16 (-90.12%)
Mutual labels:  keyword
gan-vae-pretrained-pytorch
Pretrained GANs + VAEs + classifiers for MNIST/CIFAR in pytorch.
Stars: ✭ 134 (-17.28%)
Mutual labels:  pytorch-implementation
AdaSpeech
AdaSpeech: Adaptive Text to Speech for Custom Voice
Stars: ✭ 108 (-33.33%)
Mutual labels:  pytorch-implementation
ViT-V-Net for 3D Image Registration Pytorch
Vision Transformer for 3D medical image registration (Pytorch).
Stars: ✭ 169 (+4.32%)
Mutual labels:  pytorch-implementation
PyTorch
An open source deep learning platform that provides a seamless path from research prototyping to production deployment
Stars: ✭ 17 (-89.51%)
Mutual labels:  pytorch-implementation
ake-datasets
Large, curated set of benchmark datasets for evaluating automatic keyphrase extraction algorithms.
Stars: ✭ 125 (-22.84%)
Mutual labels:  keyphrase-generation
depth-map-prediction
Pytorch Implementation of Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
Stars: ✭ 78 (-51.85%)
Mutual labels:  pytorch-implementation
tldr
TLDR is an unsupervised dimensionality reduction method that combines neighborhood embedding learning with the simplicity and effectiveness of recent self-supervised learning losses
Stars: ✭ 95 (-41.36%)
Mutual labels:  pytorch-implementation
Deep-MVLM
A tool for precisely placing 3D landmarks on 3D facial scans based on the paper "Multi-view Consensus CNN for 3D Facial Landmark Placement"
Stars: ✭ 71 (-56.17%)
Mutual labels:  pytorch-implementation
Pruning filters for efficient convnets
PyTorch implementation of "Pruning Filters For Efficient ConvNets"
Stars: ✭ 96 (-40.74%)
Mutual labels:  pytorch-implementation
TailCalibX
Pytorch implementation of Feature Generation for Long-Tail Classification by Rahul Vigneswaran, Marc T Law, Vineeth N Balasubramaniam and Makarand Tapaswi
Stars: ✭ 32 (-80.25%)
Mutual labels:  pytorch-implementation
DocTr
The official code for “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction”, ACM MM, Oral Paper, 2021.
Stars: ✭ 202 (+24.69%)
Mutual labels:  pytorch-implementation
deep-keyphrase
Seq2seq-based keyphrase generation models, including CopyRNN, CopyCNN and CopyTransformer
Stars: ✭ 51 (-68.52%)
Mutual labels:  keyphrase-generation

Keyphrase Generation (built on OpenNMT-py)

This repository provides the code and datasets used in "One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases" and "Does Order Matter? An Empirical Study on Generating Multiple Keyphrases as a Sequence".

All datasets and selected model checkpoints from the papers can be downloaded here (data.zip and models.zip). Unzip data.zip and models.zip and overwrite the original data/ and model/ folders.

Update (April 2022)

Several pretrained checkpoints are available in Hugging Face model repos. The accompanying paper will be available soon.

Two examples:

  • Run KPG inference with Hugging Face Transformers (note that the --source_prefix value must be quoted so the shell does not treat the angle brackets as redirections; a standalone sketch follows this list)
CUDA_VISIBLE_DEVICES=0 python onmt/keyphrase/run_infer_hfkpg.py --config_name memray/bart_wikikp --model_name_or_path memray/bart_wikikp --tokenizer_name memray/bart_wikikp --dataset_name midas/duc2001 --do_predict --output_dir kp_output/duc2001/ --overwrite_output_dir --per_device_eval_batch_size 8 --predict_with_generate --text_column document --keyphrase_column extractive_keyphrases --source_prefix "<present>10<header>5<category>5<seealso>2<infill>0<s>" --num_beams 5 --generation_max_length 60
  • Run KPG inference with OpenNMT-kpg (parameters are hard-coded)
CUDA_VISIBLE_DEVICES=0 python onmt/keyphrase/kpg_example_hfdatasets.py
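For a quick sanity check without the repo's runner scripts, the sketch below calls the checkpoint through the generic Transformers seq2seq API. It assumes memray/bart_wikikp loads as a standard BART-style model; the control prefix is copied from the command above, and how generated phrases are delimited in the output is an assumption to verify against the model card.

```python
# Hedged sketch: direct generation with the memray/bart_wikikp checkpoint.
# Assumes it loads as a standard seq2seq model via the Auto classes.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "memray/bart_wikikp"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Control prefix copied from the command above.
prefix = "<present>10<header>5<category>5<seealso>2<infill>0<s>"
doc = "Keyphrase generation aims to summarize a document with a set of phrases."
inputs = tokenizer(prefix + doc, return_tensors="pt", truncation=True)

outputs = model.generate(**inputs, num_beams=5, max_length=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```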

Update (Jan 2022)

Merged with OpenNMT v2 and integrated a new pre-processing pipeline. Training and inference can now load JSON data directly from disk, without any manual tokenization or conversion to tensor files (see the loading sketch after the list below).

  • Paper datasets and DUC (download): KP20k/Inspec/Krapivin/NUS/SemEval2010/DUC2001.
  • 4 large annotated datasets (download): KP20k, OpenKP, KPTimes+JPTimes, StackExchange.
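As an illustration, the released test sets are JSON files that can be inspected directly; the sketch below loads the KP20k test file mentioned later in this README using the datasets library. Field names vary across datasets, so print a record first rather than assuming a schema.

```python
# Hedged sketch: peek at a released JSON test set with the `datasets` library.
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={"test": "data/json/kp20k/kp20k_test_meng17token.json"},
)
print(ds["test"][0].keys())  # inspect the actual fields before use
```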

The following config examples can help you get started (a trimmed, hypothetical skeleton appears below):

  • Configs using RoBERTa subword tokenization. Vocab (including dict.txt/merges.txt/vocab.json/tokenizer.json) can be found here.
  • Configs using word tokenization. Vocab (magkp20k.vocab.json, 50k most frequent words in KP20k and MagKP) can be found here.

All shared resources are placed here. Please note that hf_vocab.tar.gz contains the vocab for subword tokenization (the RoBERTa vocab plus new special tokens such as <present>), while magkp20k.vocab.json is for the earlier word-tokenization-based models (the top 50k frequent words in MagCS and KP20k).
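For orientation only, here is a heavily trimmed sketch of what such a training config may look like, restricted to generic OpenNMT-py v2 options; all paths are hypothetical and this fork's KP-specific keys are omitted, so treat config/transfer_kp/train/*.yml as the source of truth.

```yaml
# Hypothetical skeleton using generic OpenNMT-py v2 options only;
# see config/transfer_kp/train/*.yml for the real, complete configs.
data:
  corpus_1:
    path_src: data/json/kp20k/train_src.txt   # hypothetical path
    path_tgt: data/json/kp20k/train_tgt.txt   # hypothetical path
src_vocab: vocab/magkp20k.vocab.json          # hypothetical path
save_model: models/kp20k_one2seq              # hypothetical path
encoder_type: transformer
decoder_type: transformer
layers: 6
heads: 8
word_vec_size: 512
rnn_size: 512
optim: adam
learning_rate: 0.0001
train_steps: 100000
world_size: 1
gpu_ranks: [0]
```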

Quickstart

All config files used for training and evaluation are in the folder config/. For more examples, refer to the scripts in the folder script/.

Train a One2Seq model

python train.py -config config/transfer_kp/train/transformer-presabs-kp20k.yml

Train a One2One model

python train.py -config config/transfer_kp/train/transformer-one2one-kp20k.yml

Run generation and evaluation

# beam search (beamwidth=50)
python kp_gen_eval_transfer.py -config config/transfer_kp/infer/keyphrase-one2seq.yml -tasks pred eval -data_dir kp/data/kp/json/ -exp_root_dir kp/exps/transformer_exp_devbest/ -gpu 0 -batch_size 16 -beam_size 50 -max_length 40 -testsets kp20k openkp kptimes jptimes stackex kp20k_valid2k openkp_valid2k kptimes_valid2k jptimes_valid2k stackex_valid2k duc -splits test --data_format jsonl

# greedy decoding (beamwidth=1)
python kp_gen_eval_transfer.py -config config/transfer_kp/infer/keyphrase-one2seq.yml -tasks pred eval -data_dir kp/data/kp/json/ -exp_root_dir kp/exps/transformer_exp_devbest/ -gpu 0 -batch_size 16 -beam_size 1 -max_length 40 -testsets kp20k openkp kptimes jptimes stackex kp20k_valid2k openkp_valid2k kptimes_valid2k jptimes_valid2k stackex_valid2k duc -splits test --data_format jsonl

Evaluation and Datasets

You may refer to notebook/json_process.ipynb for a glance at the pre-processing.

We follow the data pre-processing and evaluation protocols of Meng et al. (2017). We pre-process both document texts and ground-truth keyphrases, including word segmentation, lowercasing, and replacing all digits with the symbol <digit>.
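For illustration, a minimal re-implementation of these steps might look like the following sketch; it captures the described behavior (lowercase, split off punctuation, map digit tokens to <digit>) but is not the repository's exact tokenizer.

```python
# Hedged sketch of Meng et al. (2017)-style pre-processing, as described above.
import re

def meng17_tokenize(text):
    # lowercase, separate words from punctuation, then map digit tokens
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return ["<digit>" if tok.isdigit() else tok for tok in tokens]

print(meng17_tokenize("Deep Keyphrase Generation (ACL 2017)."))
# -> ['deep', 'keyphrase', 'generation', '(', 'acl', '<digit>', ')', '.']
```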

We manually clean the data examples in the valid/test set of KP20k (clean noisy text, replace erroneous keyphrases with actual author keyphrases, remove examples without any ground-truth keyphrases) and use scripts to remove invalid training examples (without any author keyphrase).

We evaluate models' performance on predicting present and absent phrases separately. Specifically, we first tokenize, lowercase and stem (using the Porter Stemmer of NLTK) the text, then we determine the presence of each ground-truth keyphrase by checking whether its words can be found verbatim in the source text.
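A sketch of this matching rule follows, assuming NLTK's Porter stemmer; the repository's exact implementation may differ in tokenization details.

```python
# Hedged sketch: decide whether a ground-truth keyphrase is "present",
# i.e. whether its stemmed words occur contiguously, in order, in the
# stemmed source tokens.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def is_present(src_tokens, kp_tokens):
    src = [stemmer.stem(t.lower()) for t in src_tokens]
    kp = [stemmer.stem(t.lower()) for t in kp_tokens]
    return any(src[i:i + len(kp)] == kp
               for i in range(len(src) - len(kp) + 1))
```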

To evaluate present-phrase performance, we compute Precision/Recall/F1-score for each document, taking only the present ground-truth keyphrases as targets and ignoring the absent ones. We report scores macro-averaged over documents that have at least one present ground-truth phrase (corresponding to the column #PreDoc in the table below); absent-phrase evaluation is handled analogously.

[Figure: definitions of the precision, recall and F1 metrics]

where #(pred) and #(target) are the number of predicted and ground-truth keyphrases respectively; and #(correct@k) is the number of correct predictions among the first k results.
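For reference, here is a hedged reconstruction of the standard cutoff-based definitions these symbols suggest; the original figure may normalize precision by k rather than min(k, #(pred)), so verify against the papers.

```latex
P@k = \frac{\#(\mathrm{correct}@k)}{\min\!\bigl(k,\ \#(\mathrm{pred})\bigr)}, \qquad
R@k = \frac{\#(\mathrm{correct}@k)}{\#(\mathrm{target})}, \qquad
F_1@k = \frac{2 \cdot P@k \cdot R@k}{P@k + R@k}
```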

We clarify that, since our study mainly focuses on keyword/keyphrase extraction/generation for short text, we only used the abstracts of SemEval and NUS as source text. Therefore statistics such as #PreKP may differ from those computed with full text, which also affects the final F1-scores. For ease of reproduction, we post the detailed statistics in the table below; the processed test sets with present/absent phrase splits can be found in the released data (e.g. data/json/kp20k/kp20k_test_meng17token.json).

| Dataset  | #Train | #Valid | #Test  | #KP     | #PreDoc | #PreKP | #AbsDoc | #AbsKP |
|----------|--------|--------|--------|---------|---------|--------|---------|--------|
| KP20k    | 514k   | 19,992 | 19,987 | 105,181 | 19,048  | 66,595 | 16,357  | 38,586 |
| Inspec   | --     | 1,500  | 500    | 4,913   | 497     | 3,858  | 381     | 1,055  |
| Krapivin | --     | 1,844  | 460    | 2,641   | 437     | 1,485  | 417     | 1,156  |
| NUS      | --     | --     | 211    | 2,461   | 207     | 1,263  | 195     | 1,198  |
| SemEval  | --     | 144    | 100    | 1,507   | 100     | 671    | 99      | 836    |
| StackEx  | 298k   | 16,000 | 16,000 | 43,131  | 13,475  | 24,809 | 10,984  | 18,322 |
| DUC      | --     | --     | 308    | 2,484   | 308     | 2,421  | 38      | 63     |

Contributors

Major contributors are:

Citation

Please cite the following papers if you are interested in using our code and datasets.

@inproceedings{yuan2018onesizenotfit,
  title={One Size Does Not Fit All: Generating and Evaluating Variable Number of Keyphrases},
  author={Yuan, Xingdi and Wang, Tong and Meng, Rui and Thaker, Khushboo and He, Daqing and Trischler, Adam},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  url={https://arxiv.org/pdf/1810.05241.pdf},
  year={2020}
}
@article{meng2019ordermatters,
  title={Does Order Matter? An Empirical Study on Generating Multiple Keyphrases as a Sequence},
  author={Meng, Rui and Yuan, Xingdi and Wang, Tong and Brusilovsky, Peter and Trischler, Adam and He, Daqing},
  journal={arXiv preprint arXiv:1909.03590},
  url={https://arxiv.org/pdf/1909.03590.pdf},
  year={2019}
}
@inproceedings{meng2017kpgen,
  title={Deep keyphrase generation},
  author={Meng, Rui and Zhao, Sanqiang and Han, Shuguang and He, Daqing and Brusilovsky, Peter and Chi, Yu},
  booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={582--592},
  url={https://arxiv.org/pdf/1704.06879.pdf},
  year={2017}
}