boudinfl / ake-datasets

Licence: Apache-2.0 license
Large, curated set of benchmark datasets for evaluating automatic keyphrase extraction algorithms.

Benchmark datasets for keyphrase extraction

This repository contains a large, curated set of benchmark datasets for evaluating automatic keyphrase extraction algorithms. These datasets are all pre-processed using the Stanford CoreNLP suite and are available in XML format.

Dataset format

All datasets are stored according to the following, common structure:

dataset/
       /test/       <- test documents
       /train/      <- training documents (if available)
       /dev/        <- validation documents (if available)
       /src/        <- everything used to build the dataset
       /references/ <- reference keyphrases in json format

Larger datasets (such as KP20k and KPTimes) should be downloaded and preprocessed using the material in their src/ directory.
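
As a rough illustration of the layout above, the following sketch counts the documents found in each split of one dataset directory. The function name `count_documents` and the `*.xml` file pattern are assumptions made for this example (the documents are described as XML, but the exact file extension is not stated here).

```python
from pathlib import Path

# Split directories named in the common structure above.
SPLITS = ("train", "dev", "test")


def count_documents(dataset_dir, pattern="*.xml"):
    """Return {split: number of matching files} for the splits that exist.

    `pattern` is an assumption: adjust it if the documents use another
    extension. Splits that are not present (train and dev are optional)
    are simply left out of the result.
    """
    root = Path(dataset_dir)
    counts = {}
    for split in SPLITS:
        split_dir = root / split
        if split_dir.is_dir():
            counts[split] = sum(1 for _ in split_dir.glob(pattern))
    return counts
```

Calling `count_documents` on a dataset directory that only ships a test split would return a dictionary with a single "test" key.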

Reference (gold annotation) format

Reference keyphrases, used for evaluating automatic keyphrase extraction algorithms, are available in json format and named according to the following rules: [split].[annotator].[stem]?.json

where

  • split corresponds to the dataset split: test, train, dev or valid
  • annotator is the type of annotation: author, reader, editor, combined, contr (controlled vocabulary), uncontr (free annotation)
  • stem (optional) indicates that stemming (using the NLTK Porter stemmer) has been applied to the reference keyphrases.
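
The naming rule above can be checked mechanically. The sketch below parses a reference file name into its parts; it assumes that the optional stem component appears literally as `stem` in the file name (for example `test.author.stem.json`), and the helper name `parse_reference_name` is made up for this example.

```python
import re

# [split].[annotator].[stem]?.json, with the values listed above.
# Assumption: the optional component is the literal token "stem".
REFERENCE_NAME = re.compile(
    r"^(?P<split>test|train|dev|valid)"
    r"\.(?P<annotator>author|reader|editor|combined|contr|uncontr)"
    r"(?P<stem>\.stem)?"
    r"\.json$"
)


def parse_reference_name(filename):
    """Split a reference file name into split, annotator and stemming flag."""
    match = REFERENCE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a reference file name: {filename!r}")
    return {
        "split": match.group("split"),
        "annotator": match.group("annotator"),
        "stemmed": match.group("stem") is not None,
    }
```

A name that does not follow the rule raises `ValueError`, which makes it easy to spot unexpected files in a references/ directory.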

Below is an example of the reference file format:

{
    "doc-1": [
        [
            "target detect"
        ],
        [
            "number of sensor",
            "sensor number"
        ]
    ],
    ...
}
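
A minimal sketch of how such a file might be consumed is given below. It assumes that each inner list groups alternative forms of a single reference keyphrase, so a candidate matching any form counts once; this reading is inferred from the example rather than stated. The names `load_references` and `count_matched` are hypothetical.

```python
import json


def load_references(path):
    """Load a reference file into a dict: document id -> list of keyphrases."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_matched(candidates, keyphrases):
    """Count reference keyphrases matched by at least one candidate.

    `keyphrases` is the list stored under one document id; each element is
    a list of forms, assumed here to be variants of the same keyphrase.
    """
    candidate_set = set(candidates)
    return sum(
        1 for variants in keyphrases if candidate_set.intersection(variants)
    )


# Same content as the example above, without the elided entries.
sample = json.loads(
    '{"doc-1": [["target detect"], ["number of sensor", "sensor number"]]}'
)
print(count_matched(["sensor number", "wireless network"], sample["doc-1"]))  # 1
```

When the stemmed references are used, candidates would need to be stemmed the same way before comparison.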

Available datasets

dataset                 lang   nature       train    dev     test      Annotation   #kp (test)  #words (test)
CSTR [1]                en     Full papers  130      -       500       A            5.4         11501.4
NUS [3]                 en     Full papers  -        -       211       A+R          11.0        8398.3
PubMed [5]              en     Full papers  -        -       1320      A            5.4         5322.9
ACM [6]                 en     Full papers  -        -       2304      A            5.3         9197.6
Citeulike-180 [13]      en     Full papers  -        -       182       R            5.4         8589.7
SemEval-2010 [10]       en     Full papers  144      -       100       A+R          14.7        7961.2
KP20k [15]              en     Abstracts    527,090  20,000  20,000    A            5.3         176
Inspec [2]              en     Abstracts    1000     500     500       I (uncontr)  9.8         134.6
TALN-Archives [14]      en/fr  Abstracts    -        -       521/1207  A            4.0/4.1     123.1/141.0
KDD [9]                 en     Abstracts    -        -       755       A            4.1         190.7
WWW [9]                 en     Abstracts    -        -       1330      A            4.8         163.5
TermITH-Eval [11]       fr     Abstracts    -        -       400       I            11.8        164.7
KPTimes [16]            en     News         259,923  10,000  20,000    E            5.0         921
DUC-2001 [4]            en     News         -        -       308       R            8.1         847.2
500N-KPCrowd [7]        en     News         450      -       50        R            46.2        465.3
110-PT-BN-KP [12]       pt     News         100      -       10        R            27.6        439.4
Wikinews-Keyphrase [8]  fr     News         -        -       100       R            9.7         313.6

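Gold keyphrase annotations are provided by authors (A), readers (R), editors (E) or professional indexers (I). The #kp and #words columns give the average number of keyphrases and words per document in the test split.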

References

  1. KEA: Practical automatic keyphrase extraction. Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill-Manning, C. G. In Proceedings of the fourth ACM conference on Digital libraries. pp. 254-255. 1999.

  2. Improved automatic keyword extraction given more linguistic knowledge. Anette Hulth. In Proceedings of EMNLP 2003. pp. 216-223.

  3. Keyphrase Extraction in Scientific Publications. Thuy Dung Nguyen and Min-Yen Kan. In Proceedings of International Conference on Asian Digital Libraries 2007. pp. 317-326.

  4. Single Document Keyphrase Extraction Using Neighborhood Knowledge. Xiaojun Wan and Jianguo Xiao. In Proceedings of AAAI 2008. pp. 855-860.

  5. Keyphrase extraction from single documents in the open domain exploiting linguistic and statistical methods. Alexander Thorsten Schutz. Master's thesis, National University of Ireland (2008).

  6. Large dataset for keyphrases extraction. Krapivin, M., Autayeu, A., & Marchese, M. (2009). University of Trento.

  7. Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization. Marujo, L., Gershman, A., Carbonell, J., Frederking, R., & Neto, J. P. In Proceedings of LREC 2012.

  8. TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. Adrien Bougouin, Florian Boudin, Béatrice Daille. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), 2013.

  9. Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach. Cornelia Caragea, Florin Bulgarov, Andreea Godea and Sujatha Das Gollapalli. In Proceedings of EMNLP 2014. pp. 1435-1446.

  10. How Document Pre-processing affects Keyphrase Extraction Performance. Florian Boudin, Hugo Mougard and Damien Cram. COLING 2016 Workshop on Noisy User-generated Text (WNUT).

  11. TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation. Adrien Bougouin, Sabine Barreaux, Laurent Romary, Florian Boudin and Béatrice Daille. Language Resources and Evaluation Conference (LREC), 2016.

  12. Keyphrase Cloud Generation of Broadcast News. Luis Marujo, Márcio Viveiros, João Paulo da Silva Neto. In Proceedings of Interspeech 2011.

  13. Human-competitive tagging using automatic keyphrase extraction. O. Medelyan, E. Frank, I. H. Witten. In Proceedings of EMNLP 2009.

  14. TALN Archives: a digital archive of French research articles in Natural Language Processing. Florian Boudin. In Proceedings of TALN 2013.

  15. Deep Keyphrase Generation. R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky and Y. Chi. In Proceedings of ACL 2017.

  16. KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents. Y. Gallina, F. Boudin and B. Daille. In Proceedings of INLG 2019.
