asahi417 / kex

Licence: MIT license
Kex is a python library for unsupervised keyword extraction from a document, providing an easy interface and benchmarks on 15 public datasets.

KEX

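Kex is a Python library for unsupervised keyword extraction. It provides an easy interface to a set of built-in algorithms, benchmarks on 15 public datasets, and an API for implementing custom extractors.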

Our paper was accepted at the EMNLP 2021 main conference 🎉 (the camera-ready version is available).
The paper proposes three new algorithms (LexSpec, LexRank, TFIDFRank) and conducts an extensive comparison and analysis of existing keyword extraction algorithms alongside the proposed methods. Our algorithms are simple and fast to compute, yet they establish very strong baselines across the datasets (the best MRR and Precision@5 averaged over all datasets). TFIDFRank is based on the SingleRank algorithm but uses TFIDF as the population term. LexSpec and LexRank are based on lexical specificity, for which we provide a short introduction because it is less widely known than TFIDF. To reproduce all the results in the paper, please follow the instructions provided with the project.
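
As a rough illustration of lexical specificity, the sketch below uses the common hypergeometric formulation: the score of a word is the negative log of the probability of seeing it at least as often as observed in a document, if the document were a random sample from a reference corpus. The function name and the argument names (T, t, F, f) are assumptions made for this example and are not part of kex's API; the paper remains the reference for the exact definition used by LexSpec and LexRank.

```python
import math


def lexical_specificity(T: int, t: int, F: int, f: int) -> float:
    """-log10 of P(X >= f), X ~ Hypergeometric(population T, successes F, draws t).

    T: total number of tokens in the reference corpus
    t: number of tokens in the document (a sub-corpus)
    F: occurrences of the word in the reference corpus
    f: occurrences of the word in the document
    """
    upper = min(F, t)
    if not 0 <= f <= upper:
        raise ValueError("f must satisfy 0 <= f <= min(F, t)")
    # Number of ways to draw t tokens containing at least f occurrences of the word.
    favourable = sum(
        math.comb(F, x) * math.comb(T - F, t - x) for x in range(f, upper + 1)
    )
    probability = favourable / math.comb(T, t)
    return -math.log10(probability)


# Toy numbers: a word seen twice in a 10-token corpus, both times in a 5-token document.
print(lexical_specificity(T=10, t=5, F=2, f=2))
```

A higher value means the word is more specific to the document than its corpus frequency would suggest.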

Get Started

Install via pip

pip install kex

Extract Keywords with Kex

The built-in algorithms in kex include SingleRank, TF, TFIDF, TFIDFRank, LexSpec, LexRank, TopicalPageRank, and SingleTPR.

Basic usage:

>>> import kex
>>> model = kex.SingleRank()  # any algorithm listed above
>>> sample = '''
We propose a novel unsupervised keyphrase extraction approach that filters candidate keywords using outlier detection.
It starts by training word embeddings on the target document to capture semantic regularities among the words. It then
uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the
assumption that these vectors come from the same distribution, indicative of their irrelevance to the semantics
expressed by the dimensions of the learned vector representation. Candidate keyphrases only consist of words that are
detected as outliers of this dominant distribution. Empirical results show that our approach outperforms state
of-the-art and recent unsupervised keyphrase extraction methods.
'''
>>> model.get_keywords(sample, n_keywords=2)
[{'stemmed': 'non-keyphras word vector',
  'pos': 'ADJ NOUN NOUN',
  'raw': ['non-keyphrase word vectors'],
  'offset': [[47, 49]],
  'count': 1,
  'score': 0.06874471825637762,
  'n_source_tokens': 112},
 {'stemmed': 'semant regular word',
  'pos': 'ADJ NOUN NOUN',
  'raw': ['semantic regularities words'],
  'offset': [[28, 32]],
  'count': 1,
  'score': 0.06001468574146248,
  'n_source_tokens': 112}]
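
The returned value is a plain list of dictionaries, so it can be post-processed with ordinary Python. The sketch below reuses the output shown above as literal data (so it runs without kex installed) and prints one line per keyword. The helper name `summarize` is hypothetical, and reading 'raw' as the list of surface forms is an inference from the example output rather than a documented guarantee.

```python
# Output copied from the basic usage example above.
keywords = [
    {'stemmed': 'non-keyphras word vector', 'pos': 'ADJ NOUN NOUN',
     'raw': ['non-keyphrase word vectors'], 'offset': [[47, 49]],
     'count': 1, 'score': 0.06874471825637762, 'n_source_tokens': 112},
    {'stemmed': 'semant regular word', 'pos': 'ADJ NOUN NOUN',
     'raw': ['semantic regularities words'], 'offset': [[28, 32]],
     'count': 1, 'score': 0.06001468574146248, 'n_source_tokens': 112},
]


def summarize(keywords):
    """Return (first surface form, score) pairs, highest score first."""
    ranked = sorted(keywords, key=lambda k: k['score'], reverse=True)
    return [(k['raw'][0], k['score']) for k in ranked]


for phrase, score in summarize(keywords):
    print(f'{score:.4f}  {phrase}')
```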

Compute a statistical prior

Algorithms such as TF, TFIDF, TFIDFRank, LexSpec, LexRank, TopicalPageRank, and SingleTPR rely on a prior distribution, which must be computed beforehand from a collection of documents:

>>> import kex
>>> model = kex.SingleTPR()
>>> test_sentences = ['documentA', 'documentB', 'documentC']
>>> model.train(test_sentences, export_directory='./tmp')

Priors are cached in the export directory and can be loaded on the fly:

>>> import kex
>>> model = kex.SingleTPR()
>>> model.load('./tmp')
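
To give an intuition for why a prior has to be computed from a corpus first, here is a minimal, self-contained sketch of a TF-IDF style prior: document frequencies are collected once, then reused to score new documents. This is not kex's implementation; the function names `fit_prior` and `score`, the whitespace tokenizer, and the smoothing formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def fit_prior(corpus):
    """Compute a smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each token once per document.
        doc_freq.update(set(doc.lower().split()))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    # Tokens never seen in the corpus get the largest possible weight.
    unseen = math.log(1 + n_docs) + 1.0
    return idf, unseen


def score(document, idf, unseen):
    """Weight each token of a new document by term frequency times the stored IDF."""
    tf = Counter(document.lower().split())
    return {tok: count * idf.get(tok, unseen) for tok, count in tf.items()}


corpus = ['the cat sat', 'the dog sat', 'the bird flew']
idf, unseen = fit_prior(corpus)
print(score('the cat purred', idf, unseen))
```

The prior only depends on the corpus, which is why it makes sense to compute it once and cache it.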

Supported language

Currently the algorithms are available only for English, but we plan to relax this constraint soon so that other languages can be supported.

Benchmark on 15 Public Datasets

Users can fetch 15 public keyword extraction datasets via kex.get_benchmark_dataset.

>>> import kex
>>> json_line, language = kex.get_benchmark_dataset('Inspec')
>>> json_line[0]
{
    'keywords': ['kind infer', 'type check', 'overload', 'nonstrict pure function program languag', ...],
    'source': 'A static semantics for Haskell\nThis paper gives a static semantics for Haskell 98, a non-strict ...',
    'id': '1053.txt'
}

Please take a look at the example script for running a benchmark on these datasets.
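
For orientation, the two metrics mentioned earlier (Precision@5 and MRR) can be computed per document roughly as follows. This is a sketch under assumptions, not the project's benchmark script: it assumes predictions are compared with the gold 'keywords' by exact match on stemmed strings, the helper names are hypothetical, and the predicted list below is made up for the example.

```python
def precision_at_k(predicted, gold, k=5):
    """Fraction of the top-k predictions found in the gold set (denominator is k)."""
    gold = set(gold)
    return sum(p in gold for p in predicted[:k]) / k


def reciprocal_rank(predicted, gold):
    """1 / rank of the first correct prediction, or 0.0 if none is correct."""
    gold = set(gold)
    for rank, p in enumerate(predicted, start=1):
        if p in gold:
            return 1.0 / rank
    return 0.0


# Gold keywords taken from the dataset entry above; predictions are invented.
gold = ['kind infer', 'type check', 'overload']
predicted = ['static semant', 'type check', 'haskel', 'overload', 'paper']

print(precision_at_k(predicted, gold, k=5))
print(reciprocal_rank(predicted, gold))
```

Averaging the reciprocal rank over all documents of a dataset gives the MRR.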

Implement Custom Extractor with Kex

We provide an API to run a basic pipeline for preprocessing, by which one can implement a custom keyword extractor.

import kex


class CustomExtractor:
    """Custom keyword extractor example: First N keywords extractor."""

    def __init__(self, maximum_word_number: int = 3):
        """First N keywords extractor."""
        self.phrase_constructor = kex.PhraseConstructor(maximum_word_number=maximum_word_number)

    def get_keywords(self, document: str, n_keywords: int = 10):
        """Get keywords.

        Parameter
        ------------------
        document: str
        n_keywords: int

        Return
        ------------------
        A list of dictionaries with the keys 'stemmed', 'pos', 'raw', 'offset', and 'count', e.g.
        {'stemmed': 'grid comput', 'pos': 'ADJ NOUN', 'raw': ['grid computing'], 'offset': [[11, 12]], 'count': 1}
        """
        # The stemmed tokens are returned as well, but this extractor does not need them.
        phrase_instance, _ = self.phrase_constructor.tokenize_and_stem_and_phrase(document)
        # Order candidate phrases by the start offset of their first occurrence.
        sorted_phrases = sorted(phrase_instance.values(), key=lambda x: x['offset'][0][0])
        # Slicing already stops at the end of the list.
        return sorted_phrases[:n_keywords]
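
The ordering step of this extractor can be tried without kex by feeding it hand-written candidates. The dictionary below is hypothetical data in the format described in the docstring; that the phrase container is a dict keyed by stemmed phrase is an assumption inferred from the call to `.values()`.

```python
# Hypothetical candidate phrases, keyed by stemmed form.
phrase_instance = {
    'keyword extract': {'stemmed': 'keyword extract', 'pos': 'NOUN NOUN',
                        'raw': ['keyword extraction'], 'offset': [[7, 8]], 'count': 1},
    'grid comput': {'stemmed': 'grid comput', 'pos': 'ADJ NOUN',
                    'raw': ['grid computing', 'grid computing'],
                    'offset': [[2, 3], [11, 12]], 'count': 2},
    'python librari': {'stemmed': 'python librari', 'pos': 'NOUN NOUN',
                       'raw': ['python library'], 'offset': [[0, 1]], 'count': 1},
}


def first_n(phrase_instance, n_keywords):
    """Return the n_keywords phrases whose first occurrence starts earliest."""
    ordered = sorted(phrase_instance.values(), key=lambda p: p['offset'][0][0])
    return ordered[:n_keywords]


print([p['stemmed'] for p in first_n(phrase_instance, 2)])
```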

Reference paper

If you use any of these resources, please cite the following paper:

@inproceedings{ushio-etal-2021-kex,
    title={{B}ack to the {B}asics: {A} {Q}uantitative {A}nalysis of {S}tatistical and {G}raph-{B}ased {T}erm {W}eighting {S}chemes for {K}eyword {E}xtraction},
    author={Ushio, Asahi and Liberatore, Federico and Camacho-Collados, Jose},
    booktitle={Proceedings of the {EMNLP} 2021 Main Conference},
    year={2021},
    publisher={Association for Computational Linguistics}
}