All Projects → SkBlaz → Rakun

SkBlaz / Rakun

Licence: gpl-3.0
Rank-based Unsupervised Keyword Extraction via Metavertex Aggregation

Programming Languages

c
50402 projects - #5 most used programming language

Projects that are alternatives of or similar to Rakun

Simclr
PyTorch implementation of SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
Stars: ✭ 750 (+1288.89%)
Mutual labels:  unsupervised-learning
Discogan Pytorch
PyTorch implementation of "Learning to Discover Cross-Domain Relations with Generative Adversarial Networks"
Stars: ✭ 961 (+1679.63%)
Mutual labels:  unsupervised-learning
Student Teacher Anomaly Detection
Student–Teacher Anomaly Detection with Discriminative Latent Embeddings
Stars: ✭ 43 (-20.37%)
Mutual labels:  unsupervised-learning
Variational Autoencoder
Variational autoencoder implemented in tensorflow and pytorch (including inverse autoregressive flow)
Stars: ✭ 807 (+1394.44%)
Mutual labels:  unsupervised-learning
Domain Transfer Network
TensorFlow Implementation of Unsupervised Cross-Domain Image Generation
Stars: ✭ 850 (+1474.07%)
Mutual labels:  unsupervised-learning
Iva
IVA: Independent Vector Analysis implementation
Stars: ✭ 35 (-35.19%)
Mutual labels:  unsupervised-learning
Context Encoder
[CVPR 2016] Unsupervised Feature Learning by Image Inpainting using GANs
Stars: ✭ 731 (+1253.7%)
Mutual labels:  unsupervised-learning
Voxelmorph
Unsupervised Learning for Image Registration
Stars: ✭ 1,057 (+1857.41%)
Mutual labels:  unsupervised-learning
Transferlearning
Transfer learning / domain adaptation / domain generalization / multi-task learning etc. Papers, codes, datasets, applications, tutorials.-迁移学习
Stars: ✭ 8,481 (+15605.56%)
Mutual labels:  unsupervised-learning
Susi
SuSi: Python package for unsupervised, supervised and semi-supervised self-organizing maps (SOM)
Stars: ✭ 42 (-22.22%)
Mutual labels:  unsupervised-learning
Bagofconcepts
Python implementation of bag-of-concepts
Stars: ✭ 18 (-66.67%)
Mutual labels:  unsupervised-learning
Summary loop
Codebase for the Summary Loop paper at ACL2020
Stars: ✭ 26 (-51.85%)
Mutual labels:  unsupervised-learning
Gdynet
Unsupervised learning of atomic scale dynamics from molecular dynamics.
Stars: ✭ 37 (-31.48%)
Mutual labels:  unsupervised-learning
Minisom
🔴 MiniSom is a minimalistic implementation of the Self Organizing Maps
Stars: ✭ 801 (+1383.33%)
Mutual labels:  unsupervised-learning
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-20.37%)
Mutual labels:  unsupervised-learning
Awesome Artificial Intelligence
A curated list of Artificial Intelligence (AI) courses, books, video lectures and papers.
Stars: ✭ 6,516 (+11966.67%)
Mutual labels:  unsupervised-learning
Uc Davis Cs Exams Analysis
📈 Regression and Classification with UC Davis student quiz data and exam data
Stars: ✭ 33 (-38.89%)
Mutual labels:  unsupervised-learning
Lir For Unsupervised Ir
This is an implementation for the CVPR2020 paper "Learning Invariant Representation for Unsupervised Image Restoration"
Stars: ✭ 53 (-1.85%)
Mutual labels:  unsupervised-learning
Php Ml
PHP-ML - Machine Learning library for PHP
Stars: ✭ 7,900 (+14529.63%)
Mutual labels:  unsupervised-learning
Unsuprevised seg via cnn
Stars: ✭ 38 (-29.63%)
Mutual labels:  unsupervised-learning

RaKUn algorithm

This is the official repository of RaKUn. This keyword detection algorithm exploits graph-based language representations for efficient denoising and keyword detection. Key ideas of RaKUn:
  1. Transform texts to graphs
  2. Prune graphs based on token similarity (meta vertex introduction)
  3. Rank nodes -> keywords

This repo now also includes extensive evaluation functionality, which can be of use to the community.

A note

RaKUn is extremely crude, and thus one of the fastest methods. To achieve sota performance, grid search across the hyperparameter space is suggested, as the parameters can vary a lot w.r.t. domain of the application.

The core functionality

Packed as a Python library, it can be installed via:

pip3 install mrakun

or

python3 setup.py install

Requirements

To install the required libraries, we suggest:

pip3 install -r requirements.txt

For visualization, one also needs Py3plex library (>=0.66). (pip3 install py3plex)

Tests

To test whether the core functionality is ok, you can run

python3 -m pytest tests/test_core_functionality.py

Note that you need Py3plex library for both tests to pass!

Usage with editdistance

Using RaKUn is simple! Simply call the main detector method with optional arguments (as described in the paper)

Let's as the first example consider a text block from the Wiki about Brexit.

from mrakun import RakunDetector
from nltk.corpus import stopwords


blob_of_text = "Brexit (/ˈbrɛksɪt, ˈbrɛɡzɪt/;[1] a portmanteau of \"British\" and \"exit\") is the scheduled withdrawal of the United Kingdom (UK) from the European Union (EU). Following a June 2016 referendum, in which 51.9% voted to leave, the UK government formally announced the country's withdrawal in March 2017, starting a two-year process that was due to conclude with the UK withdrawing on 29 March 2019. As the UK parliament thrice voted against the negotiated withdrawal agreement, that deadline has been extended twice, and is currently 31 October 2019.[2][3] An Act of Parliament requires the government to seek a third extension if no agreement is reached before 19 October. Withdrawal is advocated by Eurosceptics and opposed by pro-Europeanists, both of whom span the political spectrum. The UK joined the European Communities (EC) in 1973, with continued membership endorsed in a 1975 referendum. In the 1970s and 1980s, withdrawal from the EC was advocated mainly by the political left, e.g. in the Labour Party's 1983 election manifesto. From the 1990s, the eurosceptic wing of the Conservative Party grew, and led a rebellion over ratification of the 1992 Maastricht Treaty that established the EU. In parallel with the UK Independence Party (UKIP), and the cross-party People's Pledge campaign, it pressured Conservative Prime Minister David Cameron to hold a referendum on continued EU membership. Cameron, who had campaigned to remain, resigned after the result and was succeeded by Theresa May. On 29 March 2017, the UK government invoked Article 50 of the Treaty on European Union, formally starting the withdrawal. May called a snap general election in June 2017, which resulted in a Conservative minority government supported by the Democratic Unionist Party. UK–EU withdrawal negotiations began later that month. The UK negotiated to leave the EU customs union and single market. This resulted in the November 2018 withdrawal agreement, but the UK parliament voted against ratifying it three times. The Labour Party wanted any agreement to maintain a customs union, while many Conservatives opposed the agreement's financial settlement on the UK's share of EU financial obligations, as well as the Irish backstop designed to prevent border controls in Ireland. The Liberal Democrats, Scottish National Party and others seek to reverse Brexit through a second referendum. The EU has declined a re-negotiation that omits the backstop. In March 2019, the UK parliament voted for May to ask the EU to delay Brexit until October. Having failed to pass her agreement, May resigned as Prime Minister in July and was succeeded by Boris Johnson. He sought to replace parts of the agreement and vowed to leave the EU by the new deadline, with or without an agreement."

hyperparameters = {"distance_threshold":2,
                   "distance_method": "editdistance",
                   "num_keywords" : 10,
                   "pair_diff_length":2,
                   "stopwords" : stopwords.words('english'),
                   "bigram_count_threshold":2,
                   "num_tokens":[1,2],
		   "max_similar" : 3, ## n most similar can show up n times
		   "max_occurrence" : 3} ## maximum frequency overall

keyword_detector = RakunDetector(hyperparameters)
keywords = keyword_detector.find_keywords(blob_of_text, input_type = "text")
print(keywords)
keyword_detector.visualize_network()

The keywords identified are:

[('withdrawal', 0.24997620407386256), ('party', 0.1800875690081858), ('agreement', 0.1702598515134209), ('voted', 0.12316771368741672), ('advocated', 0.10108509423186751), ('referendum', 0.08980582524271845), ('political', 0.08756900818579859), ('formally', 0.08614125261755187), ('union', 0.08457072149248049), ('uk–eu', 0.08397582333904435)]

Visualized as:

Keyword graph

Other use:

from mrakun import RakunDetector
from nltk.corpus import stopwords

hyperparameters = {"distance_threshold":3,
                   "num_keywords" : 10,
                   "distance_method": "editdistance",
                   "pair_diff_length":2,
                   "stopwords" : stopwords.words('english'),
                   "bigram_count_threshold":2,
                   "num_tokens":[1]}

keyword_detector = RakunDetector(hyperparameters)
example_data = "./datasets/wiki20/docsutf8/7183.txt"
keywords = keyword_detector.find_keywords(example_data)
print(keywords)

keyword_detector.verbose = False

Two results are returned. First one are keywords with corresponding centrality scores, e.g.,

[('knowledge', 0.09681271338743186), ('system', 0.07564796507791872), ('level', 0.05109912821797258), ('model', 0.04258209551663402), ('domain', 0.04148756282878477), ('task', 0.03965601439030798), ('structure', 0.03819960122342131), ('1989', 0.03287462707574183), ('diagnosis', 0.03236736673125384), ('method', 0.030969444684564095)]

And the plot of the keyword graph:

## once the find_keywords() was invoked
keyword_detector.visualize_network()

Keyword graph

Hyperparameter explanation

Consider the following hyperparameters:

hyperparameters = {"distance_threshold":3,
                   "num_keywords" : 10,
                   "distance_method": "editdistance",
                   "pair_diff_length":2,
                   "stopwords" : stopwords.words('english'),
                   "bigram_count_threshold":2,
                   "num_tokens":[1],
		   "max_similar": 2,
		   "max_occurrence": 2}
Hyperparameter Description
distance_threshold Distance between tokens that initiates the merge process (if more similar than this, the tokens are merged)
num_keywords The number of keywords to be detected
pair_diff_length If the difference in the length of the two tokens is smaller than this parameter, the tokens are considered for merging.
num_tokens The number of tokens that can constitute a keyword (1,2 or 3). This is a list, as e.g., [1,2] implies tokens of length 1 as well as pairs of tokens are considered as keywords.
max_similar How many similar keywords are permitted. For example, "british vote" and "british parliament" would be considered similar (overlap of at least one token). This way, too similar keywords can be prunned out.
max_occurrence How many of the most common keywords are to be considered during max_similar prunning step. Intuitively, only a handful of keywords are worthy of prunning out if appearing too much (e.g., "parliament").
lemmatizer The e.g., NLTK lemmatizer instance -> this lemmatizes tokens during graph construction.

Usage with fasttext

Using RaKUn with fasttext requires pretrained emmbeding model. Download .bin file from https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md for chosen language and save it.

from mrakun import RakunDetector
from nltk.corpus import stopwords

hyperparameters = {"distance_threshold":0.2,
                   "distance_method": "fasttext",
                   "pretrained_embedding_path": '../pretrained_models/fasttext/wiki.en.bin', #change path accordingly
                   "num_keywords" : 10,
                   "pair_diff_length":2,
                   "stopwords" : stopwords.words('english'),
                   "bigram_count_threshold":2,
                   "num_tokens":[1]}

keyword_detector = RakunDetector(hyperparameters)
example_data = "./datasets/wiki20/docsutf8/7183.txt"
keywords = keyword_detector.find_keywords(example_data)
print(keywords)

keyword_detector.verbose = False

Two results are returned. First one are keywords with corresponding centrality scores, e.g.,

[('structure', 0.013043478260869565), ('model', 0.010507246376811594), ('level', 0.009420289855072464), ('knowledge', 0.006400966183574879), ('system', 0.005595813204508856), ('application', 0.004589371980676328), ('inference', 0.002657004830917874), ('reasoning', 0.002536231884057971), ('abstraction', 0.0020933977455716585), ('problem', 0.0020933977455716585)]

And the plot of the keyword graph:

## once the find_keywords() was invoked
keyword_detector.visualize_network()

Keyword graph

Acknowledgements

The logo was created by Luka Skrlj

Citation

@InProceedings{10.1007/978-3-030-31372-2_26,
author="{\v{S}}krlj, Bla{\v{z}}
and Repar, Andra{\v{z}}
and Pollak, Senja",
editor="Mart{\'i}n-Vide, Carlos
and Purver, Matthew
and Pollak, Senja",
title="RaKUn: Rank-based Keyword Extraction via Unsupervised Learning and Meta Vertex Aggregation",
booktitle="Statistical Language and Speech Processing",
year="2019",
publisher="Springer International Publishing",
address="Cham",
pages="311--323",
abstract="Keyword extraction is used for summarizing the content of a document and supports efficient document retrieval, and is as such an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure applied to graphs derived from a given text can be used to efficiently identify and rank keywords. Introducing meta vertices (aggregates of existing vertices) and systematic redundancy filters, the proposed method performs on par with state-of-the-art for the keyword extraction task on 14 diverse datasets. The proposed method is unsupervised, interpretable and can also be used for document visualization.",
isbn="978-3-030-31372-2"
}

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].