
crabcamp / Lexrank

License: MIT
LexRank algorithm for text summarization

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Lexrank

Textrank
🌀 ⚡️ 🌍 TextRank (automatic text summarization) for PHP8
Stars: ✭ 193 (+78.7%)
Mutual labels:  algorithm, summarization
Sumy
Module for automatic summarization of text documents and HTML pages.
Stars: ✭ 2,705 (+2404.63%)
Mutual labels:  summary, summarization
Delaunator
An incredibly fast JavaScript library for Delaunay triangulation of 2D points
Stars: ✭ 1,641 (+1419.44%)
Mutual labels:  algorithm
Xseries
Library for cross-version Minecraft Bukkit support and various efficient API methods.
Stars: ✭ 109 (+0.93%)
Mutual labels:  algorithm
Ml Ai Experiments
All my experiments with AI and ML
Stars: ✭ 107 (-0.93%)
Mutual labels:  text
Render
Go package for easily rendering JSON, XML, binary data, and HTML templates responses.
Stars: ✭ 1,562 (+1346.3%)
Mutual labels:  text
Hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Stars: ✭ 108 (+0%)
Mutual labels:  algorithm
Spring Summary
A summary of Spring's core technologies, based on the book Toby's Spring 3.1 and Baek Ki-sun's lectures.
Stars: ✭ 106 (-1.85%)
Mutual labels:  summary
Basicknowledge
Data struct, algorithm, LeetCode and DesignPattern introduction and implementation in Cpp and C#
Stars: ✭ 109 (+0.93%)
Mutual labels:  algorithm
Word Wrap
Wrap words to a specified length.
Stars: ✭ 107 (-0.93%)
Mutual labels:  text
Swarmz
A free, header-only C++ swarming (flocking) library for real-time applications
Stars: ✭ 108 (+0%)
Mutual labels:  algorithm
Any Angle Pathfinding
A collection of algorithms used for any-angle pathfinding with visualisations.
Stars: ✭ 107 (-0.93%)
Mutual labels:  algorithm
Go Algorithms
Algorithms and data structures for golang
Stars: ✭ 1,529 (+1315.74%)
Mutual labels:  algorithm
Render
Universal data-driven template for generating textual output, as a static binary and a library
Stars: ✭ 108 (+0%)
Mutual labels:  text
Codelibrary
💎Collection of algorithms and data structures
Stars: ✭ 1,585 (+1367.59%)
Mutual labels:  algorithm
What I Have Read
Paper Lists, Notes and Slides, Focus on NLP. For summarization, please refer to https://github.com/xcfcode/Summarization-Papers
Stars: ✭ 110 (+1.85%)
Mutual labels:  summarization
Three Spritetext
A sprite based text component for ThreeJS
Stars: ✭ 106 (-1.85%)
Mutual labels:  text
Minperf
A Minimal Perfect Hash Function Library
Stars: ✭ 107 (-0.93%)
Mutual labels:  algorithm
Leetcode
LeetCode Solutions: A Record of My Problem Solving Journey.
Stars: ✭ 45,650 (+42168.52%)
Mutual labels:  algorithm
Mygo
Go implementations of LeetCode and 剑指offer (2nd edition) solutions 😀 Come join us!🤝❤️👻
Stars: ✭ 109 (+0.93%)
Mutual labels:  algorithm

lexrank

LexRank algorithm for text summarization

.. image:: https://travis-ci.org/wikibusiness/lexrank.svg?branch=dev
   :target: https://travis-ci.org/wikibusiness/lexrank

.. image:: https://badge.fury.io/py/lexrank.svg
   :target: https://badge.fury.io/py/lexrank

Info

LexRank is an unsupervised approach to text summarization based on graph-based centrality scoring of sentences. The main idea is that sentences "recommend" other, similar sentences to the reader. Thus, if one sentence is very similar to many others, it is likely to be a sentence of great importance. The importance of this sentence also stems from the importance of the sentences "recommending" it. Thus, to be ranked highly and placed in a summary, a sentence must be similar to many sentences that are in turn also similar to many other sentences. This makes intuitive sense and allows the algorithm to be applied to arbitrary new text.
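
To make the ranking intuition concrete, the toy sketch below illustrates the underlying idea (it is not the package's actual implementation): a small, made-up similarity matrix for three hypothetical sentences is treated as a graph, turned into a Markov chain, and scored by power iteration.

.. code-block:: python

import numpy as np

# Hypothetical similarity matrix for three sentences: the first two are
# highly similar to each other, the third is an outlier.
similarity = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

# Row-normalise to obtain a Markov chain and approximate its stationary
# distribution by power iteration; these values act as centrality scores.
transition = similarity / similarity.sum(axis=1, keepdims=True)
scores = np.full(len(similarity), 1.0 / len(similarity))
for _ in range(50):
    scores = scores @ transition

print(scores)  # the two mutually similar sentences receive the highest scores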

Installation

.. code-block:: shell

pip install lexrank

Usage

In the following example we use the `BBC news dataset <http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip>`_ as a corpus of documents.

.. code-block:: python

from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS
from path import Path

documents = []
documents_dir = Path('bbc/politics')

for file_path in documents_dir.files('*.txt'):
    with file_path.open(mode='rt', encoding='utf-8') as fp:
        documents.append(fp.readlines())

lxr = LexRank(documents, stopwords=STOPWORDS['en'])

sentences = [
    'One of David Cameron\'s closest friends and Conservative allies, '
    'George Osborne rose rapidly after becoming MP for Tatton in 2001.',

    'Michael Howard promoted him from shadow chief secretary to the '
    'Treasury to shadow chancellor in May 2005, at the age of 34.',

    'Mr Osborne took a key role in the election campaign and has been at '
    'the forefront of the debate on how to deal with the recession and '
    'the UK\'s spending deficit.',

    'Even before Mr Cameron became leader the two were being likened to '
    'Labour\'s Blair/Brown duo. The two have emulated them by becoming '
    'prime minister and chancellor, but will want to avoid the spats.',

    'Before entering Parliament, he was a special adviser in the '
    'agriculture department when the Tories were in government and later '
    'served as political secretary to William Hague.',

    'The BBC understands that as chancellor, Mr Osborne, along with the '
    'Treasury will retain responsibility for overseeing banks and '
    'financial regulation.',

    'Mr Osborne said the coalition government was planning to change the '
    'tax system \"to make it fairer for people on low and middle '
    'incomes\", and undertake \"long-term structural reform\" of the '
    'banking sector, education and the welfare state.',
]

# get summary with classical LexRank algorithm
summary = lxr.get_summary(sentences, summary_size=2, threshold=.1)
print(summary)

# ['Mr Osborne said the coalition government was planning to change the tax '
#  'system "to make it fairer for people on low and middle incomes", and '
#  'undertake "long-term structural reform" of the banking sector, education and '
#  'the welfare state.',
#  'The BBC understands that as chancellor, Mr Osborne, along with the Treasury '
#  'will retain responsibility for overseeing banks and financial regulation.']


# get summary with continuous LexRank
summary_cont = lxr.get_summary(sentences, threshold=None)
print(summary_cont)

# ['The BBC understands that as chancellor, Mr Osborne, along with the Treasury '
#  'will retain responsibility for overseeing banks and financial regulation.']

# get LexRank scores for sentences
# 'fast_power_method' speeds up the calculation, but requires more RAM
scores_cont = lxr.rank_sentences(
    sentences,
    threshold=None,
    fast_power_method=False,
)
print(scores_cont)

#  [1.0896493024505858,
#  0.9010711968859021,
#  1.1139166497016315,
#  0.8279523250808547,
#  0.8112028559566362,
#  1.185228912485382,
#  1.0709787574388283]

Stop words for 22 languages are included in the package. To define your own mapping of stop words, prepare UTF-8 encoded text files in which words are separated by newlines. Then use the command

.. code-block:: bash

lexrank_assemble_stopwords --source_dir directory_with_txt_files

which replaces the default mapping. Note that the names of the .txt files are used as keys in the STOPWORDS dictionary.
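
Once assembled, the custom keys can be used just like the built-in ones. A minimal sketch, assuming a hypothetical de.txt file was supplied to the command above:

.. code-block:: python

from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS

# 'de' is a hypothetical key: it exists only if a file named de.txt
# was passed to lexrank_assemble_stopwords beforehand.
documents = [['Ein kurzer Beispielsatz.', 'Noch ein Satz für das Korpus.']]
lxr = LexRank(documents, stopwords=STOPWORDS['de'])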

Customization

The straightforward implementation of the LexRank algorithm described above may be insufficient for modern NLP tasks: it cannot be used with sentence embeddings or with custom tf-idf vectors prepared by third-party software. Therefore we provide a core function that works directly with a similarity matrix of sentences.

.. code-block:: python

import numpy as np
from lexrank import degree_centrality_scores

similarity_matrix = np.array(
    [
        [1.00, 0.17, 0.02, 0.03, 0.00, 0.01, 0.00, 0.17, 0.03, 0.00, 0.00],
        [0.17, 1.00, 0.32, 0.19, 0.02, 0.03, 0.03, 0.04, 0.01, 0.02, 0.01],
        [0.02, 0.32, 1.00, 0.13, 0.02, 0.02, 0.05, 0.05, 0.01, 0.03, 0.02],
        [0.03, 0.19, 0.13, 1.00, 0.05, 0.05, 0.19, 0.06, 0.05, 0.06, 0.03],
        [0.00, 0.02, 0.02, 0.05, 1.00, 0.33, 0.09, 0.05, 0.03, 0.03, 0.06],
        [0.01, 0.03, 0.02, 0.05, 0.33, 1.00, 0.09, 0.04, 0.06, 0.08, 0.04],
        [0.00, 0.03, 0.05, 0.19, 0.09, 0.09, 1.00, 0.05, 0.01, 0.01, 0.01],
        [0.17, 0.04, 0.05, 0.06, 0.05, 0.04, 0.05, 1.00, 0.04, 0.05, 0.04],
        [0.03, 0.01, 0.01, 0.05, 0.03, 0.06, 0.01, 0.04, 1.00, 0.20, 0.24],
        [0.00, 0.02, 0.03, 0.06, 0.03, 0.08, 0.01, 0.05, 0.20, 1.00, 0.10],
        [0.00, 0.01, 0.02, 0.03, 0.06, 0.04, 0.01, 0.04, 0.24, 0.10, 1.00],
    ],
)

# scores calculated with classical LexRank algorithm
degree_centrality_scores(similarity_matrix, threshold=.1)

# array([0.66666667, 1.        , 1.11111111, 1.22222222, 1.11111111,
#        1.11111111, 0.77777778, 1.22222222, 0.88888889, 1.        ,
#        0.88888889])

# scores by continuous LexRank
degree_centrality_scores(similarity_matrix, threshold=None)

# array([0.86714443, 1.11576626, 1.01267916, 1.11576626, 1.01874311,
#        1.06119074, 0.9277839 , 0.96416759, 1.01874311, 0.95810364,
#        0.9399118 ])

The function :code:`degree_centrality_scores` takes a similarity matrix as input, so it is not restricted to NLP: it can be applied to any objects for which a proper way of measuring their similarity exists. When creating a custom :code:`similarity_matrix`, make sure that all of its values are in the range [0, 1].
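
For example, the similarity matrix can be built from sentence embeddings. The sketch below uses random vectors as hypothetical placeholders for real embeddings, computes cosine similarities, clips them into [0, 1] and scores them with continuous LexRank:

.. code-block:: python

import numpy as np
from lexrank import degree_centrality_scores

# Hypothetical placeholder: in practice these vectors would come from any
# sentence-embedding model of your choice.
embeddings = np.random.rand(5, 32)

# Cosine similarity between all pairs of sentences; values are clipped so
# that the matrix satisfies the [0, 1] requirement stated above.
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity_matrix = np.clip(normalized @ normalized.T, 0.0, 1.0)

# continuous LexRank scores
scores = degree_centrality_scores(similarity_matrix, threshold=None)
print(scores)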

Tests

Tests are not supplied with the package; to run them, clone the repository and install the additional development dependencies.

.. code-block:: bash

# ensure virtualenv is activated
make install-dev

Run the linter and tests

.. code-block:: bash

make lint
make test

References

Güneş Erkan and Dragomir R. Radev: `LexRank: Graph-based Lexical Centrality as Salience in Text Summarization <http://www.jair.org/papers/paper1523.html>`_.
