andrewtavis / kwx

Licence: BSD-3-Clause License
BERT, LDA, and TFIDF based keyword extraction in Python

kwx is a toolkit for multilingual keyword extraction based on Google's BERT, Latent Dirichlet Allocation and Term Frequency Inverse Document Frequency. The package provides a suite of methods to process texts of any language to varying degrees and then extract and analyze keywords from the created corpus (see kwx.languages for the various degrees of language support). A unique focus is allowing users to decide which words not to include in outputs, which helps keep results sensible and in line with user intuitions.

For a thorough overview of the process and techniques see the Google slides, and reference the documentation for explanations of the models and visualization methods.

Installation

kwx can be downloaded from PyPI via pip or sourced directly from this repository:

pip install kwx
git clone https://github.com/andrewtavis/kwx.git
cd kwx
python setup.py install
import kwx

Models

Implemented NLP modeling methods within kwx.model include:

• BERT

Bidirectional Encoder Representations from Transformers derives representations of words based on NLP models run over open-source Wikipedia data. These representations are then leveraged to derive corpus topics.

kwx uses sentence-transformers pretrained models. See their GitHub and documentation for the available models.

• LDA

Latent Dirichlet Allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.

Although not as computationally robust as some machine learning models, LDA provides quick results that are suitable for many applications. Specifically for keyword extraction, in most settings the results are similar to those of BERT in a fraction of the time.
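
The generative story behind LDA can be sketched in a few lines of plain Python. This is an illustration only, not how kwx or gensim fit a model: the two topics, their word probabilities, and the function name generate_document are all made up for the example.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
TOPICS = {
    "delays": {"delay": 0.5, "late": 0.3, "hour": 0.2},
    "service": {"service": 0.5, "customer": 0.3, "love": 0.2},
}


def generate_document(topic_mixture, num_words, rng):
    """Sample each word by first picking a topic, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(num_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)  # seeded so the sketch is reproducible
# A document that is 80% about delays and 20% about service.
doc = generate_document({"delays": 0.8, "service": 0.2}, num_words=10, rng=rng)
print(doc)
```

Fitting an LDA model runs this story in reverse: given only the observed words, it estimates the topics and each document's mixture over them.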

• TFIDF

The user can also compute Term Frequency Inverse Document Frequency keywords: words that are distinctive of one text body relative to another it is compared against. This is a useful baseline when a user has another text or text body to compare the target corpus against.
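
As a rough picture of the idea, the sketch below scores words in a target corpus against a single comparison corpus using a smoothed IDF. It is a simplified stand-in written for this example, not the vectorizer kwx uses; the function name tfidf_keywords and the sample texts are assumptions.

```python
import math
from collections import Counter


def tfidf_keywords(target_texts, comparison_texts, num_keywords=3):
    """Rank target words by term frequency weighted by a smoothed IDF."""
    # Treat each corpus as one large document of whitespace tokens.
    docs = [
        " ".join(target_texts).lower().split(),
        " ".join(comparison_texts).lower().split(),
    ]
    target_counts = Counter(docs[0])
    total = sum(target_counts.values())
    scores = {}
    for word, count in target_counts.items():
        doc_freq = sum(1 for doc in docs if word in doc)
        # Smoothed IDF: words found in both corpora get a lower weight.
        idf = math.log((1 + len(docs)) / (1 + doc_freq)) + 1
        scores[word] = (count / total) * idf
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:num_keywords]


target = ["flight delay again", "delay at the gate", "flight cancelled"]
comparison = ["the hotel was great", "great flight to the hotel"]
print(tfidf_keywords(target, comparison))
```

Here "delay" outranks "flight" even though both occur twice in the target, because "flight" also appears in the comparison corpus.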

• Word Frequency

Finally, a user can simply query the most common words from a text corpus. This method is used in kwx as a baseline to check model efficacy.
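
A frequency baseline of this kind needs very little code. The following is a minimal sketch using only the standard library; most_frequent_words is a hypothetical helper written for this example and is not part of kwx.

```python
from collections import Counter


def most_frequent_words(text_corpus, num_keywords=3, ignore_words=None):
    """Return the most common whitespace tokens, skipping ignored words."""
    ignored = set(ignore_words or [])
    counts = Counter(
        token
        for text in text_corpus
        for token in text.lower().split()
        if token not in ignored
    )
    return [word for word, _ in counts.most_common(num_keywords)]


corpus = ["late flight", "flight delay", "late flight again"]
print(most_frequent_words(corpus, num_keywords=2))
print(most_frequent_words(corpus, num_keywords=1, ignore_words=["flight"]))
```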

Usage

Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. examples/kw_extraction provides an example of how to use kwx by deriving keywords from tweets in the Kaggle Twitter US Airline Sentiment dataset.

The following outlines using kwx to derive keywords from a text corpus with prompt_remove_words as True (the user will be asked if some of the extracted words need to be removed):

• Text Cleaning

from kwx.utils import prepare_data

input_language = "english" # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

• Keyword Extraction

from kwx.model import extract_kws

num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass keywords for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT", # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=32,
)

The BERT keywords are:

['time', 'flight', 'plane', 'southwestair', 'ticket', 'cancel', 'united', 'baggage',
'love', 'virginamerica', 'service', 'customer', 'delay', 'late', 'hour']

Should words be removed [y/n]? y
Type or copy word(s) to be removed: southwestair, united, virginamerica

The new BERT keywords are:

['late', 'baggage', 'service', 'flight', 'time', 'love', 'book', 'customer',
'response', 'hold', 'hour', 'cancel', 'cancelled_flighted', 'delay', 'plane']

Should words be removed [y/n]? n

The model will be rerun until all words known to be unreasonable are removed for a suitable output. kwx.model.gen_files could also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).
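
The prompt-and-rerun cycle described above can be pictured as a simple loop. This is a minimal sketch rather than kwx's implementation: extract_with_prompt, toy_extractor, and the scripted answers are hypothetical stand-ins, and a real session would read the answers from input().

```python
from collections import Counter


def toy_extractor(text_corpus, num_keywords, ignore_words):
    """Hypothetical stand-in for a model: most frequent non-ignored words."""
    counts = Counter(
        token
        for text in text_corpus
        for token in text.split()
        if token not in ignore_words
    )
    return [word for word, _ in counts.most_common(num_keywords)]


def extract_with_prompt(text_corpus, num_keywords, ask):
    """Rerun extraction until `ask` returns no further words to remove."""
    ignore_words = set()
    while True:
        keywords = toy_extractor(text_corpus, num_keywords, ignore_words)
        to_remove = ask(keywords)  # parsed from input() in a real session
        if not to_remove:
            return keywords
        ignore_words.update(to_remove)


corpus = ["united flight delay", "united baggage delay", "flight baggage late"]
# Scripted answers: remove "united" on the first pass, accept the second.
answers = iter([["united"], []])
final_keywords = extract_with_prompt(corpus, 3, ask=lambda kws: next(answers))
print(final_keywords)
```

Because removed words free up slots, each rerun can surface keywords that did not appear in the previous list, as "book", "response", and "hold" do in the output above.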

Visuals

kwx.visuals includes the following functions for presenting and analyzing the results of keyword extraction:

• Topic Number Evaluation

A graph of topic coherence and overlap given a variable number of topics to derive keywords from.

from kwx.visuals import graph_topic_num_evals
import matplotlib.pyplot as plt

graph_topic_num_evals(
    method=["lda", "bert"],
    text_corpus=text_corpus,
    num_keywords=num_keywords,
    topic_nums_to_compare=list(range(5, 15)),
    metrics=True, #  stability and coherence
)
plt.show()
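
To make "overlap" concrete, one simple way to quantify it is the average pairwise Jaccard similarity between the keyword sets of the topics. Whether kwx computes overlap this way is not stated here; the function name mean_topic_overlap and the sample topics are assumptions for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared items divided by all items; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_topic_overlap(topics):
    """Average pairwise Jaccard similarity between topic keyword sets."""
    pairs = list(combinations([set(topic) for topic in topics], 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


distinct = [["flight", "delay"], ["baggage", "lost"], ["service", "customer"]]
redundant = [["flight", "delay"], ["flight", "late"], ["delay", "flight"]]

print(mean_topic_overlap(distinct))
print(mean_topic_overlap(redundant))
```

A lower value suggests the chosen number of topics yields more distinct keyword sets, which is the kind of signal a topic-number comparison looks for.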

• t-SNE

t-SNE allows the user to visualize their topic distribution in both two and three dimensions. Currently available just for LDA, this technique provides another check for model suitability.

from kwx.visuals import t_sne
import matplotlib.pyplot as plt

t_sne(
    dimension="both",  # 2d and 3d are options
    text_corpus=text_corpus,
    num_topics=10,
    remove_3d_outliers=True,
)
plt.show()

• pyLDAvis

pyLDAvis is included so that users can inspect LDA extracted topics, and further so that it can easily be generated for output files.

from kwx.visuals import pyLDAvis_topics

pyLDAvis_topics(
    method="lda",
    text_corpus=text_corpus,
    num_topics=10,
    display_ipython=False,  # For Jupyter integration
)

• Word Cloud

Word clouds via wordcloud are included for a basic representation of the text corpus, specifically as a way to convey basic visual information to potential stakeholders. examples/kw_extraction includes a word cloud generated from tweets of US air carrier passengers, produced with a call like the following:

from kwx.visuals import gen_word_cloud

ignore_words = ["words", "user", "knows", "they", "don't", "want"]

gen_word_cloud(
    text_corpus=text_corpus,
    ignore_words=ignore_words,  # pass the list defined above
    height=500,
)

To-Do

Please see the contribution guidelines if you are interested in contributing to this project.
