
MaartenGr / Bertopic

License: MIT
Leveraging BERT and c-TF-IDF to create easily interpretable topics.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Bertopic

Ldagibbssampling
Open Source Package for Gibbs Sampling of LDA
Stars: ✭ 218 (-70.74%)
Mutual labels:  topic-modeling, topic
Ldavis
R package for web-based interactive topic model visualization.
Stars: ✭ 466 (-37.45%)
Mutual labels:  topic-modeling
pydataberlin-2017
Repo for my talk at the PyData Berlin 2017 conference
Stars: ✭ 63 (-91.54%)
Mutual labels:  topic-modeling
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (-51.95%)
Mutual labels:  topic-modeling
latent-semantic-analysis
Pipeline for training LSA models using Scikit-Learn.
Stars: ✭ 20 (-97.32%)
Mutual labels:  topic-modeling
Pyshorttextcategorization
Various Algorithms for Short Text Mining
Stars: ✭ 429 (-42.42%)
Mutual labels:  topic-modeling
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-97.05%)
Mutual labels:  topic
Swoole Jobs
🚀 Dynamic multi-process worker queue based on Swoole, like Gearman but with high performance.
Stars: ✭ 574 (-22.95%)
Mutual labels:  topic
Wsify
Just a tiny, simple and real-time self-hosted pub/sub messaging service
Stars: ✭ 452 (-39.33%)
Mutual labels:  topic
Contextualized Topic Models
A python package to run contextualized topic modeling. CTMs combine BERT with topic models to get coherent topics. Also supports multilingual tasks. Cross-lingual Zero-shot model published at EACL 2021.
Stars: ✭ 318 (-57.32%)
Mutual labels:  topic-modeling
2018 Machinelearning Lectures Esa
Machine Learning Lectures at the European Space Agency (ESA) in 2018
Stars: ✭ 280 (-62.42%)
Mutual labels:  topic-modeling
Lda
LDA topic modeling for node.js
Stars: ✭ 262 (-64.83%)
Mutual labels:  topic-modeling
Corex topic
Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Stars: ✭ 439 (-41.07%)
Mutual labels:  topic-modeling
topicApp
A simple Shiny App for Topic Modeling in R
Stars: ✭ 40 (-94.63%)
Mutual labels:  topic-modeling
Paper Reading
Paper reading list in natural language processing, including dialogue systems and text generation related topics.
Stars: ✭ 508 (-31.81%)
Mutual labels:  topic-modeling
kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-95.57%)
Mutual labels:  topic-modeling
Hacker News Digest
📰 A responsive interface of Hacker News with summaries and thumbnails.
Stars: ✭ 278 (-62.68%)
Mutual labels:  topic
Guidedlda
Semi-supervised guided topic model with custom GuidedLDA
Stars: ✭ 390 (-47.65%)
Mutual labels:  topic-modeling
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (-4.03%)
Mutual labels:  topic-modeling
Bigartm
Fast topic modeling platform
Stars: ✭ 563 (-24.43%)
Mutual labels:  topic-modeling


BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. It even supports visualizations similar to LDAvis!

The corresponding Medium posts can be found here and here.

Installation

Installation can be done using PyPI:

pip install bertopic

To use the visualization options, install BERTopic as follows:

pip install bertopic[visualization]

To use Flair embeddings, install BERTopic as follows:

pip install bertopic[flair]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with the Google Colab notebook here.

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset, which is composed of English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probabilities = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access the frequent topics that were generated:

>>> topic_model.get_topic_freq().head()
Topic	Count
-1	7288
49	3992
30	701
27	684
11	568

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 49:

>>> topic_model.get_topic(49)
[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]

NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.
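
A minimal sketch of that option, reusing the docs from the Quick Start (the language flag selects a multilingual sentence-transformers model under the hood):

from bertopic import BERTopic

# "multilingual" loads an embedding model that covers 50+ languages
topic_model = BERTopic(language="multilingual")
topics, probabilities = topic_model.fit_transform(docs)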

Visualize Topics

After having trained our BERTopic model, we could iteratively go through perhaps a hundred topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()

Embedding Models

The parameter embedding_model takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

Sentence-Transformers
You can select any model from sentence-transformers here and pass it through BERTopic with embedding_model:

from bertopic import BERTopic
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model)

Flair
Flair allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:

from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)

You can select any 🤗 transformers model here.

Custom Embeddings
You can also use previously generated document embeddings by passing them to fit_transform():

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)
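
For instance, a minimal sketch that precomputes the embeddings with sentence-transformers (the model name is only an example; any array of shape (n_documents, n_dimensions) works):

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Compute the document embeddings once, outside of BERTopic
sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Pass the precomputed embeddings alongside the documents
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs, embeddings)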

Dynamic Topic Modeling

Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how a topic is represented across different times. Here, we will be using all of Donald Trump's tweets to see how he talked about certain topics over time:

import re
import pandas as pd

# Load the tweets and clean them: strip URLs, lowercase,
# drop @-mentions, and keep only alphabetic characters
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), axis=1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x: x[0] != "@", row.text.split())), axis=1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), axis=1)

# Keep only original (non-retweet), non-empty tweets
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

from bertopic import BERTopic

model = BERTopic(verbose=True)
topics, _ = model.fit_transform(tweets)

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this by calling topics_over_time and passing in the tweets, the corresponding timestamps, and the related topics:

topics_over_time = model.topics_over_time(tweets, topics, timestamps)

Finally, we can visualize the topics by simply calling visualize_topics_over_time():

model.visualize_topics_over_time(topics_over_time, top_n=6)

Overview

Methods                                       Code
Fit the model                                 topic_model.fit(docs)
Fit the model and predict documents           topic_model.fit_transform(docs)
Predict new documents                         topic_model.transform([new_doc])
Access single topic                           topic_model.get_topic(12)
Access all topics                             topic_model.get_topics()
Get topic frequency                           topic_model.get_topic_freq()
Get all topic information                     topic_model.get_topic_info()
Visualize topics                              topic_model.visualize_topics()
Visualize topic probability distribution      topic_model.visualize_distribution(probabilities[0])
Update topic representation                   topic_model.update_topics(docs, topics, n_gram_range=(1, 3))
Reduce number of topics                       topic_model.reduce_topics(docs, topics, nr_topics=30)
Find topics                                   topic_model.find_topics("vehicle")
Save model                                    topic_model.save("my_model")
Load model                                    BERTopic.load("my_model")
Get parameters                                topic_model.get_params()
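
As a short sketch of the save/load round trip from the table above, assuming topic_model is the fitted model from the Quick Start ("my_model" is just an example path):

from bertopic import BERTopic

# Persist the fitted model to disk ...
topic_model.save("my_model")

# ... and restore it later; the loaded model exposes the same methods
loaded_model = BERTopic.load("my_model")
print(loaded_model.get_topic(49))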

Citation

To cite BERTopic in your work, please use the following bibtex reference:

@misc{grootendorst2020bertopic,
  author       = {Maarten Grootendorst},
  title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.5.0},
  doi          = {10.5281/zenodo.4430182},
  url          = {https://doi.org/10.5281/zenodo.4430182}
}