bigartm / Bigartm

Licence: other

Fast topic modeling platform

Labels

machine-learning bigdata text-mining topic-modeling python-api

Projects that are alternatives of or similar to Bigartm

Beautiful visualizations of how language differs among document types.

Stars: ✭ 1,722 (+205.86%)

Mutual labels: text-mining, topic-modeling

[KDD 2020] Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding

Stars: ✭ 55 (-90.23%)

Mutual labels: text-mining, topic-modeling

Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"

Stars: ✭ 135 (-76.02%)

Mutual labels: text-mining, topic-modeling

How To Mine Newsfeed Data And Extract Interactive Insights In Python

A practical guide to topic mining and interactive visualizations

Stars: ✭ 61 (-89.17%)

Mutual labels: text-mining, topic-modeling

2018 Machinelearning Lectures Esa

Machine Learning Lectures at the European Space Agency (ESA) in 2018

Stars: ✭ 280 (-50.27%)

Mutual labels: text-mining, topic-modeling

Learning Social Media Analytics With R

This repository contains code and bonus content which will be added from time to time for the book "Learning Social Media Analytics with R" by Packt

Stars: ✭ 102 (-81.88%)

Mutual labels: text-mining, topic-modeling

An open-source Python library for natural language processing and text analysis.

Stars: ✭ 91 (-83.84%)

Mutual labels: text-mining, topic-modeling

Lda Topic Modeling

A PureScript, browser-based implementation of LDA topic modeling.

Stars: ✭ 91 (-83.84%)

Mutual labels: text-mining, topic-modeling

BERT, LDA, and TFIDF based keyword extraction in Python

Stars: ✭ 33 (-94.14%)

Mutual labels: text-mining, topic-modeling

Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019

Stars: ✭ 27 (-95.2%)

Mutual labels: text-mining, topic-modeling

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.

Stars: ✭ 715 (+27%)

Mutual labels: text-mining, topic-modeling

Pyshorttextcategorization

Various Algorithms for Short Text Mining

Stars: ✭ 429 (-23.8%)

Mutual labels: text-mining, topic-modeling

Weaving analytical stories from text data

Stars: ✭ 12 (-97.87%)

Mutual labels: text-mining, topic-modeling

Conversational text Analysis using various NLP techniques

Stars: ✭ 147 (-73.89%)

Mutual labels: text-mining, topic-modeling

Text mining resources

Resources for learning about Text Mining and Natural Language Processing

Stars: ✭ 358 (-36.41%)

Mutual labels: text-mining, topic-modeling

R package for web-based interactive topic model visualization.

Stars: ✭ 466 (-17.23%)

Mutual labels: text-mining, topic-modeling

Pytelegrambotapi

Python Telegram bot api.

Stars: ✭ 4,986 (+785.61%)

Mutual labels: python-api

Awesome Sentiment Analysis

Repository with everything necessary for sentiment analysis and related areas

Stars: ✭ 459 (-18.47%)

Mutual labels: text-mining

Big data architect skills

Skills a big data architect should master

Stars: ✭ 400 (-28.95%)

Mutual labels: bigdata

Semi-supervised guided topic model with custom guidedLDA

Stars: ✭ 390 (-30.73%)

Mutual labels: topic-modeling

The state-of-the-art platform for topic modeling.

What is BigARTM?

BigARTM is a tool for topic modeling based on a novel technique called Additive Regularization of Topic Models (ARTM). ARTM builds multi-objective models by adding a weighted sum of regularizers to the optimization criterion. BigARTM can combine very different objectives, including sparsing, smoothing, topic decorrelation, and many others. Combining regularizers in this way improves several quality measures at once with almost no loss of perplexity.
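
As a sketch of what "adding a weighted sum of regularizers to the optimization criterion" means, the objective can be written as below. The symbols are assumptions introduced here for illustration and follow the usual notation of the ARTM papers: n_dw is the count of word w in document d, phi_wt = p(w|t), theta_td = p(t|d), R_i are the regularizers, and tau_i are their weights.

```latex
\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
  \;+\; \sum_{i=1}^{k} \tau_i \, R_i(\Phi, \Theta)
  \;\to\; \max_{\Phi,\,\Theta},
\qquad
\phi_{wt} \ge 0,\ \sum_{w} \phi_{wt} = 1,
\quad
\theta_{td} \ge 0,\ \sum_{t} \theta_{td} = 1
```

The first term is the log-likelihood of the collection. Each regularizer R_i encodes one additional objective (for example sparsity or smoothness), and tau_i controls how strongly it influences the solution.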

References

Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections // Analysis of Images, Social Networks and Texts. 2015.
Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M., Yanina A. Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections // Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications, October 19, 2015, pp. 29-37.
Vorontsov K., Potapenko A., Plavin A. Additive Regularization of Topic Models for Topic Selection and Sparse Factorization // Statistical Learning and Data Sciences. 2015, pp. 193-202.
Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models // Machine Learning Journal, Special Issue "Data Analysis and Intelligent Optimization", Springer, 2014.
More publications can be found on our wiki page.

Related Software Packages

TopicNet is a high-level interface for BigARTM which is helpful for rapid solution prototyping and for exploring the topics of finished ARTM models.
David Blei's List of Open Source topic modeling software
MALLET: Java-based toolkit for language processing with topic modeling package
Gensim: Python topic modeling library
Vowpal Wabbit: includes an implementation of the Online-LDA algorithm

Installation

Installing with pip (Linux only)

A PyPI release is available for Linux:

$ pip install bigartm

or

$ pip install bigartm10
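
After installing, a quick way to confirm that the Python package is visible to your interpreter is to look up the module without importing it. This is a minimal sketch using only the standard library; the helper name `is_module_available` is hypothetical, and `artm` is the module name used by the Python example in this README.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "artm" is the module imported by the Python example in this README.
    print("artm available:", is_module_available("artm"))
```

If this reports that the module is unavailable, check that pip installed into the same Python environment you are running.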

Installing on Windows

We suggest using pre-built binaries.

It is also possible to compile the C++ code on Windows if you want the latest development version.

Installing on Linux / MacOS

Download a binary release or build from source using CMake:

$ mkdir build && cd build
$ cmake ..
$ make install

See here for detailed instructions.

How to Use

Command-line interface

Check out the documentation for the bigartm command-line tool.

Examples:

Basic model (20 topics, written to a CSV file, inferred in 10 passes)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --write-model-readable model.txt
--passes 10 --batch-size 50 --topics 20

Basic model with fewer tokens (extreme values filtered out based on token frequency)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt

Simple regularized model (increases sparsity up to 60-70%)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20  --write-model-readable model.txt 
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"

More advanced regularized model, with 10 sparse objective topics and 2 smooth background topics

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj"
--regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background"
--regularizer "0.25 SmoothTheta #background"
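
When these commands are launched from a script, it can help to assemble the arguments as a list so that characters such as `;` and `#` in the topic and regularizer specifications are not interpreted by a shell. The following is a minimal sketch, not part of BigARTM: the helper `build_bigartm_command` and its parameter names are hypothetical, and it only uses the flags that appear in the examples above.

```python
import shlex
from typing import List, Optional, Sequence


def build_bigartm_command(
    docword: str,
    vocab: str,
    topics: str,
    passes: int = 10,
    batch_size: int = 50,
    model_out: str = "model.txt",
    regularizers: Sequence[str] = (),
    max_df: Optional[str] = None,
    min_df: Optional[int] = None,
    executable: str = "bigartm",  # assumption: use "bigartm.exe" on Windows
) -> List[str]:
    """Assemble an argument list mirroring the CLI examples in this README."""
    cmd = [executable, "-d", docword, "-v", vocab]
    if max_df is not None:
        cmd += ["--dictionary-max-df", str(max_df)]
    if min_df is not None:
        cmd += ["--dictionary-min-df", str(min_df)]
    cmd += [
        "--passes", str(passes),
        "--batch-size", str(batch_size),
        "--topics", str(topics),
        "--write-model-readable", model_out,
    ]
    # Each regularizer is passed as its own "--regularizer <spec>" pair.
    for spec in regularizers:
        cmd += ["--regularizer", spec]
    return cmd


if __name__ == "__main__":
    cmd = build_bigartm_command(
        "docword.kos.txt",
        "vocab.kos.txt",
        topics="obj:10;background:2",
        max_df="50%",
        min_df=2,
        regularizers=[
            "0.05 SparsePhi #obj",
            "0.05 SparseTheta #obj",
            "0.25 SmoothPhi #background",
            "0.25 SmoothTheta #background",
        ],
    )
    # Print a shell-quoted version for inspection; to actually run it,
    # pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run` avoids quoting issues entirely; the `shlex.join` output is only meant for logging or for pasting into a POSIX shell.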

Interactive Python interface

BigARTM provides a full-featured and clear Python API (see Installation to configure the Python API for your OS).

Example:

import artm

# Prepare data
# Case 1: data in CountVectorizer format
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from numpy import array

cv = CountVectorizer(max_features=1000, stop_words='english')
n_wd = array(cv.fit_transform(fetch_20newsgroups().data).todense()).T
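# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocabulary = cv.get_feature_names_out().tolist()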

bv = artm.BatchVectorizer(data_format='bow_n_wd',
                          n_wd=n_wd,
                          vocabulary=vocabulary)

# Case 2: data in UCI format (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)
bv = artm.BatchVectorizer(data_format='bow_uci',
                          collection_name='kos',
                          target_folder='kos_batches')

# Learn simple LDA model (or you can use advanced artm.ARTM)
model = artm.LDA(num_topics=15, dictionary=bv.dictionary)
model.fit_offline(bv, num_collection_passes=20)

# Print results
model.get_top_tokens()

Refer to the tutorials for details on how to start using BigARTM from Python. The user's guide covers more advanced features and use cases.
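
For comparison with the command-line example that uses 10 sparse objective topics and 2 smooth background topics, here is a rough sketch of a similar setup with `artm.ARTM`. The topic names, regularizer names, and the helper `build_regularized_model` are hypothetical. The sketch also assumes that the CLI's Sparse* regularizers correspond to a negative `tau` and the Smooth* regularizers to a positive `tau` in the Python smooth/sparse regularizer classes; check the user's guide before relying on these values.

```python
# 10 sparse "objective" topics and 2 smooth "background" topics (names are illustrative).
OBJ_TOPICS = ["obj_%d" % i for i in range(10)]
BACKGROUND_TOPICS = ["background_%d" % i for i in range(2)]


def build_regularized_model(dictionary):
    """Build an artm.ARTM model with per-topic-group smoothing and sparsing."""
    # Imported lazily so the topic lists above can be inspected without BigARTM installed.
    import artm

    model = artm.ARTM(topic_names=OBJ_TOPICS + BACKGROUND_TOPICS,
                      dictionary=dictionary)

    # Assumption: negative tau sparsifies, positive tau smooths.
    model.regularizers.add(artm.SmoothSparsePhiRegularizer(
        name="SparsePhi", tau=-0.05, topic_names=OBJ_TOPICS))
    model.regularizers.add(artm.SmoothSparseThetaRegularizer(
        name="SparseTheta", tau=-0.05, topic_names=OBJ_TOPICS))
    model.regularizers.add(artm.SmoothSparsePhiRegularizer(
        name="SmoothPhi", tau=0.25, topic_names=BACKGROUND_TOPICS))
    model.regularizers.add(artm.SmoothSparseThetaRegularizer(
        name="SmoothTheta", tau=0.25, topic_names=BACKGROUND_TOPICS))
    return model
```

With a `BatchVectorizer` named `bv` as in the example above, usage would look like `model = build_regularized_model(bv.dictionary)` followed by `model.fit_offline(bv, num_collection_passes=10)`.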

Low-level API

Contributing

Refer to the Developer's Guide and follow the Code Style.

To report a bug, use the issue tracker. To ask a question, use our mailing list. Feel free to open a pull request.

License

BigARTM is released under the New BSD License, which allows unlimited redistribution for any purpose (including commercial use) as long as its copyright notices and the license's disclaimers of warranty are maintained.

Stars: ✭ 563