TropComplique / Lda2vec Pytorch

License: MIT
Topic modeling with word vectors

Projects that are alternatives to or similar to Lda2vec Pytorch

2018 Machinelearning Lectures Esa
Machine Learning Lectures at the European Space Agency (ESA) in 2018
Stars: ✭ 280 (+159.26%)
Mutual labels:  jupyter-notebook, topic-modeling
Learning Vis Tools
Learning Vis Tools: Tutorial materials for Data Visualization course at HKUST
Stars: ✭ 108 (+0%)
Mutual labels:  jupyter-notebook
Dwc
Darwin Core
Stars: ✭ 106 (-1.85%)
Mutual labels:  jupyter-notebook
Kalman And Bayesian Filters In Python
Kalman Filter book using Jupyter Notebook. Focuses on building intuition and experience, not formal proofs. Includes Kalman filters, extended Kalman filters, unscented Kalman filters, particle filters, and more. All exercises include solutions.
Stars: ✭ 11,233 (+10300.93%)
Mutual labels:  jupyter-notebook
Ml Ai Experiments
All my experiments with AI and ML
Stars: ✭ 107 (-0.93%)
Mutual labels:  jupyter-notebook
Getting Started With Google Bert
Build and train state-of-the-art natural language processing models using BERT
Stars: ✭ 107 (-0.93%)
Mutual labels:  jupyter-notebook
Texas Hold Em Ai
Research on Texas Hold'em AI
Stars: ✭ 107 (-0.93%)
Mutual labels:  jupyter-notebook
Py Wsi
Python package for dealing with whole slide images (.svs) for machine learning, particularly for fast prototyping. Includes patch sampling and storing using OpenSlide. Patches may be stored in LMDB, HDF5 files, or to disk. It is highly recommended to fork and download this repository so that personal customisations can be made for your work.
Stars: ✭ 107 (-0.93%)
Mutual labels:  jupyter-notebook
Robustness applications
Notebooks for reproducing the paper "Computer Vision with a Single (Robust) Classifier"
Stars: ✭ 108 (+0%)
Mutual labels:  jupyter-notebook
Numpy Ml
Machine learning, in numpy
Stars: ✭ 11,100 (+10177.78%)
Mutual labels:  topic-modeling
Prml
PRML algorithms implemented in Python
Stars: ✭ 10,206 (+9350%)
Mutual labels:  jupyter-notebook
Facemaskdetection
Open-source face mask detection model and data. Detect faces and determine whether people are wearing masks.
Stars: ✭ 1,677 (+1452.78%)
Mutual labels:  jupyter-notebook
Ganlocalediting
Stars: ✭ 108 (+0%)
Mutual labels:  jupyter-notebook
Aa228 Notebook
IJulia notebooks for AA228/CS238 Decision Making Under Uncertainty course at Stanford University
Stars: ✭ 107 (-0.93%)
Mutual labels:  jupyter-notebook
Ultra96 Pynq
Board files to build Ultra 96 PYNQ image
Stars: ✭ 108 (+0%)
Mutual labels:  jupyter-notebook
Tf Mrnn
Re-implementation of the m-RNN model using TensorFLow
Stars: ✭ 107 (-0.93%)
Mutual labels:  jupyter-notebook
Pyldavis
Python library for interactive topic model visualization. Port of the R LDAvis package.
Stars: ✭ 1,550 (+1335.19%)
Mutual labels:  jupyter-notebook
Tensorflow Examples
TensorFlow Tutorial and Examples for Beginners (support TF v1 & v2)
Stars: ✭ 41,480 (+38307.41%)
Mutual labels:  jupyter-notebook
Sw machine learning
machine learning
Stars: ✭ 108 (+0%)
Mutual labels:  jupyter-notebook
Ml Demos
Python code examples for the feedly Machine Learning blog (https://blog.feedly.com/category/all/Machine-Learning/)
Stars: ✭ 108 (+0%)
Mutual labels:  jupyter-notebook

lda2vec

A PyTorch implementation of Moody's lda2vec, a method for topic modeling with word embeddings.
The original paper: Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec.

Warning: I personally believe that it is quite hard to make the lda2vec algorithm work.
Sometimes it finds a couple of topics, sometimes it finds none, and usually many of the topics it does find are a total mess.
The algorithm is prone to poor local minima and depends greatly on the initial topic assignments.

For my results see 20newsgroups/explore_trained_model.ipynb. Also see Implementation details below.

Loss

The training proceeds as follows. First, convert the document corpus to a set of tuples
{(document id, word, the window around the word) | for each word in the corpus}.
Second, for each tuple, maximize the following objective function:

$$\sum_{i} \log \sigma(c^{\top} w_i) \;+\; \sum_{k} \log \sigma(-c^{\top} w_k) \;+\; \lambda \sum_{j} \log p_j, \qquad c = w + \sum_{j} p_j t_j$$

where:

  • c — the context vector,
  • w — the embedding vector of a word,
  • λ — a positive constant that controls sparsity,
  • i — runs over the window around the word,
  • k — runs over sampled negative words,
  • j — runs over topics,
  • p — the probability distribution over topics for the document,
  • t — the topic vectors.

When training, I also shuffle and batch the tuples.
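
As a rough illustration of this objective, here is a minimal PyTorch sketch for a single tuple. This is not the repo's actual utils/lda2vec_loss.py; the function name, the shapes, and the default value of lam are assumptions:

```python
import torch
import torch.nn.functional as F

def lda2vec_objective(word_vec, window_vecs, neg_vecs,
                      doc_topic_logits, topic_vecs, lam=1.0):
    """Objective for one (document id, word, window) tuple.

    word_vec:         (dim,)          embedding of the pivot word (w)
    window_vecs:      (win, dim)      embeddings of words in the window (w_i)
    neg_vecs:         (neg, dim)      embeddings of sampled negative words (w_k)
    doc_topic_logits: (n_topics,)     unnormalized topic weights of the document
    topic_vecs:       (n_topics, dim) topic vectors (t_j)
    lam:              the sparsity constant (lambda)
    """
    p = F.softmax(doc_topic_logits, dim=0)          # topic distribution p_j
    doc_vec = p @ topic_vecs                        # document vector = sum_j p_j * t_j
    c = word_vec + doc_vec                          # context vector c = w + doc_vec

    positive = F.logsigmoid(window_vecs @ c).sum()  # sum_i log sigmoid(c^T w_i)
    negative = F.logsigmoid(-(neg_vecs @ c)).sum()  # sum_k log sigmoid(-c^T w_k)
    sparsity = lam * torch.log(p + 1e-8).sum()      # lambda * sum_j log p_j

    return positive + negative + sparsity           # to be maximized
```

In practice one would maximize this by minimizing its negative over shuffled batches of tuples with a standard optimizer.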

How to use it

  1. Go to 20newsgroups/.
  2. Run get_windows.ipynb to prepare data.
  3. Run python train.py for training.
  4. Run explore_trained_model.ipynb.

To use this on your data you need to edit get_windows.ipynb. There are also hyperparameters in 20newsgroups/train.py, utils/training.py, and utils/lda2vec_loss.py.
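
For reference, preparing the data amounts to building the tuples described in the Loss section. A minimal sketch of the idea behind get_windows.ipynb (illustrative only; the function name and the half-window size hw are assumptions):

```python
def get_windows(tokenized_docs, hw=5):
    """Convert a corpus into (document id, word, window around the word) tuples.

    tokenized_docs: list of documents, each a list of tokens.
    hw: half-window size, i.e. up to hw tokens on each side of the pivot word.
    """
    tuples = []
    for doc_id, tokens in enumerate(tokenized_docs):
        for i, word in enumerate(tokens):
            window = tokens[max(0, i - hw):i] + tokens[i + 1:i + 1 + hw]
            tuples.append((doc_id, word, window))
    return tuples

# Example: get_windows([['topic', 'modeling', 'with', 'word', 'vectors']], hw=2)
# yields (0, 'with', ['topic', 'modeling', 'word', 'vectors']) among others.
```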

Implementation details

  • I use vanilla LDA to initialize lda2vec (the topic assignments for each document). This is not like in the original paper and not how it is supposed to work, but without it the results are quite bad.
    I also use a temperature to smooth the initialization, in the hope that lda2vec will then have a chance to find better topic assignments (see the sketch after this list).
  • I add noise to some gradients while training.
  • I reweight the loss according to document lengths.
  • Before training lda2vec, I train a 50-dimensional skip-gram word2vec model to initialize the word embeddings.
  • For text preprocessing:
    1. do word lemmatization
    2. remove rare and frequent words
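
For the temperature smoothing mentioned in the first bullet, the idea looks roughly like this (a sketch; doc_topic_dist would be the document-topic matrix from the vanilla LDA run, and the temperature value is an assumption):

```python
import numpy as np

def smooth_lda_init(doc_topic_dist, temperature=7.0):
    """Soften LDA's document-topic distributions before using them to
    initialize lda2vec, so the initial assignments are less confident.

    doc_topic_dist: (n_docs, n_topics), rows are probability distributions.
    temperature: values > 1 flatten the distributions.
    """
    logits = np.log(doc_topic_dist + 1e-10) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)       # renormalize each row
```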

Requirements

  • pytorch 0.2, spacy 1.9, gensim 3.0
  • numpy, sklearn, tqdm
  • matplotlib, Multicore-TSNE