hugochan / Kate

Licence: bsd-3-clause
Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"

KATE: K-Competitive Autoencoder for Text

Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"

Prerequisites

This code is written in Python. To use it you will need:

Getting started

To preprocess the corpus, e.g., 20 Newsgroups, just run the following:

    python construct_20news.py -train [train_dir] -test [test_dir] -o [out_dir] -threshold [word_freq_threshold] -topn [top_n_words]

This writes four JSON files to the [out_dir] directory: train_data, train_label, test_data, and test_label. You can download the preprocessed data we used in our experiments here.

To train the KATE model, just run the following:

    python train.py -i [train_data] -nd [num_topics] -ne [num_epochs] -bs [batch_size] -nv [num_validation] -ctype kcomp -ck [top_k] -sm [model_file]

To predict on the test set, just run the following:

    python pred.py -i [test_data] -lm [model_file] -o [output_doc_vec_file] -st [output_topics] -sw [output_sample_words] -wc [output_word_clouds]

To train a simple classifier, just run the following:

    python run_classifier.py [train_doc_codes] [train_doc_labels] [test_doc_codes] [test_doc_labels] -nv [num_validation] -ne [num_epochs] -bs [batch_size]

To train baseline methods, e.g., VAE, just run the following:

    python train_vae.py -i [train_data] -nd [num_dimensions] -ne [num_epochs] -bs [batch_size] -nv [num_validation] -sm [model_file]

Notes

  1. To apply the KATE model to your own dataset, you will need to preprocess it yourself: prepare the vocabulary and the Bag-of-Words representation of each document.

  2. The KATE model learns vector representations of words (those in the vocabulary) and of documents in an unsupervised manner. It can also extract topics from a corpus. Document labels are needed only if you want to, for example, train a document classifier on the learned document vectors.
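
As a rough illustration of the preprocessing described in note 1, the sketch below builds a vocabulary and sparse Bag-of-Words counts in plain Python. The function names, the tokenizer, and the `{word_index: count}` layout are assumptions made for this example, not the format that `train.py` necessarily expects. The `min_freq` and `top_n` parameters only loosely mirror the `-threshold` and `-topn` options of `construct_20news.py`. Check that script for the exact output structure before adapting this.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on runs of non-letters (a deliberately simple tokenizer)."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def build_vocab(docs, min_freq=1, top_n=None):
    """Map each kept word to an integer index (hypothetical helper).

    Words seen fewer than `min_freq` times are dropped, and only the
    `top_n` most frequent survivors are kept (all of them if top_n is None).
    """
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    kept = [(word, n) for word, n in counts.items() if n >= min_freq]
    # Sort by descending frequency, then alphabetically, so the result is deterministic.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    if top_n is not None:
        kept = kept[:top_n]
    return {word: idx for idx, (word, _) in enumerate(kept)}


def to_bow(doc, vocab):
    """Return a sparse Bag-of-Words dict {word_index: count} for one document."""
    counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
    return {vocab[word]: n for word, n in counts.items()}


# Toy corpus for illustration only.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]
vocab = build_vocab(docs, min_freq=2)
bows = [to_bow(doc, vocab) for doc in docs]
```

Words outside the vocabulary are silently dropped from each document, which is the usual behaviour when a frequency threshold is applied.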

FAQ

  1. KeyError when plotting word clouds

Make sure the words belong to the vocabulary. See here.
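
One way to avoid the error is to check the query words against the vocabulary before plotting. The snippet below is a minimal sketch of that check. The helper name `split_by_vocab` and the toy vocabulary are hypothetical, and it assumes the vocabulary is available as a dict mapping words to indices.

```python
def split_by_vocab(words, vocab):
    """Separate `words` into those present in `vocab` and those missing from it."""
    known = [w for w in words if w in vocab]
    missing = [w for w in words if w not in vocab]
    return known, missing


# Hypothetical vocabulary (word -> index) and query words, for illustration only.
vocab = {"space": 0, "nasa": 1, "orbit": 2}
query = ["space", "rocket", "orbit"]

known, missing = split_by_vocab(query, vocab)
if missing:
    print("Skipping out-of-vocabulary words:", ", ".join(missing))
```

Only the words in `known` would then be passed on to the word-cloud step.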

Architecture

Experiment results on 20 Newsgroups

PCA on the 20-D document vectors

20news_doc_vec_pca

TSNE on the 20-D document vectors

20news_doc_vec_tsne

Five nearest neighbors in the word representation space

20news_word_vec
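
For readers who want to run a similar nearest-neighbor query on their own word vectors, here is a small sketch. The similarity measure behind the figure is not stated here, so cosine similarity is an assumption. The function names and the 2-D toy vectors are also made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, vectors, k=5):
    """Return the k words whose vectors are most similar to that of `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


# Toy 2-D vectors for illustration; real word vectors come from the trained model.
vectors = {
    "car": [1.0, 0.1],
    "truck": [0.9, 0.2],
    "engine": [0.8, 0.4],
    "god": [0.1, 1.0],
    "faith": [0.2, 0.9],
}
neighbors = nearest_neighbors("car", vectors, k=2)
```

This brute-force scan is linear in the vocabulary size per query, which should be acceptable for vocabularies of a few thousand words.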

Extracted topics

Text classification results on 20 Newsgroups

Visualization of the normalized topic-word weight matrices of KATE & LDA (KATE learns distinctive patterns)

Reference

If you found this code useful, please cite the following paper:

Yu Chen and Mohammed J. Zaki. "KATE: K-Competitive Autoencoder for Text." In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug 2017.

@inproceedings{chen2017kate,
  title={KATE: K-Competitive Autoencoder for Text},
  author={Yu Chen and Mohammed J. Zaki},
  booktitle={Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  doi={10.1145/3097983.3098017},
  year={2017},
  month={Aug}
}

Other research papers that applied the KATE model:

Chen, Yu, Rhaad M. Rabbani, Aparna Gupta, and Mohammed J. Zaki. "Comparative text analytics via topic modeling in banking." In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1-8. IEEE, 2017.

@inproceedings{chen2017comparative,
  title={Comparative text analytics via topic modeling in banking},
  author={Chen, Yu and Rabbani, Rhaad M and Gupta, Aparna and Zaki, Mohammed J},
  booktitle={2017 IEEE Symposium Series on Computational Intelligence (SSCI)},
  pages={1--8},
  year={2017},
  organization={IEEE}
}