hugochan / Kate

Licence: bsd-3-clause

Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"

KATE: K-Competitive Autoencoder for Text

Prerequisites

This code is written in Python. To use it you will need:

Python 2.7
A recent version of NumPy
A recent version of NLTK
TensorFlow = 1.15.2
Keras = 2.0.6

Getting started

To preprocess the corpus, e.g., 20 Newsgroups, just run the following:

    python construct_20news.py -train [train_dir] -test [test_dir] -o [out_dir] -threshold [word_freq_threshold] -topn [top_n_words]

It outputs four JSON files under the [out_dir] directory: train_data, train_label, test_data, and test_label. You can download the preprocessed data we used in our experiments here.

To train the KATE model, just run the following:

    python train.py -i [train_data] -nd [num_topics] -ne [num_epochs] -bs [batch_size] -nv [num_validation] -ctype kcomp -ck [top_k] -sm [model_file]
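
The -ctype kcomp and -ck [top_k] options select the k-competitive layer from the paper. The following is a simplified NumPy sketch of that idea as the paper describes it, not the Keras code in this repository: only the strongest positive and negative activations survive, and the total activation of the losers is added to the winners, scaled by an amplification factor. The function name k_competitive, the argument names, and the default alpha value are assumptions made for illustration.

```python
import numpy as np


def k_competitive(z, k, alpha=1.0):
    """Keep about k winning activations of a 1-D vector and zero the rest.

    Illustrative only: `alpha` is a placeholder value for the
    amplification factor, not a setting taken from the repository.
    """
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    n_pos = (k + 1) // 2  # ceil(k / 2) winners among positive activations
    n_neg = k // 2        # floor(k / 2) winners among negative activations

    for sign, budget in ((1.0, n_pos), (-1.0, n_neg)):
        idx = np.flatnonzero(sign * z > 0)  # neurons with this sign
        if idx.size == 0:
            continue
        # Rank the neurons of this sign by magnitude, largest first.
        order = idx[np.argsort(-np.abs(z[idx]))]
        winners, losers = order[:budget], order[budget:]
        # Reallocate the losers' summed activation to every winner.
        energy = z[losers].sum()
        out[winners] = z[winners] + alpha * energy
    return out


hidden = [0.8, 0.1, -0.6, 0.3, -0.2]
print(k_competitive(hidden, k=2))
```

With k=2 the sketch keeps one positive and one negative activation. In the paper the competition is applied during training only, so treat this as a way to build intuition for [top_k] rather than as a drop-in component.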

To predict on the test set, just run the following:

    python pred.py -i [test_data] -lm [model_file] -o [output_doc_vec_file] -st [output_topics] -sw [output_sample_words] -wc [output_word_clouds]

To train a simple classifier, just run the following:

    python run_classifier.py [train_doc_codes] [train_doc_labels] [test_doc_codes] [test_doc_labels] -nv [num_validation] -ne [num_epochs] -bs [batch_size]

To train baseline methods, e.g., VAE, just run the following:

    python train_vae.py -i [train_data] -nd [num of dimensions] -ne [num_epochs] -bs [batch_size] -nv [num_validation] -sm [model_file]

Notes

To apply the KATE model to your own dataset, you will need to preprocess the dataset yourself: build the vocabulary and the Bag-of-Words representation of each document.
The KATE model learns vector representations of words (those in the vocabulary) as well as documents in an unsupervised manner. It can also extract topics from a corpus. Document labels are needed only if you want to, for example, train a document classifier on the learned document vectors.
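
As a rough illustration of the preprocessing described above, the sketch below builds a vocabulary and sparse Bag-of-Words counts from documents that are already tokenized. The function names, the min_freq and top_n parameters (loosely mirroring the -threshold and -topn options), and the dictionary layout are all assumptions; the exact JSON structure that train.py expects is not described here, so check construct_20news.py before writing your own files.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_freq=1, top_n=None):
    """Map each kept word to an integer index, most frequent words first."""
    freq = Counter(word for doc in tokenized_docs for word in doc)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    kept = [word for word, count in ranked if count >= min_freq][:top_n]
    return {word: index for index, word in enumerate(kept)}


def to_bow(tokens, vocab):
    """Sparse Bag-of-Words: {word index: count}, skipping unknown words."""
    return dict(Counter(vocab[t] for t in tokens if t in vocab))


docs = [
    ["space", "nasa", "orbit", "space"],
    ["hockey", "team", "space"],
]
vocab = build_vocab(docs, min_freq=1, top_n=3)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Words outside the vocabulary are dropped when counting. The same restriction applies later on: any word you look up or plot must be one that survived this filtering.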

FAQ

KeyError when plotting word clouds

Make sure the words belong to the vocabulary. See here.

Architecture

Experiment results on 20 Newsgroups

PCA on the 20-D document vectors

TSNE on the 20-D document vectors

Five nearest neighbors in the word representation space

Extracted topics

Text classification results on 20 Newsgroups

Visualization of the normalized topic-word weight matrices of KATE & LDA (KATE learns distinctive patterns)

Reference

If you found this code useful, please cite the following paper:

Yu Chen and Mohammed J. Zaki. "KATE: K-Competitive Autoencoder for Text." In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug 2017.

@inproceedings {chen2017kate,
author = { Yu Chen and Mohammed J. Zaki },
title = { KATE: K-Competitive Autoencoder for Text },
booktitle = { Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining },
doi = { http://dx.doi.org/10.1145/3097983.3098017 },
year = { 2017 },
month = { Aug }
}

Other research papers that applied the KATE model:

Chen, Yu, Rhaad M. Rabbani, Aparna Gupta, and Mohammed J. Zaki. "Comparative text analytics via topic modeling in banking." In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1-8. IEEE, 2017.

@inproceedings{chen2017comparative,
  title={Comparative text analytics via topic modeling in banking},
  author={Chen, Yu and Rabbani, Rhaad M and Gupta, Aparna and Zaki, Mohammed J},
  booktitle={2017 IEEE Symposium Series on Computational Intelligence (SSCI)},
  pages={1--8},
  year={2017},
  organization={IEEE}
}
