yumeng5 / JoSH

Licence: Apache-2.0 license
[KDD 2020] Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding

Programming Languages

C
50402 projects - #5 most used programming language
Shell
77523 projects
Python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to JoSH

Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+1200%)
Mutual labels:  text-mining, word-embeddings, topic-modeling
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+3030.91%)
Mutual labels:  text-mining, word-embeddings, topic-modeling
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-50.91%)
Mutual labels:  text-mining, word-embeddings, topic-modeling
Bigartm
Fast topic modeling platform
Stars: ✭ 563 (+923.64%)
Mutual labels:  text-mining, topic-modeling
Pyshorttextcategorization
Various Algorithms for Short Text Mining
Stars: ✭ 429 (+680%)
Mutual labels:  text-mining, topic-modeling
Ldavis
R package for web-based interactive topic model visualization.
Stars: ✭ 466 (+747.27%)
Mutual labels:  text-mining, topic-modeling
How To Mine Newsfeed Data And Extract Interactive Insights In Python
A practical guide to topic mining and interactive visualizations
Stars: ✭ 61 (+10.91%)
Mutual labels:  text-mining, topic-modeling
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (+65.45%)
Mutual labels:  text-mining, topic-modeling
Learning Social Media Analytics With R
This repository contains code and bonus content which will be added from time to time for the book "Learning Social Media Analytics with R" by Packt
Stars: ✭ 102 (+85.45%)
Mutual labels:  text-mining, topic-modeling
Texthero
Text preprocessing, representation and visualization from zero to hero.
Stars: ✭ 2,407 (+4276.36%)
Mutual labels:  text-mining, word-embeddings
Kate
Code & data accompanying the KDD 2017 paper "KATE: K-Competitive Autoencoder for Text"
Stars: ✭ 135 (+145.45%)
Mutual labels:  text-mining, topic-modeling
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+256.36%)
Mutual labels:  text-mining, word-embeddings
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (+550.91%)
Mutual labels:  text-mining, topic-modeling
2018 Machinelearning Lectures Esa
Machine Learning Lectures at the European Space Agency (ESA) in 2018
Stars: ✭ 280 (+409.09%)
Mutual labels:  text-mining, topic-modeling
Nlp Notebooks
A collection of notebooks for Natural Language Processing from NLP Town
Stars: ✭ 513 (+832.73%)
Mutual labels:  text-mining, word-embeddings
Text-Analysis
Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.
Stars: ✭ 48 (-12.73%)
Mutual labels:  text-mining, word-embeddings
kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-40%)
Mutual labels:  text-mining, topic-modeling
converse
Conversational text Analysis using various NLP techniques
Stars: ✭ 147 (+167.27%)
Mutual labels:  text-mining, topic-modeling
teanaps
An open-source Python library for natural language processing and text analysis.
Stars: ✭ 91 (+65.45%)
Mutual labels:  text-mining, topic-modeling
text-analysis
Weaving analytical stories from text data
Stars: ✭ 12 (-78.18%)
Mutual labels:  text-mining, topic-modeling

JoSH

The source code used for Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding, published in KDD 2020. The code structure (especially file reading and saving functions) is adapted from the Word2Vec implementation.

Requirements

A C compiler (e.g., GCC) is needed to compile the source code; the provided run.sh script handles compilation.
Example Datasets

We provide two example datasets, the New York Times annotated corpus and the arXiv abstract corpus, which are used in the paper. We also provide a shell script run.sh that compiles the source code and performs topic mining on the two example datasets. You should be able to obtain results similar to those reported in the paper.
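As a minimal sketch, assuming run.sh is already configured for one of the example datasets (inspect the script to see which dataset it targets and adjust if needed):

$ # Compile the source and run hierarchical topic mining on an example dataset;
$ # run.sh performs both steps (see the script for dataset selection).
$ chmod +x run.sh
$ ./run.sh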

Preparing Your Datasets

Corpus and Inputs

You will first need to create a directory under datasets (e.g., datasets/your_dataset) and put three files in it (a minimal example of all three follows the list below):

  • A text file of the corpus, e.g., datasets/your_dataset/text.txt. Note: When preparing the text corpus, make sure each line in the file is one document/paragraph.
  • A text file with the category names/keywords for each category, e.g., datasets/your_dataset/category_names.txt, where each line contains the category id (starting from 0) and the seed words for the category. You can provide an arbitrary number of seed words on each line (at least 1 per category; if there are multiple seed words, separate them with whitespace characters). Note: You need to ensure that every provided seed word appears in the vocabulary of the corpus.
  • A category taxonomy file with the category structure, e.g., datasets/your_dataset/taxonomy.txt, where each line contains two category ids separated by a whitespace character; the former is the parent category of the latter. Note: You need to ensure that the category ids used in the taxonomy file are consistent with those in the category name file.
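Below is a minimal sketch of the three input files for a hypothetical dataset named your_dataset; all category names, seed words, and taxonomy edges here are made-up placeholders (the provided example datasets show the exact expected layout):

$ mkdir -p datasets/your_dataset

$ # text.txt: one document/paragraph per line, tokens separated by whitespace.
$ head -2 datasets/your_dataset/text.txt
the first example document as a single line of text
the second example document as another single line

$ # category_names.txt: category id followed by one or more seed words.
$ cat datasets/your_dataset/category_names.txt
0 science
1 sports
2 physics
3 chemistry
4 basketball soccer

$ # taxonomy.txt: each line is "parent_id child_id".
$ cat datasets/your_dataset/taxonomy.txt
0 2
0 3
1 4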

Preprocessing

  • You can use any tool to preprocess the corpus (e.g., tokenization, lowercasing). If you do not have a preferred tool, you can use our provided preprocessing script: simply add your corpus directory to auto_phrase.sh and run it. The script assumes that the raw corpus is named text.txt and will generate a phrase-segmented, lowercased corpus named phrase_text.txt under the same directory.
  • You need to run src/read_taxo.py to generate two taxonomy information files: matrix_taxonomy.txt, which represents the taxonomy in matrix form, and level_taxonomy.txt, which records the node level information. See run.sh for an example of using src/read_taxo.py to generate these two files (a command sketch follows this list).
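A hedged sketch of the two preprocessing steps; the paths assume the scripts are invoked from the repository root, and the exact arguments of src/read_taxo.py should be copied from run.sh rather than from here:

$ # Phrase-segment and lowercase the raw corpus; edit auto_phrase.sh first so
$ # that it points at datasets/your_dataset (reads text.txt, writes phrase_text.txt).
$ ./auto_phrase.sh

$ # Generate matrix_taxonomy.txt and level_taxonomy.txt from taxonomy.txt;
$ # see run.sh for the arguments this script expects.
$ python src/read_taxo.py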

Pretrained Embedding (Optional)

We provide a 100-dimensional pretrained JoSE embedding, jose_100.zip. You can also use other pretrained embeddings (use the -load-emb argument to specify the pretrained embedding file). Pretrained embeddings are optional (omit the -load-emb argument if you do not use one), but they generally result in better embedding initialization and higher-quality topic mining results.
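A short sketch of wiring in the pretrained embedding; the unpacked file name jose_100.txt is an assumption, so check the actual contents of the archive:

$ # Unpack the 100-dimensional pretrained JoSE embedding.
$ unzip jose_100.zip
$ # Then point the trainer at the unpacked file, e.g.:  -load-emb jose_100.txt
$ # (omit -load-emb entirely to train without a pretrained embedding)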

Command Line Arguments

Invoke the command without arguments for a list of parameters and their meanings:

$ ./src/josh
Parameters:
	##########   Input/Output:   ##########
	-train <file> (mandatory argument)
		Use text data from <file> to train the model
	-category-file <file>
		Use <file> to provide the topic names/keywords
	-matrix-file <file>
		Use <file> to provide the taxonomy file in matrix form; generated by read_taxo.py
	-level-file <file>
		Use <file> to provide the node level information file; generated by read_taxo.py
	-res <file>
		Use <file> to save the hierarchical topic mining results
	-k <int>
		Set the number of terms per topic in the output file; default is 10
	-word-emb <file>
		Use <file> to save the resulting word embeddings
	-tree-emb <file>
		Use <file> to save the resulting category embeddings
	-load-emb <file>
		The pretrained embeddings will be read from <file>
	-binary <int>
		Save the resulting vectors in binary mode; default is 0 (off)
	-save-vocab <file>
		The vocabulary will be saved to <file>
	-read-vocab <file>
		The vocabulary will be read from <file>, not constructed from the training data

	##########   Embedding Training:   ##########
	-size <int>
		Set dimension of text embeddings; default is 100
	-iter <int>
		Set the number of iterations to train on the corpus (performing topic mining); default is 5
	-pretrain <int>
		Set the number of iterations to pretrain on the corpus (without performing topic mining); default is 2
	-expand <int>
		Set the number of terms to be added per topic per iteration; default is 1
	-window <int>
		Set max skip length between words; default is 5
	-word-margin <float>
		Set the word embedding learning margin; default is 0.25
	-cat-margin <float>
		Set the intra-category coherence margin m_intra; default is 0.9
	-sample <float>
		Set threshold for occurrence of words. Those that appear with higher frequency in the training data
		will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
	-negative <int>
		Number of negative examples; default is 2, common values are 3 - 5 (0 = not used)
	-threads <int>
		Use <int> threads (default 12)
	-min-count <int>
		This will discard words that appear less than <int> times; default is 5
	-alpha <float>
		Set the starting learning rate; default is 0.025
	-debug <int>
		Set the debug mode (default = 2 = more info during training)

See run.sh for an example of how to set these arguments.
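For illustration, here is a hedged invocation that combines the documented arguments; all dataset and output paths are placeholders for a hypothetical dataset, and run.sh remains the authoritative reference for the settings used in the paper:

$ ./src/josh -train datasets/your_dataset/phrase_text.txt \
    -category-file datasets/your_dataset/category_names.txt \
    -matrix-file datasets/your_dataset/matrix_taxonomy.txt \
    -level-file datasets/your_dataset/level_taxonomy.txt \
    -res datasets/your_dataset/res_topics.txt \
    -word-emb datasets/your_dataset/emb_word.txt \
    -tree-emb datasets/your_dataset/emb_tree.txt \
    -load-emb jose_100.txt \
    -k 10 -size 100 -iter 5 -pretrain 2 -expand 1 \
    -window 5 -negative 2 -threads 12 -min-count 5 -alpha 0.025

The numeric values simply restate the documented defaults; -res receives the top-k terms mined for each topic, while -word-emb and -tree-emb save the learned word and category embeddings.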

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{meng2020hierarchical,
  title={Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding},
  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Zhang, Yu and Zhang, Chao and Han, Jiawei},
  booktitle={Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
  year={2020}
}