
yao8839836 / KGE-LDA

Licence: other
Knowledge Graph Embedding LDA. AAAI 2017

Programming Languages

java

Projects that are alternatives to or similar to KGE-LDA

lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-22.86%)
Mutual labels:  embeddings, topic-modeling, lda
Familia
A Toolkit for Industrial Topic Modeling
Stars: ✭ 2,499 (+7040%)
Mutual labels:  topic-modeling, lda
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (+160%)
Mutual labels:  topic-modeling, lda
Ldagibbssampling
Open Source Package for Gibbs Sampling of LDA
Stars: ✭ 218 (+522.86%)
Mutual labels:  topic-modeling, lda
pydataberlin-2017
Repo for my talk at the PyData Berlin 2017 conference
Stars: ✭ 63 (+80%)
Mutual labels:  topic-modeling, lda
Lda
LDA topic modeling for node.js
Stars: ✭ 262 (+648.57%)
Mutual labels:  topic-modeling, lda
Sttm
Short Text Topic Modeling, JAVA
Stars: ✭ 100 (+185.71%)
Mutual labels:  topic-modeling, lda
tomoto-ruby
High performance topic modeling for Ruby
Stars: ✭ 49 (+40%)
Mutual labels:  topic-modeling, lda
TopicsExplorer
Explore your own text collection with a topic model – without prior knowledge.
Stars: ✭ 53 (+51.43%)
Mutual labels:  topic-modeling, lda
Topic-Modeling-Workshop-with-R
A workshop on analyzing topic modeling (LDA, CTM, STM) using R
Stars: ✭ 51 (+45.71%)
Mutual labels:  topic-modeling, lda
kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-5.71%)
Mutual labels:  topic-modeling, lda
PyLDA
A Latent Dirichlet Allocation implementation in Python.
Stars: ✭ 51 (+45.71%)
Mutual labels:  topic-modeling, lda
NMFADMM
A sparsity aware implementation of "Alternating Direction Method of Multipliers for Non-Negative Matrix Factorization with the Beta-Divergence" (ICASSP 2014).
Stars: ✭ 39 (+11.43%)
Mutual labels:  topic-modeling, lda
Lightlda
fast sampling algorithm based on CGS
Stars: ✭ 49 (+40%)
Mutual labels:  topic-modeling, lda
Contextualized Topic Models
A python package to run contextualized topic modeling. CTMs combine BERT with topic models to get coherent topics. Also supports multilingual tasks. Cross-lingual Zero-shot model published at EACL 2021.
Stars: ✭ 318 (+808.57%)
Mutual labels:  embeddings, topic-modeling
hlda
Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model
Stars: ✭ 138 (+294.29%)
Mutual labels:  topic-modeling, lda
amazon-reviews
Sentiment Analysis & Topic Modeling with Amazon Reviews
Stars: ✭ 26 (-25.71%)
Mutual labels:  topic-modeling, lda
VarCLR
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning
Stars: ✭ 30 (-14.29%)
Mutual labels:  embeddings
muse-as-service
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.
Stars: ✭ 45 (+28.57%)
Mutual labels:  embeddings
TopicNet
Interface for easier topic modelling.
Stars: ✭ 127 (+262.86%)
Mutual labels:  topic-modeling

KGE-LDA

The implementation of Knowledge Graph Embedding LDA (KGE-LDA) from our paper:

Liang Yao, Yin Zhang, Baogang Wei, Zhe Jin, Rui Zhang, Yangyang Zhang, and Qinfei Chen. "Incorporating Knowledge Graph Embeddings into Topic Modeling." In Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pp. 3119-3126. AAAI Press, 2017.

Requires Java 7+ and Eclipse.

Implementation

KGE-LDA(a): KGE-LDA/src/topic/LinkLDA_KGE.java

KGE-LDA(b): KGE-LDA/src/topic/CorrLDA_KGE.java

Main entries

The main entries for KGE-LDA(a), KGE-LDA(b), Corr-LDA, CI-LDA (Link-LDA), CTM, and LDA on the three datasets (20NG, NIPS, and Ohsumed) are the ResultXXXXXX.java files in KGE-LDA/src/result20ng/, KGE-LDA/src/resultnips/, and KGE-LDA/src/resultohsumed/.

The main entries for GK-LDA on the three datasets are the ResultGKLDAXXX.java files in GKLDA-master/Src/src/launch/.

The main entries for LF-LDA on the three datasets are the ResultLFLDAXXX.java files in LFTM/src/test/.

Reproduce results

To reproduce the results in the paper:

(1) Decompress /KGE-LDA/data.zip and /KGE-LDA/file.7z in the same folder.

(2) Download the three Wikipedia index archives, decompress them, and put the decompressed folders into GKLDA-master/Src/file/, /KGE-LDA/file/, and /LFTM/file/:

- 20ng_word_wiki_small_index.zip (http://pan.baidu.com/s/1qXDVoVq or https://drive.google.com/open?id=1yA3tJNIFoZYQQCMMieQCzN0ABUscJNYB)
- nips_word_wiki_index_small.rar (http://pan.baidu.com/s/1hs9HZve or https://drive.google.com/file/d/1z7y003TsprENKJ3V92sG_SHwst5MFWBN/view?usp=sharing)
- ohsumed_23_word_wiki_index.zip (http://pan.baidu.com/s/1miATVkO or https://drive.google.com/open?id=127zVR2GH7AXvIL1xRbXWWzQaSDwzxtLG)

The 4,776,093 Wikipedia articles themselves are available at http://pan.baidu.com/s/1slaTPoT and https://drive.google.com/open?id=1JTLu-AqNhf7xUWTgKKm8360wT0OQve7O; I extracted them from http://deepdive.stanford.edu/opendata/.

(3) Run the main entries.

Others

(1) The raw text datasets are in three folders under /KGE-LDA/data/.

(2) The tokenized documents are in /KGE-LDA/file/20ng/, /KGE-LDA/file/nips/ and /KGE-LDA/file/ohsumed/. Documents after stop-word removal are in /KGE-LDA/file/xxx_remove_stop, and documents after both stop-word and rare-word removal are in /KGE-LDA/file/xxx_remove_rare. A small sketch of this kind of filtering follows.
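
A minimal sketch (Java 7 style) of the two filtering passes just described. The stopword set and the document-frequency threshold are illustrative assumptions, not the exact settings used in this repository:

```java
import java.util.*;

// Sketch of the two filtering passes: (1) drop stopwords,
// (2) drop words that appear in too few documents.
// The stopword set and minDf threshold are assumptions for illustration.
public class FilterSketch {
    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<String>(
                Arrays.asList("the", "a", "of", "and", "to"));
        List<List<String>> docs = new ArrayList<List<String>>();
        docs.add(new ArrayList<String>(Arrays.asList("the", "topic", "model", "of", "text")));
        docs.add(new ArrayList<String>(Arrays.asList("a", "topic", "embedding", "model")));

        // Pass 1: drop stopwords and count each surviving word's document frequency.
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (List<String> doc : docs) {
            doc.removeAll(stopwords);
            for (String w : new HashSet<String>(doc)) {
                Integer c = df.get(w);
                df.put(w, c == null ? 1 : c + 1);
            }
        }
        // Pass 2: drop rare words (document frequency below minDf).
        int minDf = 2; // assumed threshold for this sketch
        for (List<String> doc : docs) {
            Iterator<String> it = doc.iterator();
            while (it.hasNext())
                if (df.get(it.next()) < minDf) it.remove();
            System.out.println(doc);
        }
    }
}
```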

(3) The input text for the KGE-LDA model should look like data//corpus_20ng.txt (please decompress data.zip first). Each line represents a document, and each number on a line is the index of a word in the vocabulary.

(4) The vocabulary should look like data//vocab_20ng.txt (please decompress data.zip first). Each line is a word: the first word has index 0 in data//corpus_20ng.txt, the second word index 1, the third word index 2, and so on. A small loader sketch for both files is given below.
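
A minimal sketch of reading these two files, assuming the word indices on each corpus line are whitespace-separated (the vocabulary format is one word per line, as stated above):

```java
import java.io.*;
import java.util.*;

// Loads the vocabulary and decodes the first corpus line back into words.
// Assumes whitespace-separated indices per corpus line.
public class CorpusReaderSketch {
    public static void main(String[] args) throws IOException {
        // Vocabulary: line i holds the word whose index is i in the corpus file.
        List<String> vocab = new ArrayList<String>();
        BufferedReader vr = new BufferedReader(new FileReader("data//vocab_20ng.txt"));
        for (String line; (line = vr.readLine()) != null; )
            vocab.add(line.trim());
        vr.close();

        // Corpus: one document per line, each token an index into the vocabulary.
        BufferedReader cr = new BufferedReader(new FileReader("data//corpus_20ng.txt"));
        String first = cr.readLine();
        cr.close();
        StringBuilder words = new StringBuilder();
        for (String tok : first.trim().split("\\s+"))
            words.append(vocab.get(Integer.parseInt(tok))).append(' ');
        System.out.println("First document: " + words);
    }
}
```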

(5) The WordNet entities (linked via NLTK) for each document are in /KGE-LDA/file/20ng_wordnet/, /KGE-LDA/file/nips_wordnet/ and /KGE-LDA/file/ohsumed_wordnet/. Their ids are in /KGE-LDA/file/xxx_wordnet_id/. Each file name is the index of the corresponding document.

(6) file/runnltk.py is an example of entity linking for the Ohsumed dataset.

(7) To tokenize your own documents, you also need to download the Stanford CoreNLP model file (http://pan.baidu.com/s/1bpDqa7d) and add it to the classpath. A minimal tokenization sketch follows.
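
A minimal tokenization sketch using the classic Stanford CoreNLP pipeline API; the annotator set actually used in this repository may differ:

```java
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Tokenizes a single string with Stanford CoreNLP.
// Requires the CoreNLP jar plus the model file mentioned above on the classpath.
public class TokenizeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(
                "KGE-LDA incorporates knowledge graph embeddings into topic modeling.");
        pipeline.annotate(doc);
        for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class))
            System.out.println(token.word());
    }
}
```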

(8) The unique entity ids for 20NG are in /KGE-LDA/knowledge/WN18/entity_appear.txt, for NIPS in /KGE-LDA/knowledge/WN18/entity_appear_nips.txt, and for Ohsumed in /KGE-LDA/knowledge/WN18/entity_appear_ohsumed.txt.

(9) I used this implementation of TransE to obtain entity embeddings: https://github.com/thunlp/KB2E. See that project's README for details on how to prepare knowledge graphs and obtain embeddings. The sketch below illustrates TransE's scoring intuition.
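
For intuition: TransE embeds entities and relations as vectors and scores a triple (h, r, t) by how close the translation h + r lands to t, i.e. ||h + r - t|| should be small for true triples. A toy scoring sketch (illustrative only; the actual training is done with the KB2E code above):

```java
// Illustrative TransE scoring: a triple (h, r, t) is plausible when
// the translation h + r lands close to t, i.e. ||h + r - t|| is small.
public class TransEScoreSketch {
    // L1 distance ||h + r - t||_1
    static double score(double[] h, double[] r, double[] t) {
        double s = 0.0;
        for (int i = 0; i < h.length; i++)
            s += Math.abs(h[i] + r[i] - t[i]);
        return s;
    }

    public static void main(String[] args) {
        // Toy 3-dimensional vectors; real embeddings here are 50-dimensional.
        double[] h = {0.1, 0.2, 0.3};
        double[] r = {0.4, 0.1, 0.0};
        double[] t = {0.5, 0.3, 0.3};
        System.out.println("score = " + score(h, r, t)); // 0.0: a "perfect" triple
    }
}
```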

(10) The 50-dimensional entity embeddings for 20NG are in /KGE-LDA/knowledge/WN18/entity2vec_appear.bern, for NIPS in /KGE-LDA/knowledge/WN18/entity2vec_appear_nips.bern, and for Ohsumed in /KGE-LDA/knowledge/WN18/entity2vec_appear_ohsumed.bern. To get all 50-dimensional entity embeddings in WN18, see /KGE-LDA/knowledge/WN18/entity2vec.bern, /KGE-LDA/knowledge/WN18/entity2id.txt, and /KGE-LDA/knowledge/WN18/num_synset.txt. A hedged loader sketch follows.
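
A loader sketch under the assumption that each line of a .bern file holds one whitespace-separated 50-dimensional vector, with line i matching line i of the corresponding entity id file; verify this against the actual files after decompressing:

```java
import java.io.*;
import java.util.*;

// Loads entity embeddings into memory, one vector per line.
// The line-per-vector format and the pairing with the entity id file
// are assumptions for illustration -- check the decompressed files.
public class EmbeddingLoaderSketch {
    public static void main(String[] args) throws IOException {
        List<double[]> vecs = new ArrayList<double[]>();
        BufferedReader r = new BufferedReader(
                new FileReader("knowledge/WN18/entity2vec_appear.bern"));
        for (String line; (line = r.readLine()) != null; ) {
            String[] parts = line.trim().split("\\s+");
            double[] v = new double[parts.length];
            for (int i = 0; i < parts.length; i++)
                v[i] = Double.parseDouble(parts[i]);
            vecs.add(v);
        }
        r.close();
        System.out.println(vecs.size() + " embeddings of dimension "
                + (vecs.isEmpty() ? 0 : vecs.get(0).length));
    }
}
```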
