
kafkasl / contextualLSTM

License: Apache-2.0
Contextual LSTM for NLP tasks like word prediction and word embedding creation for Deep Learning

Programming Languages

python
shell

Projects that are alternatives of or similar to contextualLSTM

JoSH
[KDD 2020] Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
Stars: ✭ 55 (+96.43%)
Mutual labels:  word-embeddings, topic-modeling
NTUA-slp-nlp
πŸ’»Speech and Natural Language Processing (SLP & NLP) Lab Assignments for ECE NTUA
Stars: ✭ 19 (-32.14%)
Mutual labels:  word-embeddings, lstm-neural-networks
Sarcasm Detection
Detecting Sarcasm on Twitter using both traditional machine learning and deep learning techniques.
Stars: ✭ 73 (+160.71%)
Mutual labels:  topic-modeling, lstm-neural-networks
yelp comments classification nlp
Yelp round-10 review comments classification using deep learning (LSTM and CNN) and natural language processing.
Stars: ✭ 72 (+157.14%)
Mutual labels:  word-embeddings, lstm-neural-networks
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+6050%)
Mutual labels:  word-embeddings, topic-modeling
Lftm
Improving topic models LDA and DMM (one-topic-per-document model for short texts) with word embeddings (TACL 2015)
Stars: ✭ 168 (+500%)
Mutual labels:  word-embeddings, topic-modeling
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-3.57%)
Mutual labels:  word-embeddings, topic-modeling
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+2453.57%)
Mutual labels:  word-embeddings, topic-modeling
Top2vec
Top2Vec learns jointly embedded topic, document and word vectors.
Stars: ✭ 972 (+3371.43%)
Mutual labels:  word-embeddings, topic-modeling
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+45482.14%)
Mutual labels:  word-embeddings, topic-modeling
Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (+621.43%)
Mutual labels:  word-embeddings, lstm-neural-networks
S-WMD
Code for Supervised Word Mover's Distance (SWMD)
Stars: ✭ 90 (+221.43%)
Mutual labels:  word-embeddings
PersianNER
Named-Entity Recognition in Persian Language
Stars: ✭ 48 (+71.43%)
Mutual labels:  word-embeddings
Concept
Concept Modeling: Topic Modeling on Images and Text
Stars: ✭ 119 (+325%)
Mutual labels:  topic-modeling
fuzzymax
Code for the paper: Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors, ICLR 2019.
Stars: ✭ 43 (+53.57%)
Mutual labels:  word-embeddings
Word-Level-Eng-Mar-NMT
Translating English sentences to Marathi using Neural Machine Translation
Stars: ✭ 37 (+32.14%)
Mutual labels:  lstm-neural-networks
hf-experiments
Experiments with Hugging Face πŸ”¬ πŸ€—
Stars: ✭ 37 (+32.14%)
Mutual labels:  topic-modeling
tesla-stocks-prediction
The implementation of LSTM in TensorFlow used for the stock prediction.
Stars: ✭ 51 (+82.14%)
Mutual labels:  lstm-neural-networks
topic modelling financial news
Topic modelling on financial news with Natural Language Processing
Stars: ✭ 51 (+82.14%)
Mutual labels:  topic-modeling
keras-aquarium
a small collection of models implemented in keras, including matrix factorization (recommendation system), topic modeling, text classification, etc. Runs on tensorflow.
Stars: ✭ 14 (-50%)
Mutual labels:  topic-modeling

contextualLSTM

Contextual LSTM for NLP tasks like word prediction

This repo's goal is to implement the Contextual LSTM model for word prediction as described by [Ghosh, S., Vinyals, O., Strope, B., Roy, S., Dean, T., & Heck, L. (n.d.). Contextual LSTM (CLSTM) models for Large scale NLP tasks. https://doi.org/10.1145/12351]

Note: there are scripts to run the pipelines. However, the project needs a bit of cleanup. If anyone is interested in using it, please write to me or open an issue and I'll fix or help with any error you encounter.

Data preprocessing and embeddings

Further details about Wikipedia data preprocessing can be found at

./documentation/word_embeddings_and_topic_detection.pdf
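For orientation, training word embeddings for this pipeline with gensim's Word2Vec might look like the following minimal sketch; the corpus, filename, and parameters here are illustrative assumptions, not the project's actual settings.

# Minimal, illustrative sketch of training word embeddings with gensim;
# the repo's preprocessing scripts handle this step, so the corpus and
# parameters below are assumptions rather than the project's defaults.
from gensim.models import Word2Vec

# One tokenized sentence per list; in practice these come from the
# extracted and cleaned Wikipedia dump.
sentences = [["contextual", "lstm", "models"], ["topic", "detection"]]

model = Word2Vec(
    sentences,
    vector_size=500,  # embedding size ("size" in gensim < 4.0)
    window=5,
    min_count=1,
    workers=4,
)
model.wv.save("embeddings_500.kv")  # hypothetical output filename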

Context creation with topic detection

Further details on the different gensim topic detection methods, as well as the embedding arithmetic used for context creation, can be found at

./documentation/word_embeddings_and_topic_detection_II.pdf
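As a rough sketch of what that document covers, topic detection with gensim's LDA plus a simple embedding-averaging context vector could look like the snippet below; the parameters and the averaging scheme are assumptions, and the PDF describes the methods actually compared.

# Illustrative sketch: detect a document's topics with gensim LDA, then
# build a context vector by averaging word embeddings. Parameters and
# the averaging scheme are assumptions, not the project's exact method.
import numpy as np
from gensim import corpora, models

docs = [["contextual", "lstm", "models"], ["topic", "detection", "methods"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)
print(lda.get_document_topics(corpus[0]))  # [(topic_id, probability), ...]

def context_vector(tokens, embeddings):
    """Average the embeddings of the tokens present in the vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else None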

Execution

Download a Wikipedia dump, for example:

https://dumps.wikimedia.org/enwiki/20180420/enwiki-20180420-pages-articles.xml.bz2

After that, use wiki_extractor to process it:

./wiki_extractor_launch.sh path_to_wikipedia_dump

where path_to_wikipedia_dump is the file you downloaded (e.g. enwiki-20180420-pages-articles.xml.bz2).

To run the whole pipeline use the script:

./run_pipeline.sh ../data/enwiki 500

The preprocessing step is invoked as `./preprocess.sh ../data/enwiki 500 2`, where:

  • ../data/enwiki is the default path where the preprocessing script places the extracted and cleaned Wikipedia dump.
  • 500 is the desired embedding size.

To run only the pipeline, using pre-trained embeddings of size 1000, run:

./run_short_pipeline.sh ../data/ 1000

You can download the required trained embeddings from here:

https://www.dropbox.com/s/ws6d8l6h6jp3ldc/embeddings.tar.gz?dl=0

You should place them inside the models/ folder.
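The serialization format of the archive is not stated here; assuming the extracted files are gensim KeyedVectors, loading them could look like this hypothetical snippet:

# Hypothetical loading snippet; assumes the extracted archive contains
# gensim KeyedVectors, which may not match the actual format inside
# embeddings.tar.gz.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("models/embeddings_1000.kv")  # hypothetical filename
print(vectors["word"].shape)  # (1000,) for size-1000 embeddings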

LSTM

Basic LSTM implementation with TF at ./src/lstm.py
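For reference, a baseline LSTM word-prediction model in TensorFlow's Keras API might be sketched as follows; the dimensions are placeholders, and the implementation in ./src/lstm.py targets an older TF API and differs in detail.

# Minimal baseline LSTM word-prediction model (tf.keras sketch);
# vocabulary/embedding sizes are placeholders, not the repo's settings.
import tensorflow as tf

vocab_size, embed_dim, hidden_dim, seq_len = 10000, 500, 512, 20

inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)
x = tf.keras.layers.LSTM(hidden_dim, return_sequences=True)(x)
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")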

CLSTM

Contextual LSTM implementation with TF at ./src/clstm.py

Although functional, this version is still too slow to be practical for training. If you want to collaborate or have any questions about it, feel free to contact me; I plan to finish it shortly and upload a detailed description.
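For illustration, the contextual modification described in the paper, conditioning the LSTM by concatenating a topic/context vector with each word embedding, might be sketched like this; it is a minimal sketch with placeholder dimensions, not the code in ./src/clstm.py.

# Sketch of the contextual modification: concatenate a per-document
# topic/context vector with each word embedding before the LSTM.
# Dimensions are placeholders; the repo's clstm.py differs in detail.
import tensorflow as tf

vocab_size, embed_dim, topic_dim, hidden_dim, seq_len = 10000, 500, 100, 512, 20

words = tf.keras.Input(shape=(seq_len,), dtype="int32")
context = tf.keras.Input(shape=(topic_dim,))  # e.g. an LDA topic vector

x = tf.keras.layers.Embedding(vocab_size, embed_dim)(words)
# Repeat the context vector across time steps and append it to each embedding.
c = tf.keras.layers.RepeatVector(seq_len)(context)
x = tf.keras.layers.Concatenate(axis=-1)([x, c])

x = tf.keras.layers.LSTM(hidden_dim, return_sequences=True)(x)
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)

model = tf.keras.Model([words, context], outputs)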

Execution

Most files have their own execution script under the /bin folder. All scripts named submit_XXX.sh are designed to be run on a supercomputer with the Slurm queue system. To run them locally, just issue the Python commands with the correct paths.

Note: due to the use of many different packages, not all files run with the same Python version (some require 2.7, others 3.5.2, and the rest 3.6). I expect to unify them (or clearly state the required version) soon.
