
harkous / Embeddingsviz

License: MIT
Visualize the word embeddings of a vocabulary in TensorBoard, including their neighbors

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Embeddingsviz

Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+3385%)
Mutual labels:  embeddings, word-embeddings, fasttext, glove
datastories-semeval2017-task6
Deep-learning model presented in "DataStories at SemEval-2017 Task 6: Siamese LSTM with Attention for Humorous Text Comparison".
Stars: ✭ 20 (-50%)
Mutual labels:  word-embeddings, embeddings, glove
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (+147.5%)
Mutual labels:  word-embeddings, glove, fasttext
Finalfusion Rust
finalfusion embeddings in Rust
Stars: ✭ 35 (-12.5%)
Mutual labels:  embeddings, fasttext, glove
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+910%)
Mutual labels:  embeddings, fasttext, glove
Fastrtext
R wrapper for fastText
Stars: ✭ 103 (+157.5%)
Mutual labels:  embeddings, word-embeddings, fasttext
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (+277.5%)
Mutual labels:  embeddings, fasttext, glove
Datastories Semeval2017 Task4
Deep-learning model presented in "DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis".
Stars: ✭ 184 (+360%)
Mutual labels:  embeddings, word-embeddings, glove
Keras Textclassification
Chinese text classification with Keras NLP: long-text classification, short-sentence classification, multi-label classification, and sentence-pair similarity; base classes for building character/word/sentence embedding layers and network graphs; includes FastText, TextCNN, CharCNN, TextRNN, RCNN, DCNN, DPCNN, VDCNN, CRNN, BERT, XLNet, ALBERT, Attention, DeepMoji, HAN, CapsuleNet, Transformer encoder, Seq2seq, SWEM, LEAM, and TextGCN
Stars: ✭ 914 (+2185%)
Mutual labels:  embeddings, fasttext
navec
Compact high quality word embeddings for Russian language
Stars: ✭ 118 (+195%)
Mutual labels:  embeddings, glove
word2vec-tsne
Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.
Stars: ✭ 59 (+47.5%)
Mutual labels:  word-embeddings, embeddings
compress-fasttext
Tools for shrinking fastText models (in gensim format)
Stars: ✭ 124 (+210%)
Mutual labels:  word-embeddings, fasttext
NLP-paper
🎨 NLP (natural language processing) tutorial: https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-42.5%)
Mutual labels:  glove, fasttext
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-32.5%)
Mutual labels:  word-embeddings, embeddings
sentiment-analysis-of-tweets-in-russian
Sentiment analysis of tweets in Russian using Convolutional Neural Networks (CNN) with Word2Vec embeddings.
Stars: ✭ 51 (+27.5%)
Mutual labels:  word-embeddings, embeddings
SentimentAnalysis
Sentiment Analysis: Deep Bi-LSTM+attention model
Stars: ✭ 32 (-20%)
Mutual labels:  word-embeddings, embeddings
PersianNER
Named-Entity Recognition in Persian Language
Stars: ✭ 48 (+20%)
Mutual labels:  word-embeddings, embeddings
Biosentvec
BioWordVec & BioSentVec: pre-trained embeddings for biomedical words and sentences
Stars: ✭ 308 (+670%)
Mutual labels:  word-embeddings, fasttext
Persian-Sentiment-Analyzer
Persian sentiment analysis ( آناکاوی سهش های فارسی | تحلیل احساسات فارسی )
Stars: ✭ 30 (-25%)
Mutual labels:  embeddings, fasttext
Wego
Word Embeddings (e.g. Word2Vec) in Go!
Stars: ✭ 336 (+740%)
Mutual labels:  word-embeddings, glove

Embeddings Visualizer in TensorBoard

Problem

Suppose you have a large word embeddings file at hand (e.g., GloVe) and you want to visualize these embeddings in TensorBoard. The problem is that TensorBoard becomes very slow at this task once the total number of words exceeds tens of thousands, especially since it performs the computations in the browser. Hence, the way to go is to limit your vocabulary to a subset of words that are of interest to you and to visualize only those words and their neighbors. This repository automates that task: you provide a set of vocabulary terms of interest in addition to your embeddings, and you can then visualize these words and their neighbors within TensorBoard.

The repository uses the Faiss library from Facebook in addition to the latest TensorFlow from Google. It supports including multiple embeddings in the same TensorBoard session.

It has been tested with TensorFlow 1.2.1 under Python 2.7 (installing Faiss is more straightforward with Python 2.7).

Prerequisites Setup

  1. Install Faiss, Facebook's library for efficient similarity search, by following their guide

    • For example, on Ubuntu 14 (CPU installation), I followed the steps below:
    # Clone faiss
    git clone https://github.com/facebookresearch/faiss.git
    cd faiss
    # copy the example makefile
    cp example_makefiles/makefile.inc.Linux ./makefile.inc
    # Uncomment the part for your system in makefile.inc and run the commands it lists.
    # E.g., for Ubuntu 14, I ran `sudo apt-get install libopenblas-dev liblapack3 python-numpy python-dev`
    # and uncommented the line starting with BLASLDFLAGS
    vi ./makefile.inc
    # for the CPU installation:
    make tests/test_blas
    make
    make py
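
    • Optionally, verify that the Python bindings work with a quick check (a minimal sketch, not part of this repository; the dimension and the data below are arbitrary):

    # sanity check for the faiss Python bindings
    import numpy as np
    import faiss

    d = 64                                              # arbitrary vector dimension
    xb = np.random.random((1000, d)).astype('float32')  # fake database vectors
    xq = np.random.random((5, d)).astype('float32')     # fake query vectors

    index = faiss.IndexFlatL2(d)                        # exact L2 index
    index.add(xb)                                       # index the database vectors
    distances, ids = index.search(xq, 4)                # 4 nearest neighbors per query
    print(ids)                                          # row i holds the neighbor ids of query i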
    
  2. Create a Python virtual environment so that the project prerequisites are installed there without affecting the rest of your Python environment. I executed the commands below. You might first need to install virtualenv via sudo apt-get install python-pip python-dev python-virtualenv. If you use Anaconda, you can perform the corresponding steps there.

    virtualenv --system-site-packages venv_dir
    source venv_dir/bin/activate
    
  3. Add Faiss to the Python path so that it can be imported, e.g., if the Faiss directory is FAISS_DIRECTORY, you can issue:

    export PYTHONPATH=FAISS_DIRECTORY:$PYTHONPATH
    
  4. Install the rest of the dependencies (basically TensorFlow and NumPy):

    pip install --upgrade pip
    pip install -r requirements.txt
    

Running the Code

  1. The first step is to obtain the embeddings of the vocabulary words and of their neighbors. For that, we run:

    cd embeddingsviz
    python embeddings_knn.py -e ORIGINAL_EMBEDDINGS_FILE -v VOCAB_TXT_FILE -o OUTPUT_EMBEDDINGS_FILE -k NUM_NEIGHBORS
    # e.g.: python embeddings_knn.py -e ~/data/fasttext.vec -v ./vocab_file.txt -o ./fasttext_subset_1.vec -k 100
    

    The ORIGINAL_EMBEDDINGS_FILE is assumed to be in the following format: the first line is a header giving the vocabulary size and the embedding dimension. This is the format used by fastText.

    VOCAB_SIZE EMBEDDING_DIMENSIONS
    word_1 vec_1
    word_2 vec_2
    

    However, the code also works with formats that do not have a header (e.g., the default GloVe format).

    This step has to be executed once for each embeddings file you want to visualize. The VOCAB_TXT_FILE contains one word per line. NUM_NEIGHBORS should be chosen so that the total number of vocabulary words and neighbors stays reasonably small (e.g., they should add up to roughly 10,000 words).
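
    For reference, the core of this step looks roughly like the sketch below (a simplified illustration, not the actual embeddings_knn.py; the file names match the example above and error handling is omitted): load the embeddings, build a Faiss index, query the vocabulary words, and write out the union of the vocabulary and its neighbors.

    # simplified sketch of the vocab + neighbors extraction (illustration only)
    import numpy as np
    import faiss

    def load_embeddings(path):
        """Read a .vec embeddings file into (words, matrix); skips a fastText-style header."""
        words, vectors = [], []
        with open(path) as f:
            for line in f:
                parts = line.rstrip().split(' ')
                if len(parts) == 2:                       # header line: vocab size and dimension
                    continue
                words.append(parts[0])
                vectors.append([float(x) for x in parts[1:]])
        return words, np.asarray(vectors, dtype='float32')

    words, matrix = load_embeddings('fasttext.vec')
    vocab = [w.strip() for w in open('vocab_file.txt')]   # one word per line

    index = faiss.IndexFlatL2(matrix.shape[1])            # exact L2 search over all embeddings
    index.add(matrix)

    word_to_id = {w: i for i, w in enumerate(words)}
    queries = matrix[[word_to_id[w] for w in vocab if w in word_to_id]]
    _, neighbor_ids = index.search(queries, 100)          # k = 100 neighbors per vocab word

    keep = sorted(set(neighbor_ids.ravel()))              # each vocab word is its own nearest neighbor
    with open('fasttext_subset_1.vec', 'w') as out:
        out.write('%d %d\n' % (len(keep), matrix.shape[1]))
        for i in keep:
            out.write('%s %s\n' % (words[i], ' '.join('%g' % x for x in matrix[i])))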

  2. The second step is to convert the resulting embeddings of your vocab and their neighbors into a format that TensorBoard understands and place them in the log directory:

    python embeddings_formatter.py -l LOGS_DIRECTORY  -f EMBEDDINGS_FILE_1  EMBEDDINGS_FILE_2  -n NAME_1 NAME_2
    # e.g.: python embeddings_formatter.py -l logs  -f ./fasttext_subset_1.vec ./fasttext_subset_2.vec -n subset_1 subset_2
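
    Under the hood, "a format that TensorBoard understands" is a TensorFlow checkpoint plus a projector configuration. The sketch below illustrates the general idea (a minimal illustration using the TensorFlow 1.x projector API with a toy embedding matrix, not the actual embeddings_formatter.py):

    # minimal sketch: store one embedding matrix in TensorBoard projector format
    import os
    import numpy as np
    import tensorflow as tf
    from tensorflow.contrib.tensorboard.plugins import projector

    log_dir = 'logs'                                    # directory later passed to --logdir
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)

    words = ['king', 'queen', 'man']                    # toy labels, one per embedding row
    matrix = np.random.rand(len(words), 50).astype(np.float32)

    embedding_var = tf.Variable(matrix, name='subset_1')
    with open(os.path.join(log_dir, 'metadata.tsv'), 'w') as f:
        f.write('\n'.join(words))                       # labels shown in the projector

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver([embedding_var])
        saver.save(sess, os.path.join(log_dir, 'model.ckpt'))

    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()
    embedding.tensor_name = embedding_var.name
    embedding.metadata_path = 'metadata.tsv'
    projector.visualize_embeddings(tf.summary.FileWriter(log_dir), config)

    Repeating this with a different variable name and metadata file adds another embedding to the same TensorBoard session.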
    
  3. The final step is to run TensorBoard, pointing it to this directory:

    tensorboard --logdir=logs --port=6006
    
  4. Now you can point your browser to the embeddings visualization, e.g., http://server_address:6006/#embeddings. You will see an interface like the following screenshot: [Screenshot]

Developer

Hamza Harkous

License

MIT
