
cmasch / word-embeddings-from-scratch

Licence: other
Creating word embeddings from scratch and visualizing them on TensorBoard. Using trained embeddings in Keras.

Programming Languages

Jupyter Notebook

Projects that are alternatives to or similar to word-embeddings-from-scratch

Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (+477.27%)
Mutual labels:  word2vec, gensim, tensorboard
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+1736.36%)
Mutual labels:  word2vec, embeddings, gensim
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+6236.36%)
Mutual labels:  word2vec, embeddings, gensim
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (+586.36%)
Mutual labels:  word2vec, embeddings
Turkish Word2vec
Pre-trained Word2Vec Model for Turkish
Stars: ✭ 136 (+518.18%)
Mutual labels:  word2vec, gensim
Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (Fasttext, Word2Vec)
Stars: ✭ 146 (+563.64%)
Mutual labels:  word2vec, gensim
Awesome Embedding Models
A curated list of awesome embedding models tutorials, projects and communities.
Stars: ✭ 1,486 (+6654.55%)
Mutual labels:  word2vec, embeddings
Entity2rec
entity2rec generates item recommendation using property-specific knowledge graph embeddings
Stars: ✭ 159 (+622.73%)
Mutual labels:  word2vec, embeddings
Webvectors
Web-ify your word2vec: framework to serve distributional semantic models online
Stars: ✭ 154 (+600%)
Mutual labels:  word2vec, gensim
Log Anomaly Detector
Log Anomaly Detection - Machine learning to detect abnormal events logs
Stars: ✭ 169 (+668.18%)
Mutual labels:  word2vec, gensim
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+790.91%)
Mutual labels:  word2vec, gensim
Role2vec
A scalable Gensim implementation of "Learning Role-based Graph Embeddings" (IJCAI 2018).
Stars: ✭ 134 (+509.09%)
Mutual labels:  word2vec, gensim
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (+759.09%)
Mutual labels:  word2vec, gensim
Sensegram
Making sense embedding out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (+850%)
Mutual labels:  word2vec, embeddings
TF2DeepFloorplan
TF2 Deep FloorPlan Recognition using a Multi-task Network with Room-boundary-Guided Attention. Enable tensorboard, quantization, flask, tflite, docker, github actions and google colab.
Stars: ✭ 98 (+345.45%)
Mutual labels:  tensorboard, tensorflow2
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (+431.82%)
Mutual labels:  word2vec, embeddings
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+57913.64%)
Mutual labels:  word2vec, gensim
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (+313.64%)
Mutual labels:  word2vec, embeddings
Cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Stars: ✭ 224 (+918.18%)
Mutual labels:  word2vec, embeddings
Splitter
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).
Stars: ✭ 177 (+704.55%)
Mutual labels:  word2vec, gensim

Word embeddings from scratch and visualization

If you are working with documents, one approach is to create word embeddings that allow you to represent words with similar meaning.

*** UPDATE *** - February 18th, 2020

Updated the code to work with TensorFlow 2. A fix for the deprecation warning is coming soon.

In this Jupyter notebook I would like to show how you can create embeddings from scratch using gensim and visualize them on TensorBoard in a simple way.
Some time ago I tried gensim's built-in method word2vec2tensor to use TensorBoard, but without success. Therefore I implemented this version in combination with TensorFlow.
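
Conceptually, the workflow boils down to two steps: train a Word2Vec model with gensim, then export its vectors and vocabulary for TensorBoard's embedding projector. Below is a minimal sketch of that workflow under TensorFlow 2; the corpus `sentences`, the hyperparameters and the variable names are illustrative, not the notebook's exact code:

import os
import gensim
import tensorflow as tf
from tensorboard.plugins import projector

# `sentences` is assumed: a list of tokenized documents, e.g. [['great', 'food'], ...]
# Train word2vec from scratch (`size` is the gensim < 4.0 keyword; 4.x calls it `vector_size`).
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5)

log_dir = 'emb_yelp'
os.makedirs(log_dir, exist_ok=True)

# Write one vocabulary entry per line as projector metadata.
with open(os.path.join(log_dir, 'metadata.tsv'), 'w', encoding='utf-8') as f:
    for word in model.wv.index2word:  # gensim < 4.0; `index_to_key` in 4.x
        f.write(word + '\n')

# Save the embedding matrix as a checkpointed TF2 variable.
weights = tf.Variable(model.wv.vectors)
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(log_dir, 'embedding.ckpt'))

# Point the projector plugin at the saved variable and the metadata file.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = 'embedding/.ATTRIBUTES/VARIABLE_VALUE'
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)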

For this example I used a subset of 200,000 documents from the Yelp dataset. This is a great dataset that includes different languages, but mostly English reviews.

As you can see in my animation, it learns the representation of similar words from scratch. German and other languages are also included!

You can improve the results by tuning some word2vec parameters, using t-SNE, or modifying the preprocessing.

Usage

Because of the dataset license I can't publish my training data or the trained embeddings. Feel free to use the notebook with your own dataset or request the data from Yelp. Just put your text files in the directory defined by TEXT_DIR; everything will be saved in the folder defined by MODEL_PATH. A minimal sketch of reading the files is shown below.
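
How the files are read depends on your data; as a minimal sketch, assuming one plain-text review per file in TEXT_DIR (the path is a placeholder):

import os

TEXT_DIR = 'data/yelp'  # directory containing your text files

# Collect the raw documents, one file per review assumed here.
texts = []
for filename in os.listdir(TEXT_DIR):
    with open(os.path.join(TEXT_DIR, filename), encoding='utf-8') as f:
        texts.append(f.read())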

Finally start TensorBoard:

tensorboard --logdir emb_yelp/

Using trained embeddings in Keras

If you would like to use your own trained embeddings in a neural network, you can load the trained weights (vectors) into an embedding layer (e.g. in Keras). This is really useful, especially if you have just a few samples to train your network on. Another reason is that existing pre-trained models like Google word2vec or GloVe may not be sufficient, because they are not task-specific embeddings.

If you need an example of how to use trained gensim embeddings in Keras, take a look at the code snippet below. It is similar to this Jupyter notebook, where I used GloVe, but loading gensim weights is a bit different.
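
The snippet assumes a Keras Tokenizer fitted on your corpus plus a few constants; a possible setup could look like this (the concrete values are placeholders):

from tensorflow.keras.preprocessing.text import Tokenizer

MAX_NUM_WORDS = 20000   # vocabulary cap; indices >= this are skipped below
EMBEDDING_DIM = 100     # must match the vector size of the gensim model
MAX_SEQ_LENGTH = 200    # padded input length for the network

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)  # `texts`: the training documents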

import gensim
import numpy as np

def get_embedding_weights(gensim_model, tokenizer, max_num_words, embedding_dim):
    # Load the trained word2vec model from disk.
    model = gensim.models.Word2Vec.load(gensim_model)
    # Row 0 stays zero: Keras' Tokenizer indexes words starting at 1.
    embedding_matrix = np.zeros((max_num_words, embedding_dim))
    for word, i in tokenizer.word_index.items():
        # gensim < 4.0 API (wv.vocab); gensim 4.x uses wv.key_to_index instead.
        if word in model.wv.vocab and i < max_num_words:
            embedding_vector = model.wv.vectors[model.wv.vocab[word].index]
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

emb_weights = get_embedding_weights(gensim_model='emb_yelp/word2vec',
                                    tokenizer=tokenizer,
                                    max_num_words=MAX_NUM_WORDS,
                                    embedding_dim=EMBEDDING_DIM
                                   )

from tensorflow.keras.layers import Embedding

# Keep the pre-trained vectors frozen; set trainable=True to fine-tune them.
embedding_layer = Embedding(input_dim=MAX_NUM_WORDS,
                            output_dim=EMBEDDING_DIM,
                            input_length=MAX_SEQ_LENGTH,
                            weights=[emb_weights],
                            trainable=False
                           )
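
From there the frozen embedding layer plugs into a network like any other layer. Just to illustrate, a minimal binary classifier on top of it (the architecture is an assumption, not taken from the notebook):

from tensorflow.keras.layers import Dense, GlobalAveragePooling1D, Input
from tensorflow.keras.models import Model

# Word-index sequences go in; the frozen embeddings feed a tiny classifier head.
inputs = Input(shape=(MAX_SEQ_LENGTH,))
x = embedding_layer(inputs)
x = GlobalAveragePooling1D()(x)
outputs = Dense(1, activation='sigmoid')(x)

clf = Model(inputs, outputs)
clf.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])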

References

[1] Vector Representations of Words
[2] Embeddings

Author

Christopher Masch
