
cmasch / word-embeddings-from-scratch

Licence: other
Creating word embeddings from scratch and visualizing them on TensorBoard. Using trained embeddings in Keras.

Programming Languages

Jupyter Notebook

Projects that are alternatives to or similar to word-embeddings-from-scratch

Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (+477.27%)
Mutual labels:  word2vec, gensim, tensorboard
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+1736.36%)
Mutual labels:  word2vec, embeddings, gensim
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+6236.36%)
Mutual labels:  word2vec, embeddings, gensim
Embedding As Service
One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
Stars: ✭ 151 (+586.36%)
Mutual labels:  word2vec, embeddings
Turkish Word2vec
Pre-trained Word2Vec Model for Turkish
Stars: ✭ 136 (+518.18%)
Mutual labels:  word2vec, gensim
Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (Fasttext, Word2Vec)
Stars: ✭ 146 (+563.64%)
Mutual labels:  word2vec, gensim
Awesome Embedding Models
A curated list of awesome embedding models tutorials, projects and communities.
Stars: ✭ 1,486 (+6654.55%)
Mutual labels:  word2vec, embeddings
Entity2rec
entity2rec generates item recommendation using property-specific knowledge graph embeddings
Stars: ✭ 159 (+622.73%)
Mutual labels:  word2vec, embeddings
Webvectors
Web-ify your word2vec: framework to serve distributional semantic models online
Stars: ✭ 154 (+600%)
Mutual labels:  word2vec, gensim
Log Anomaly Detector
Log Anomaly Detection - Machine learning to detect abnormal events logs
Stars: ✭ 169 (+668.18%)
Mutual labels:  word2vec, gensim
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+790.91%)
Mutual labels:  word2vec, gensim
Role2vec
A scalable Gensim implementation of "Learning Role-based Graph Embeddings" (IJCAI 2018).
Stars: ✭ 134 (+509.09%)
Mutual labels:  word2vec, gensim
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (+759.09%)
Mutual labels:  word2vec, gensim
Sensegram
Making sense embedding out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (+850%)
Mutual labels:  word2vec, embeddings
TF2DeepFloorplan
TF2 Deep FloorPlan Recognition using a Multi-task Network with Room-boundary-Guided Attention. Enable tensorboard, quantization, flask, tflite, docker, github actions and google colab.
Stars: ✭ 98 (+345.45%)
Mutual labels:  tensorboard, tensorflow2
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (+431.82%)
Mutual labels:  word2vec, embeddings
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+57913.64%)
Mutual labels:  word2vec, gensim
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (+313.64%)
Mutual labels:  word2vec, embeddings
Cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Stars: ✭ 224 (+918.18%)
Mutual labels:  word2vec, embeddings
Splitter
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).
Stars: ✭ 177 (+704.55%)
Mutual labels:  word2vec, gensim

Word embeddings from scratch and visualization

If you are working with documents, one approach is to create word embeddings that allow you to represent words with similar meaning.

*** UPDATE *** - February 18th, 2020

Updated the code to work with TensorFlow 2. A fix for the deprecation warning is coming soon.

In this Jupyter notebook I would like to show how you can create embeddings from scratch using gensim and visualize them on TensorBoard in a simple way.
Some time ago I tried gensim's built-in method word2vec2tensor to use TensorBoard, but without success. Therefore I implemented this version in combination with TensorFlow.
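
Conceptually, the workflow boils down to two steps: train a Word2Vec model with gensim, then export its vectors and vocabulary for TensorBoard's embedding projector. Below is a minimal sketch of that workflow under TensorFlow 2; the corpus `sentences`, the hyperparameters and the variable names are illustrative, not the notebook's exact code:

import os
import gensim
import tensorflow as tf
from tensorboard.plugins import projector

# `sentences` is assumed: a list of tokenized documents, e.g. [['great', 'food'], ...]
# Train word2vec from scratch (`size` is the gensim < 4.0 keyword; 4.x calls it `vector_size`).
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5)

log_dir = 'emb_yelp'
os.makedirs(log_dir, exist_ok=True)

# Write one vocabulary entry per line as projector metadata.
with open(os.path.join(log_dir, 'metadata.tsv'), 'w', encoding='utf-8') as f:
    for word in model.wv.index2word:  # gensim < 4.0; `index_to_key` in 4.x
        f.write(word + '\n')

# Save the embedding matrix as a checkpointed TF2 variable.
weights = tf.Variable(model.wv.vectors)
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(log_dir, 'embedding.ckpt'))

# Point the projector plugin at the saved variable and the metadata file.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = 'embedding/.ATTRIBUTES/VARIABLE_VALUE'
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)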

For this example I used a subset of 200,000 documents from the Yelp dataset. This is a great dataset that includes different languages, but mostly English reviews.

As you can see in my animation, it learns the representation of similar words from scratch. German and other languages are also included!

You can improve the results by tuning some word2vec parameters, using t-SNE, or modifying the preprocessing.

Usage

Because of the dataset license I can't publish my training data or the trained embeddings. Feel free to use the notebook with your own dataset or request the data from Yelp. Just put your text files in the directory defined by TEXT_DIR; everything will be saved in the folder defined by MODEL_PATH. A minimal sketch of reading the files is shown below.
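
How the files are read depends on your data; as a minimal sketch, assuming one plain-text review per file in TEXT_DIR (the path is a placeholder):

import os

TEXT_DIR = 'data/yelp'  # directory containing your text files

# Collect the raw documents, one file per review assumed here.
texts = []
for filename in os.listdir(TEXT_DIR):
    with open(os.path.join(TEXT_DIR, filename), encoding='utf-8') as f:
        texts.append(f.read())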

Finally start TensorBoard:

tensorboard --logdir emb_yelp/

Using trained embeddings in Keras

If you would like to use your own trained embeddings in a neural network, you can load the trained weights (vectors) into an embedding layer (e.g. in Keras). This is really useful, especially if you have just a few samples to train your network on. Another reason is that existing pre-trained models like Google word2vec or GloVe may not be sufficient, because they are not task-specific embeddings.

If you need an example of how to use trained gensim embeddings in Keras, take a look at the code snippet below. It is similar to this Jupyter notebook, where I used GloVe, but loading gensim weights is a bit different.
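
The snippet assumes a Keras Tokenizer fitted on your corpus plus a few constants; a possible setup could look like this (the concrete values are placeholders):

from tensorflow.keras.preprocessing.text import Tokenizer

MAX_NUM_WORDS = 20000   # vocabulary cap; indices >= this are skipped below
EMBEDDING_DIM = 100     # must match the vector size of the gensim model
MAX_SEQ_LENGTH = 200    # padded input length for the network

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)  # `texts`: the training documents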

import gensim
import numpy as np

def get_embedding_weights(gensim_model, tokenizer, max_num_words, embedding_dim):
    # Load the trained word2vec model from disk.
    model = gensim.models.Word2Vec.load(gensim_model)
    # Row 0 stays zero: Keras' Tokenizer indexes words starting at 1.
    embedding_matrix = np.zeros((max_num_words, embedding_dim))
    for word, i in tokenizer.word_index.items():
        # gensim < 4.0 API (wv.vocab); gensim 4.x uses wv.key_to_index instead.
        if word in model.wv.vocab and i < max_num_words:
            embedding_vector = model.wv.vectors[model.wv.vocab[word].index]
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

emb_weights = get_embedding_weights(gensim_model='emb_yelp/word2vec',
                                    tokenizer=tokenizer,
                                    max_num_words=MAX_NUM_WORDS,
                                    embedding_dim=EMBEDDING_DIM
                                   )

from tensorflow.keras.layers import Embedding

# Keep the pre-trained vectors frozen; set trainable=True to fine-tune them.
embedding_layer = Embedding(input_dim=MAX_NUM_WORDS,
                            output_dim=EMBEDDING_DIM,
                            input_length=MAX_SEQ_LENGTH,
                            weights=[emb_weights],
                            trainable=False
                           )
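
From there the frozen embedding layer plugs into a network like any other layer. Just to illustrate, a minimal binary classifier on top of it (the architecture is an assumption, not taken from the notebook):

from tensorflow.keras.layers import Dense, GlobalAveragePooling1D, Input
from tensorflow.keras.models import Model

# Word-index sequences go in; the frozen embeddings feed a tiny classifier head.
inputs = Input(shape=(MAX_SEQ_LENGTH,))
x = embedding_layer(inputs)
x = GlobalAveragePooling1D()(x)
outputs = Dense(1, activation='sigmoid')(x)

clf = Model(inputs, outputs)
clf.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])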

References

[1] Vector Representations of Words
[2] Embeddings

Author

Christopher Masch
