
tannerbohn / char2vec

Licence: other
Like word2vec, except for letters of the alphabet.


char2vec

This code implements the skip-gram algorithm to find vector representations for the letters of the alphabet rather than for words, as in word2vec. It does this by taking a body of text (stored in /data) and training a shallow neural network to predict the characters c_(n-1) and c_(n+1) given c_n. In this implementation, c_n is represented as a one-hot encoding, mapped to a hidden layer, and then mapped to two output layers (one each for c_(n-1) and c_(n+1)), each trained with a categorical cross-entropy loss.
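To make the architecture concrete, here is a minimal sketch of such a model in Keras (not the repository's exact code; the alphabet size, layer names, and optimizer are illustrative assumptions):

```
from keras.layers import Input, Dense
from keras.models import Model

n_chars = 27       # assumption: 'a'-'z' plus space; depends on the corpus in /data
encoding_dim = 2   # 2-D hidden layer / embedding, as noted under Additional Notes

center = Input(shape=(n_chars,), name="c_n")        # one-hot encoding of c_n
hidden = Dense(encoding_dim, activation="tanh", name="embedding")(center)
prev_out = Dense(n_chars, activation="softmax", name="c_prev")(hidden)  # predicts c_(n-1)
next_out = Dense(n_chars, activation="softmax", name="c_next")(hidden)  # predicts c_(n+1)

model = Model(inputs=center, outputs=[prev_out, next_out])
model.compile(optimizer="adam",
              loss=["categorical_crossentropy", "categorical_crossentropy"])
```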

The effect of this algorithm is that characters which appear in similar contexts end up with similar encodings. For example, vowels often appear in similar contexts, so we would expect them to have similar encodings. Unlike the word2vec case, where it is easy to grasp what king-man+woman = queen means, I find it harder to interpret m-z+t = w.
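For anyone who wants to experiment with such analogies, here is a small sketch of the lookup, assuming `codes` holds the learned code for each character of `alphabet`, in order (both names are illustrative, not from the repository):

```
import numpy as np

def analogy(a, b, c, codes, alphabet):
    """Return the character whose code is nearest to code(a) - code(b) + code(c)."""
    idx = {ch: i for i, ch in enumerate(alphabet)}
    target = codes[idx[a]] - codes[idx[b]] + codes[idx[c]]
    dists = np.linalg.norm(codes - target, axis=1)
    return alphabet[int(np.argmin(dists))]

# analogy('m', 'z', 't', codes, alphabet)
```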

[Figure: example_embeddings, a plot of example character embeddings]

Requirements

This code is written in Python and requires Keras.

Usage

$ python main.py

When the code is run, it converts the entire text file to training data (watch out for RAM usage) and then trains the model. Since the number of classes is small, the network should converge quickly. The encodings for the characters are then generated and plotted.
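A rough sketch of that data-preparation step (illustrative, not the repository's exact code): every character c_n becomes one training example whose targets are its neighbours c_(n-1) and c_(n+1).

```
import numpy as np

def build_training_data(text, alphabet):
    idx = {c: i for i, c in enumerate(alphabet)}
    n_chars = len(alphabet)

    def one_hot(c):
        v = np.zeros(n_chars, dtype=np.float32)
        v[idx[c]] = 1.0
        return v

    X, y_prev, y_next = [], [], []
    for i in range(1, len(text) - 1):
        window = text[i - 1:i + 2]
        if all(c in idx for c in window):   # skip characters outside the alphabet
            X.append(one_hot(text[i]))
            y_prev.append(one_hot(text[i - 1]))
            y_next.append(one_hot(text[i + 1]))
    return np.array(X), np.array(y_prev), np.array(y_next)

# X, y_prev, y_next = build_training_data(corpus, "abcdefghijklmnopqrstuvwxyz ")
# model.fit(X, [y_prev, y_next], epochs=10, batch_size=128)
```

Storing one-hot vectors for every character in the corpus is what drives the RAM usage mentioned above.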

Additional Notes

The hidden layer/encoding is currently 2-D. This makes it easier to visualize without having to use techniques such as PCA or t-SNE.
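Here is a sketch of how the 2-D encodings can be read out and plotted after training (layer and variable names follow the earlier model sketch; they are assumptions, not the repository's exact code):

```
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Model

alphabet = "abcdefghijklmnopqrstuvwxyz "   # assumption: same alphabet used for training
encoder = Model(inputs=model.input, outputs=model.get_layer("embedding").output)
codes = encoder.predict(np.eye(len(alphabet), dtype=np.float32))  # one code per character

plt.scatter(codes[:, 0], codes[:, 1])
for (x, y), c in zip(codes, alphabet):
    plt.annotate(repr(c), (x, y))
plt.show()
```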

The code currently uses a window of width 3 (c_(n-1:n+1)). Several lines are commented out which allow this width to be increased.
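As a sketch, a width-5 window (c_(n-2:n+2)) would simply add two more output heads to the same hidden layer (illustrative only; the repository's commented-out lines may differ):

```
from keras.layers import Input, Dense
from keras.models import Model

n_chars, encoding_dim = 27, 2
center = Input(shape=(n_chars,))
hidden = Dense(encoding_dim, activation="tanh")(center)
# one softmax head per context position: c_(n-2), c_(n-1), c_(n+1), c_(n+2)
outputs = [Dense(n_chars, activation="softmax", name="offset_" + o)(hidden)
           for o in ("m2", "m1", "p1", "p2")]
model5 = Model(center, outputs)
model5.compile(optimizer="adam", loss=["categorical_crossentropy"] * 4)
```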

I have found that the choice of text source can result in slightly different embeddings, though for the same body of text, the embeddings learned across trials are very similar, up to rotation and flipping.

Something fun to try: instead of using a tanh activation in the hidden layer, use softmax with an encoding dimension << #chars -- this should allow you to come up with approximate classifications of the letters of the alphabet. This could also be achieved with clustering and the tanh activation... but this alternative approach seems more fun.
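A sketch of that variant (dimensions are assumptions): replace the tanh hidden layer with a small softmax layer, and after training read each character's approximate class as the argmax of its code.

```
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

n_chars, n_classes = 27, 4          # encoding dimension << number of characters

center = Input(shape=(n_chars,))
hidden = Dense(n_classes, activation="softmax", name="soft_embedding")(center)
prev_out = Dense(n_chars, activation="softmax")(hidden)
next_out = Dense(n_chars, activation="softmax")(hidden)
soft_model = Model(center, [prev_out, next_out])
soft_model.compile(optimizer="adam", loss=["categorical_crossentropy"] * 2)

# after training: each character's approximate class
# codes = Model(center, hidden).predict(np.eye(n_chars, dtype=np.float32))
# classes = codes.argmax(axis=1)
```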
