ThoughtRiver / LMDB Embeddings

Licence: GPL-3.0
Fast word vectors with little memory usage in Python

Programming Languages

Python
139,335 projects - #7 most used programming language

Projects that are alternatives to or similar to LMDB Embeddings

Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+245.05%)
Mutual labels:  word2vec, embeddings, fasttext, gensim, glove
Finalfusion Rust
finalfusion embeddings in Rust
Stars: ✭ 35 (-91.34%)
Mutual labels:  word2vec, embeddings, fasttext, glove
Embedding As Service
One-stop solution for encoding sentences into fixed-length vectors using various embedding techniques
Stars: ✭ 151 (-62.62%)
Mutual labels:  word2vec, embeddings, fasttext, glove
Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (fastText, Word2Vec)
Stars: ✭ 146 (-63.86%)
Mutual labels:  word2vec, fasttext, gensim, glove
lda2vec
Mixing Dirichlet topic models and word embeddings to make lda2vec, from the paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-93.32%)
Mutual labels:  text, word2vec, embeddings
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+3059.16%)
Mutual labels:  word2vec, fasttext, gensim
Persian-Sentiment-Analyzer
Persian sentiment analysis (sentiment analysis for the Persian language)
Stars: ✭ 30 (-92.57%)
Mutual labels:  word2vec, embeddings, fasttext
Cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Stars: ✭ 224 (-44.55%)
Mutual labels:  word2vec, embeddings, fasttext
Ml Ai Experiments
All my experiments with AI and ML
Stars: ✭ 107 (-73.51%)
Mutual labels:  word, embeddings, text
FUTURE
A private, free, open-source search engine built on a P2P network
Stars: ✭ 19 (-95.3%)
Mutual labels:  lmdb, gensim, glove
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (-75.5%)
Mutual labels:  word2vec, glove, fasttext
word-embeddings-from-scratch
Creating word embeddings from scratch and visualizing them in TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (-94.55%)
Mutual labels:  word2vec, embeddings, gensim
navec
Compact, high-quality word embeddings for the Russian language
Stars: ✭ 118 (-70.79%)
Mutual labels:  word2vec, embeddings, glove
Shallowlearn
An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and a nice API. Written in Python and fully compatible with scikit-learn.
Stars: ✭ 196 (-51.49%)
Mutual labels:  word2vec, fasttext, gensim
Nlp Journey
Documents, papers and code related to Natural Language Processing, including Topic Modelling, Word Embeddings, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+219.31%)
Mutual labels:  word2vec, fasttext, gensim
Ngram2vec
Four word embedding models implemented in Python, supporting arbitrary context features
Stars: ✭ 703 (+74.01%)
Mutual labels:  word, word2vec, glove
Embeddingsviz
Visualize word embeddings of a vocabulary in TensorBoard, including the neighbors
Stars: ✭ 40 (-90.1%)
Mutual labels:  embeddings, fasttext, glove
Sensegram
Making sense embeddings out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (-48.27%)
Mutual labels:  word, word2vec, embeddings
NLP-paper
🎨🎨 NLP (natural language processing) tutorial 🎨🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-94.31%)
Mutual labels:  word2vec, glove, fasttext
Fast sentence embeddings
Compute Sentence Embeddings Fast!
Stars: ✭ 384 (-4.95%)
Mutual labels:  embeddings, fasttext, gensim


LMDB Embeddings

Query word vectors (embeddings) very quickly, with minimal query-time overhead and far less memory usage than gensim or other equivalent solutions. This is made possible by LMDB, the Lightning Memory-Mapped Database.

Inspired by Delft. As explained in their README, this approach makes the pre-trained embeddings immediately "warm" (no load time), frees memory, and allows any number of embeddings to be used simultaneously with negligible impact on runtime when using an SSD.

For instance, with the traditional approach, glove-840B takes around two minutes to load and 4 GB of memory. Managed with LMDB, glove-840B can be accessed immediately and takes only a couple of MB of memory, at a negligible cost to runtime (around 1% slower).

Installation

pip install lmdb-embeddings

Reading vectors

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.exceptions import MissingWordError

embeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')

try:
    vector = embeddings.get_word_vector('google')
except MissingWordError:
    # 'google' is not in the database.
    pass
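
As a usage sketch (not part of the library's API): to embed a whole tokenized sentence, you might fall back to a zero vector for out-of-vocabulary tokens. The embed_sentence helper and the 300 dimensions (matching the GoogleNews vectors) below are illustrative assumptions:

import numpy as np

def embed_sentence(tokens, embeddings, dimensions=300):
    # Stack one vector per token, substituting zeros for unknown words.
    vectors = []
    for token in tokens:
        try:
            vectors.append(embeddings.get_word_vector(token))
        except MissingWordError:
            vectors.append(np.zeros(dimensions, dtype=np.float32))
    return np.vstack(vectors)

matrix = embed_sentence(['the', 'google', 'headquarters'], embeddings)  # shape (3, 300)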

Writing vectors

An example of writing an LMDB vector file from a gensim model follows. Any iterator that yields word and vector pairs is supported, so if your vectors are in another format it is just a matter of adapting the iter_embeddings function below (a plain-text variant is sketched after the gensim example).

I will be writing a CLI interface to convert standard formats soon.

from gensim.models.keyedvectors import KeyedVectors
from lmdb_embeddings.writer import LmdbEmbeddingsWriter


GOOGLE_NEWS_PATH = 'GoogleNews-vectors-negative300.bin.gz'
OUTPUT_DATABASE_FOLDER = 'GoogleNews-vectors-negative300'


print('Loading gensim model...')
gensim_model = KeyedVectors.load_word2vec_format(GOOGLE_NEWS_PATH, binary=True)


def iter_embeddings():
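    # Note: gensim >= 4.0 removed KeyedVectors.vocab; iterate over
    # gensim_model.key_to_index there instead.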
    for word in gensim_model.vocab.keys():
        yield word, gensim_model[word]

print('Writing vectors to an LMDB database...')

LmdbEmbeddingsWriter(iter_embeddings()).write(OUTPUT_DATABASE_FOLDER)

# These vectors can now be loaded with the LmdbEmbeddingsReader.
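
If your vectors are instead in a GloVe-style plain-text file (each line a word followed by its space-separated components), the iterator might look like the sketch below. The vectors.txt filename is a hypothetical placeholder, and numpy handles the float conversion:

import numpy as np

def iter_text_embeddings(path='vectors.txt'):
    # Each line: word, then its vector components, space-separated.
    with open(path, encoding='utf-8') as vector_file:
        for line in vector_file:
            word, *components = line.rstrip().split(' ')
            yield word, np.asarray(components, dtype=np.float32)

LmdbEmbeddingsWriter(iter_text_embeddings()).write(OUTPUT_DATABASE_FOLDER)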

LRU Cache

A reader with an LRU (Least Recently Used) cache is included. It keeps the embeddings of the 50,000 most recently queried words and returns the cached object instead of querying the database each time. Its interface is identical to the standard reader's. See functools.lru_cache in the standard library. Because the same object is returned on repeated queries, avoid mutating the vectors it gives you.

from lmdb_embeddings.reader import LruCachedLmdbEmbeddingsReader
from lmdb_embeddings.exceptions import MissingWordError

embeddings = LruCachedLmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')

try:
    vector = embeddings.get_word_vector('google')
except MissingWordError:
    # 'google' is not in the database.
    pass

Customisation

By default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). However, it is easy to use an alternative approach: simply inject your own serializer and unserializer as callables into the LmdbEmbeddingsWriter and LmdbEmbeddingsReader.

A msgpack serializer is included and can be used in the same way.

from lmdb_embeddings.writer import LmdbEmbeddingsWriter
from lmdb_embeddings.serializers import MsgpackSerializer

LmdbEmbeddingsWriter(
    iter_embeddings(),
    serializer=MsgpackSerializer().serialize
).write(OUTPUT_DATABASE_FOLDER)

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.serializers import MsgpackSerializer

reader = LmdbEmbeddingsReader(
    OUTPUT_DATABASE_FOLDER,
    unserializer=MsgpackSerializer().unserialize
)
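
Rolling your own serializer is equally simple. The sketch below stores raw float32 bytes with numpy; it assumes (as the MsgpackSerializer usage above suggests) that the serializer is called with a vector and returns bytes, that the unserializer does the reverse, and that every vector shares a single dtype:

import numpy as np

def numpy_serializer(vector):
    # Vector in, raw float32 bytes out (native byte order).
    return np.asarray(vector, dtype=np.float32).tobytes()

def numpy_unserializer(raw_bytes):
    # Raw bytes in, 1-D float32 vector out.
    return np.frombuffer(raw_bytes, dtype=np.float32)

LmdbEmbeddingsWriter(iter_embeddings(), serializer=numpy_serializer).write(OUTPUT_DATABASE_FOLDER)
reader = LmdbEmbeddingsReader(OUTPUT_DATABASE_FOLDER, unserializer=numpy_unserializer)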

Running tests

pytest

Contributing

Contributions, issues and feature requests are welcome!

Show your support

Give a ⭐️ if this project helped you!

License

Copyright © 2019 ThoughtRiver.
This project is GPL-3.0 licensed.
