
bloomberg / Koan

License: Apache-2.0
A word2vec negative sampling implementation with correct CBOW update.

Programming Languages

cpp

Projects that are alternatives of or similar to Koan

Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (-12.93%)
Mutual labels:  word2vec, word-embeddings
Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (-58.62%)
Mutual labels:  word2vec, word-embeddings
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (-60.78%)
Mutual labels:  word2vec, word-embeddings
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+208.19%)
Mutual labels:  word2vec, word-embeddings
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+642.24%)
Mutual labels:  word2vec, word-embeddings
Word2vec Win32
A word2vec port for Windows.
Stars: ✭ 41 (-82.33%)
Mutual labels:  word2vec, word-embeddings
Shallowlearn
An experiment in re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and a nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (-15.52%)
Mutual labels:  word2vec, word-embeddings
SWDM
SIGIR 2017: Embedding-based query expansion for weighted sequential dependence retrieval model
Stars: ✭ 35 (-84.91%)
Mutual labels:  word2vec, word-embeddings
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim), and evaluate them with generated test sets
Stars: ✭ 189 (-18.53%)
Mutual labels:  word2vec, word-embeddings
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (-49.57%)
Mutual labels:  word2vec, word-embeddings
Deep learning nlp
Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP
Stars: ✭ 407 (+75.43%)
Mutual labels:  word2vec, word-embeddings
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+5401.29%)
Mutual labels:  word2vec, word-embeddings
Wego
Word Embeddings (e.g. Word2Vec) in Go!
Stars: ✭ 336 (+44.83%)
Mutual labels:  word2vec, word-embeddings
Glove As A Tensorflow Embedding Layer
Taking a pretrained GloVe model, and using it as a TensorFlow embedding weight layer **inside the GPU**. Therefore, you only need to send the index of the words through the GPU data transfer bus, reducing data transfer overhead.
Stars: ✭ 85 (-63.36%)
Mutual labels:  word2vec, word-embeddings
Text-Analysis
Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.
Stars: ✭ 48 (-79.31%)
Mutual labels:  word2vec, word-embeddings
Postgres Word2vec
Utils for using word embeddings, such as word2vec vectors, in a Postgres database
Stars: ✭ 96 (-58.62%)
Mutual labels:  word2vec, word-embeddings
codenames
Codenames AI using Word Vectors
Stars: ✭ 41 (-82.33%)
Mutual labels:  word2vec, word-embeddings
word embedding
Sample code for training Word2Vec and FastText on a wiki corpus, plus their pretrained word embeddings.
Stars: ✭ 21 (-90.95%)
Mutual labels:  word2vec, word-embeddings
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+500.86%)
Mutual labels:  word2vec, word-embeddings
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-45.26%)
Mutual labels:  word2vec, word-embeddings

... the Zen attitude is that words and truth are incompatible, or at least that no words can capture truth.

Douglas R. Hofstadter

A word2vec negative sampling implementation with correct CBOW update. kōan only depends on Eigen.

Authors: Ozan Irsoy, Adrian Benton, Karl Stratos

Thanks to Cyril Khazan for helping kōan better scale to many threads.

Menu

- Rationale
- Building
- Installation
- Quick Start
- License
- Benchmarks

Rationale

Although continuous bag of words (CBOW) embeddings can be trained more quickly than skipgram (SG) embeddings, it is a common belief that SG embeddings tend to perform better in practice. This was observed by the original authors of Word2Vec [1] and also in subsequent work [2]. However, we found that popular implementations of word2vec with negative sampling, such as word2vec and gensim, do not implement the CBOW update correctly, thus potentially leading to misconceptions about the performance of CBOW embeddings when trained correctly.
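
To make the discrepancy concrete, below is a minimal sketch of a single CBOW negative-sampling update (assuming Eigen; it is an illustration, not kōan's actual source). Because the hidden vector is the average of the context embeddings, the gradient flowing back to each context vector should be scaled by one over the number of context words; dropping that scaling reproduces the uncorrected behaviour of the word2vec and gensim updates discussed above.

#include <Eigen/Dense>
#include <cmath>
#include <vector>

using Vec = Eigen::VectorXf;

static float sigmoid(float x) { return 1.f / (1.f + std::exp(-x)); }

// One illustrative CBOW update with negative sampling.
// vin: context (input) embeddings, vout: output embeddings, indexed by word id.
void cbow_update(std::vector<Vec>& vin, std::vector<Vec>& vout,
                 const std::vector<int>& context, int target,
                 const std::vector<int>& negatives, float lr) {
  // Hidden vector: average of the context embeddings.
  Vec h = Vec::Zero(vin[0].size());
  for (int c : context) h += vin[c];
  h /= static_cast<float>(context.size());

  Vec grad_h = Vec::Zero(h.size());

  // Positive (target) word, label 1.
  float g = sigmoid(vout[target].dot(h)) - 1.f;
  grad_h += g * vout[target];
  vout[target] -= lr * g * h;

  // Negative samples, label 0.
  for (int n : negatives) {
    float gn = sigmoid(vout[n].dot(h));
    grad_h += gn * vout[n];
    vout[n] -= lr * gn * h;
  }

  // Corrected update: h is an average, so each context vector receives
  // grad_h / |context|. Omitting this division gives the uncorrected update.
  grad_h /= static_cast<float>(context.size());
  for (int c : context) vin[c] -= lr * grad_h;
}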

We release kōan so that others can efficiently train CBOW embeddings using the corrected weight update. See this technical report for benchmarks of kōan vs. gensim word2vec negative sampling implementations. If you use kōan to learn word embeddings for your own work, please cite:

Ozan İrsoy, Adrian Benton, and Karl Stratos. "kōan: A Corrected CBOW Implementation." arXiv preprint arXiv:2012.15332 (2020).

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

[2] Karl Stratos, Michael Collins, and Daniel Hsu. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1282–1291, 2015.

Building

You need a C++17-supporting compiler to build koan (tested with g++ 7.5.0, 8.4.0, 9.3.0, and clang 11.0.3).

To build koan and all tests:

mkdir build
cd build
cmake ..
cmake --build ./

Run tests with (assuming you are still under build):

./test_gradcheck
./test_utils

Installation

Installation is as simple as placing the koan binary on your PATH (you might need sudo):

cmake --install ./

Quick Start

To train word embeddings on Wikitext-2, first clone and build koan:

git clone --recursive git@github.com:bloomberg/koan.git
cd koan
mkdir build
cd build
cmake .. && cmake --build ./
cd ..

Download and unzip the Wikitext-2 corpus:

curl https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip --output wikitext-2-v1.zip
unzip wikitext-2-v1.zip
head -n 5 ./wikitext-2/wiki.train.tokens

And learn CBOW embeddings on the training fold with:

./build/koan -V 2000000 \
             --epochs 10 \
             --dim 300 \
             --negatives 5 \
             --context-size 5 \
             -l 0.075 \
             --threads 16 \
             --cbow true \
             --min-count 2 \
             --file ./wikitext-2/wiki.train.tokens

or skipgram embeddings by running with --cbow false. Run ./build/koan --help for a full list of command-line arguments and descriptions. Learned embeddings will be saved to embeddings_${CURRENT_TIMESTAMP}.txt in the current working directory.
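
As a convenience, here is a hedged sketch of loading those saved embeddings back into memory, assuming a plain-text layout with one token per line followed by its vector components (verify the format your build actually writes before relying on this):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main(int argc, char** argv) {
  if (argc < 2) { std::cerr << "usage: load_embeddings <file>\n"; return 1; }
  std::ifstream in(argv[1]);  // e.g. embeddings_<timestamp>.txt
  std::unordered_map<std::string, std::vector<float>> emb;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream ss(line);
    std::string word;
    if (!(ss >> word)) continue;  // skip blank lines
    std::vector<float> vec;      // remaining fields are the vector components
    for (float x; ss >> x;) vec.push_back(x);
    emb.emplace(std::move(word), std::move(vec));
  }
  std::cout << "loaded " << emb.size() << " vectors\n";
  return 0;
}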

License

Please read the LICENSE file.

Benchmarks

See the technical report for details.
