Bayesian Skip-gram (BSG)

This repository contains Theano code for the Bayesian Skip-gram model, COLING 2018.

[1] Embedding Words as Distributions with a Bayesian Skip-gram Model, Arthur Bražinskas, Serhii Havrylov, Ivan Titov, arXiv

The model represents words as Gaussian distributions instead of point estimates, and is capable of learning additional word properties, such as generality, which is encoded in the variances. The instructions below explain how to install and run the model, and how to evaluate word pairs.
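The toy sketch below (illustrative only, not the repository's code; the words and numbers are made up) shows how such Gaussian representations can encode generality in the variance and entailment direction in the asymmetry of the KL divergence:

import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    # KL(P || Q) for diagonal Gaussians; small when P "fits inside" Q.
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)

# Toy 2-D representations: the more general word gets a broader Gaussian.
mu_animal, var_animal = np.array([0.0, 0.0]), np.array([2.0, 2.0])
mu_dog, var_dog = np.array([0.3, 0.1]), np.array([0.5, 0.5])

# KL(dog || animal) < KL(animal || dog): the asymmetry indicates that
# "dog" is the more specific word, i.e. it entails "animal".
print(kl_diag_gauss(mu_dog, var_dog, mu_animal, var_animal))
print(kl_diag_gauss(mu_animal, var_animal, mu_dog, var_dog))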

Requirements

  • Python 2.7
  • Theano 0.9.0
  • numpy 1.14.2
  • nltk 3.2.2
  • scipy 0.18.1
  • Lasagne 0.2.dev1

Installation

First, install the required Python modules, such as Theano and NLTK.

pip install -r requirements.txt

Afterwards, install the necessary NLTK sub-packages.

python -m nltk.downloader wordnet

python -m nltk.downloader punkt

Running the model

In order to run the model, please refer to the run_bsg.py file, which contains example code on how to train and evaluate the model. Upon completion of training, word representations will be saved to the output folder. For example, one can use the trained Gaussian word representations (mus and sigmas) as input to word pair evaluation.
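If you want to reuse the saved representations elsewhere, a loader along the following lines may work. Note that the file layout (one word per line followed by its space-separated values) and the paths are assumptions made here for illustration; check the actual output of run_bsg.py.

import numpy as np

def load_vectors(path):
    # Assumed layout: one word per line followed by space-separated floats;
    # adjust if run_bsg.py writes a different format.
    vecs = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip a possible header or blank lines
            vecs[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vecs

mus = load_vectors('output/mu.vectors')        # hypothetical paths
sigmas = load_vectors('output/sigma.vectors')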

Data

A small dataset consisting of 15 million tokens is available for smoke tests of the setup. Alternatively, a dataset consisting of approximately 1 billion tokens is also available for public use. The dataset originally used in the research is not publicly available, but can be requested at http://wacky.sslmit.unibo.it/doku.php?id=corpora.

Word pairs evaluation

One can use the eval/word_pairs_eval.py console application as a playground for word pair evaluation in terms of similarity, Kullback-Leibler divergence, and entailment directionality. The console application expects paths to a word pairs file and to the mu and sigma vectors (i.e. the word representations). A word pairs file should contain two words per line, separated by a space (the order does not matter), as shown below. The latter two files are obtained from a trained BSG model; alternatively, one can use word representations pre-trained on the 3B-token dataset.
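For instance, a word pairs file could look like this (the word choices are purely illustrative):

dog animal
car vehicle
apple fruit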

The example command below will evaluate pairs stored in eval/example_word_pairs.txt, and output results to the console.

python eval/word_pairs_eval.py -wpp eval/example_word_pairs.txt -mup vectors/mu.vectors -sigmap vectors/sigma.vectors

Additional resources used in the project

The lexical substitution benchmark is a modified version of https://github.com/orenmel/lexsub.

Citation

@inproceedings{brazinskas-etal-2018-embedding,
    title = "Embedding Words as Distributions with a {B}ayesian Skip-gram Model",
    author = "Bra{\v{z}}inskas, Arthur  and
      Havrylov, Serhii  and
      Titov, Ivan",
    booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
    month = aug,
    year = "2018",
    address = "Santa Fe, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/C18-1151",
    pages = "1775--1789",
}
