hgrif / wiki-word2vec

Licence: MIT License

Wiki Word2vec

Train a gensim word2vec model on Wikipedia.

Most of it is taken from this blogpost and this discussion. This repository was created mostly to try out make; see The gist for the important stuff. Note that performance depends heavily on corpus size and the chosen parameters (especially for smaller corpora). The examples and parameters below are cherry-picked.

Usage

Get the Wikipedia language code for the language you want (see here).

Run make with the code as the value for LANGUAGE (or change the Makefile). For instance, try Swahili (sw):

make LANGUAGE=sw

The gist

Ignore make and execute the following bash commands for Swahili:

mkdir -p data/sw/
wget -P data/sw/ https://dumps.wikimedia.org/swwiki/latest/swwiki-latest-pages-articles.xml.bz2
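
The download step above follows a simple pattern in the language code. As a minimal sketch, the helper below builds the dump URL and the local path for a given code. The function names and the `data_dir` argument are made up for illustration, and the pattern is only assumed to hold for simple codes such as sw:

```python
DUMPS_ROOT = "https://dumps.wikimedia.org"


def dump_filename(language):
    """File name of the latest articles dump for a language code (e.g. 'sw')."""
    return "{}wiki-latest-pages-articles.xml.bz2".format(language)


def dump_url(language):
    """URL of the latest articles dump, following the pattern of the wget call."""
    return "{}/{}wiki/latest/{}".format(DUMPS_ROOT, language, dump_filename(language))


def local_path(language, data_dir="data"):
    """Where the dump ends up after `wget -P data/<language>/`."""
    return "{}/{}/{}".format(data_dir, language, dump_filename(language))


if __name__ == "__main__":
    print(dump_url("sw"))
    print(local_path("sw"))
```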

Train a model in Python:

import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec

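# Written against the gensim 3.x API. In gensim 4 and later, 'size' was renamed
# to 'vector_size' and the 'lemmatize' argument of WikiCorpus is no longer supported.
wiki = WikiCorpus('data/sw/swwiki-latest-pages-articles.xml.bz2',
                  lemmatize=False, dictionary={})
# One list of tokens per article, all held in memory.
sentences = list(wiki.get_texts())
# Leave one core free for the rest of the system.
params = {'size': 200, 'window': 10, 'min_count': 10,
          'workers': max(1, multiprocessing.cpu_count() - 1), 'sample': 1E-3}
word2vec = Word2Vec(sentences, **params)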
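
Calling list() on the texts keeps the whole corpus in memory, which is fine for Swahili but may not be for larger Wikipedias. Word2Vec iterates over its input more than once, so a plain generator will not do; a small restartable wrapper is one possible workaround. The class below is a hypothetical helper, not part of gensim or of this repository, and it only assumes that the wrapped object has a get_texts() method:

```python
class RestartableTexts:
    """Iterable that re-reads the corpus on every pass instead of caching it."""

    def __init__(self, corpus):
        # Any object with a get_texts() method, such as a WikiCorpus.
        self.corpus = corpus

    def __iter__(self):
        # A fresh iterator per pass, so the corpus can be walked several times.
        return iter(self.corpus.get_texts())


# Possible usage with the objects defined above (not run here):
#   sentences = RestartableTexts(wiki)
#   word2vec = Word2Vec(sentences, **params)
```

The trade-off is speed: every pass parses the compressed dump again, so this is slower than the in-memory list.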

Example 1

Try the old man:king woman:? problem:

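# Query through .wv (the KeyedVectors); calling this on the model itself was
# deprecated in later gensim 3.x releases and removed in gensim 4.
female_king = word2vec.wv.most_similar_cosmul(positive='mfalme mwanamke'.split(),
                                              negative='mtu'.split(), topn=5)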
for ii, (word, score) in enumerate(female_king):
    print("{}. {} ({:1.2f})".format(ii+1, word, score))

1. malkia (0.97)
2. kambisi (0.93)
3. suleimani (0.93)
4. karolo (0.92)
5. koreshi (0.92)

These are, respectively, queen (jackpot!), Cambyses II (a Persian king), Solomon (king of Israel), Karolo Mkuu? (Charlemagne?) and Cyrus (a Persian king).
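
For the curious, most_similar_cosmul ranks candidates with the multiplicative combination objective of Levy and Goldberg: cosine similarities are shifted into the range 0 to 1, the positive ones are multiplied together and divided by the product of the negative ones. The sketch below reimplements that idea with the standard library only (Python 3.8+ for math.prod). The function names and the two-dimensional toy vectors are made up for illustration and are not the trained model's vectors:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shifted_cosine(u, v):
    """Cosine similarity mapped from [-1, 1] to [0, 1]."""
    return (1 + cosine(u, v)) / 2


def cosmul_rank(vectors, positive, negative, eps=1e-6):
    """Rank all words that are not part of the query, best match first."""
    query_words = set(positive) | set(negative)
    scores = []
    for word, vec in vectors.items():
        if word in query_words:
            continue  # the query words themselves are not candidates
        numerator = math.prod(shifted_cosine(vec, vectors[p]) for p in positive)
        # eps avoids division by zero when a negative similarity is exactly -1
        denominator = math.prod(shifted_cosine(vec, vectors[n]) for n in negative) + eps
        scores.append((word, numerator / denominator))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors: first axis is "royalty", second axis is "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.2, 1.0],
    "woman": [0.2, -1.0],
    "prince": [0.8, 1.0],
}

ranking = cosmul_rank(toy_vectors, positive=["king", "woman"], negative=["man"])
print(ranking[0][0])
```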

Example 2

What doesn't match: car, train or breakfast?

# As in Example 1, query through .wv so this also runs on gensim 4.
print(word2vec.wv.doesnt_match('gari treni mlo'.split()))

mlo
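
Roughly speaking, doesnt_match normalises the word vectors, averages them, and returns the word that is least similar to that average. Here is a minimal standard-library sketch of that idea; the function names and toy vectors are invented for illustration, and gensim's actual implementation may differ in details:

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    units = [unit(vectors[word]) for word in words]
    dims = len(units[0])
    mean = unit([sum(u[i] for u in units) / len(units) for i in range(dims)])
    # Both sides are unit length, so the dot product is the cosine similarity.
    similarities = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[similarities.index(min(similarities))]


# Hypothetical toy vectors: first axis is "transport", second axis is "food".
toy_vectors = {
    "gari": [1.0, 0.1],   # car
    "treni": [0.9, 0.2],  # train
    "mlo": [0.1, 1.0],    # meal
}

print(odd_one_out(toy_vectors, ["gari", "treni", "mlo"]))
```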

Dependencies

  • Python 3
  • pip install gensim