
jind11 / word2vec-on-wikipedia

License: MIT
A pipeline for training word embeddings with word2vec on the Wikipedia corpus.

Programming Languages

python
139335 projects - #7 most used programming language
c
50402 projects - #5 most used programming language
shell
77523 projects
Makefile
30231 projects

Projects that are alternatives of or similar to word2vec-on-wikipedia

Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (+72.06%)
Mutual labels:  word2vec, word-embeddings
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (+157.35%)
Mutual labels:  word2vec, word-embeddings
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+2432.35%)
Mutual labels:  word2vec, word-embeddings
Postgres Word2vec
utils to use word embedding like word2vec vectors in a postgres database
Stars: ✭ 96 (+41.18%)
Mutual labels:  word2vec, word-embeddings
Koan
A word2vec negative sampling implementation with correct CBOW update.
Stars: ✭ 232 (+241.18%)
Mutual labels:  word2vec, word-embeddings
Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (+41.18%)
Mutual labels:  word2vec, word-embeddings
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+18669.12%)
Mutual labels:  word2vec, word-embeddings
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+951.47%)
Mutual labels:  word2vec, word-embeddings
Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (+197.06%)
Mutual labels:  word2vec, word-embeddings
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+188.24%)
Mutual labels:  word2vec, word-embeddings
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (+33.82%)
Mutual labels:  word2vec, word-embeddings
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (+45.59%)
Mutual labels:  word2vec, word-embeddings
Glove As A Tensorflow Embedding Layer
Taking a pretrained GloVe model, and using it as a TensorFlow embedding weight layer **inside the GPU**. Therefore, you only need to send the index of the words through the GPU data transfer bus, reducing data transfer overhead.
Stars: ✭ 85 (+25%)
Mutual labels:  word2vec, word-embeddings
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+1950%)
Mutual labels:  word2vec, word-embeddings
Word2vec Win32
A word2vec port for Windows.
Stars: ✭ 41 (-39.71%)
Mutual labels:  word2vec, word-embeddings
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (+86.76%)
Mutual labels:  word2vec, word-embeddings
Wego
Word Embeddings (e.g. Word2Vec) in Go!
Stars: ✭ 336 (+394.12%)
Mutual labels:  word2vec, word-embeddings
Deep learning nlp
Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP
Stars: ✭ 407 (+498.53%)
Mutual labels:  word2vec, word-embeddings
Germanwordembeddings
Toolkit to obtain and preprocess german corpora, train models using word2vec (gensim) and evaluate them with generated testsets
Stars: ✭ 189 (+177.94%)
Mutual labels:  word2vec, word-embeddings
Word2Vec-on-Wikipedia-Corpus
Train a Word2vec model using Chinese and English Wikipedia corpora
Stars: ✭ 18 (-73.53%)
Mutual labels:  wikipedia, word2vec

word2vec-on-wikipedia

A pipeline for training word embeddings with word2vec on the Wikipedia corpus.

How to use

Just run sudo sh run.sh, which will (see the Python sketch after this list):

  • Download the latest English Wikipedia dump
  • Extract and clean text from the downloaded Wikipedia dump
  • Pre-process the Wikipedia corpus
  • Train a word2vec model on the processed corpus to produce word embeddings
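
For illustration, the same sequence can also be scripted directly. The following is a minimal Python sketch that simply shells out to the commands described in the sections below; it is not the actual run.sh, and it assumes that WikiExtractor.py, wiki-corpus-prepare.py and the compiled word2vec binary are in place, and that the CoreNLP server (see the pre-processing section) is already running:

import subprocess

# Commands as given in the sections below; adjust paths to your own layout.
# The training command uses ../ paths, so it is assumed to be run from inside
# the word2vec tool directory (here called "word2vec").
steps = [
    ('curl -L -O "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"', '.'),
    ('python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -b 1G -o extracted --no-template --processes 24', '.'),
    ('python wiki-corpus-prepare.py extracted/wiki processed/wiki', '.'),
    ('./word2vec -train ../processed/wiki -output ../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin '
     '-cbow 0 -size 300 -window 10 -negative 15 -hs 0 -sample 1e-5 -threads 24 -binary 1 -min-count 10', 'word2vec'),
]

for cmd, cwd in steps:
    print('Running:', cmd)
    subprocess.run(cmd, shell=True, check=True, cwd=cwd)  # abort on the first failure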

Details for each step are discussed below.

Wikipedia dump

The latest English Wikipedia can be downloaded as a Wikipedia database dump. Here I downloaded all the article pages:

curl -L -O "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

Wikipedia dump extraction

The downloadable Wikipedia dump is in XML format with a fairly complex structure, so we need an extractor tool to parse it. The one I used is from the wikiextractor repository. Only the file WikiExtractor.py is needed, and the parameter descriptions can be found in that repository's readme file. The output contains each article's id and title followed by its content in plain text.

python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -b 1G -o extracted --no-template --processes 24

Text pre-processing

Before the word2vec training, the corpus needs to be pre-processed, which basically includes: sentence splitting, sentence tokenization, removing sentences that contain fewer than 20 characters or fewer than 5 tokens, and converting all numerals to 0. For example, "1993" is converted into "0000".

python wiki-corpus-prepare.py extracted/wiki processed/wiki
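
The length filter and number normalization described above are simple to express. Below is a rough, illustrative sketch of those two rules (the function names are hypothetical, not the actual ones in wiki-corpus-prepare.py); whether spaces count toward the 20-character threshold is an implementation detail of the script:

import re

def normalize_numbers(token):
    # Convert every digit to 0, e.g. "1993" -> "0000".
    return re.sub(r'\d', '0', token)

def keep_sentence(tokens):
    # Drop sentences with fewer than 20 characters or fewer than 5 tokens.
    return len(' '.join(tokens)) >= 20 and len(tokens) >= 5

tokens = ['The', 'Berlin', 'Wall', 'fell', 'in', '1989', '.']
if keep_sentence(tokens):
    print(' '.join(normalize_numbers(t) for t in tokens))
    # -> The Berlin Wall fell in 0000 .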

Here I used the Stanford CoreNLP toolkit 3.8.0 for sentence splitting and tokenization. To use it, we need to start a local server from within the downloaded toolkit folder:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

In the script wiki-corpus-prepare.py, I used a Python wrapper for the Stanford CoreNLP server, so that the Java server can be driven from the Python script.
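
As an illustration, with the pycorenlp wrapper (one of several Python clients for the CoreNLP server; the wrapper actually used by the script may differ), sentence splitting and tokenization against the local server look roughly like this:

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')  # the server started above

text = 'Word2vec was introduced in 2013. It learns dense word vectors.'
ann = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit',  # sentence splitting and tokenization only
    'outputFormat': 'json',
})

for sentence in ann['sentences']:
    print(' '.join(tok['word'] for tok in sentence['tokens']))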

Word2vec training

Once the processed Wikipedia corpus is ready, we can start the word2vec training. Here I used the Google word2vec tool, which is standard and efficient. The tool is already included in this repository, but in case you want to download the original one, you can find it here.

./word2vec -train ../processed/wiki -output ../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin -cbow 0 -size 300 -window 10 -negative 15 -hs 0 -sample 1e-5 -threads 24 -binary 1 -min-count 10
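
In this command, -cbow 0 selects the skip-gram model, with 300-dimensional vectors, a 10-word context window, 15 negative samples, subsampling at 1e-5 and a minimum word count of 10. Once training finishes, the binary output can be inspected from Python, for example with gensim (not part of this pipeline, shown here only as a convenient way to sanity-check the vectors; adjust the path and the token casing to match your run):

from gensim.models import KeyedVectors

# Load the C-format binary file produced by the word2vec tool.
wv = KeyedVectors.load_word2vec_format(
    'results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin', binary=True)

print(wv.most_similar('king', topn=5))    # nearest neighbours by cosine similarity
print(wv.similarity('paris', 'london'))   # similarity between two words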

Evaluation of word embeddings

After the word embeddings are trained, we want to evaluate their quality. Here I used the word relation test set described in Efficient Estimation of Word Representations in Vector Space for the performance test.

./compute-accuracy ../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin < questions-words.txt
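
If you prefer to stay in Python, gensim can run the same questions-words.txt analogy test on the loaded vectors. This is an alternative to the compute-accuracy tool shipped with word2vec, not what this repository uses, and the two may differ in vocabulary restriction and case handling, so the numbers are not guaranteed to match exactly:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    'results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin', binary=True)

# evaluate_word_analogies restricts the vocabulary to the 300,000 most
# frequent words by default.
score, sections = wv.evaluate_word_analogies('questions-words.txt')
print(f'Overall analogy accuracy: {score:.2%}')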

In my experiments, the vocabulary of the obtained word embeddings contains 833,976 words and the corpus contains 2,333,367,969 tokens. I generated several word embedding files for different vector sizes: 50, 100, 200, 300 and 400. For each file, the download link and its word relation test performance are given in the following table:

Vector size    Word relation test performance (%)
50             47.33
100            54.94
200            69.41
300            71.29
400            71.80

As you can see, the vector size influences the word relation test performance: within a certain range, the larger the vector size, the better the performance.
