
jind11 / word2vec-on-wikipedia

License: MIT
A pipeline for training word embeddings with word2vec on the Wikipedia corpus.

Programming Languages

python
139335 projects - #7 most used programming language
c
50402 projects - #5 most used programming language
shell
77523 projects
Makefile
30231 projects

Projects that are alternatives of or similar to word2vec-on-wikipedia

Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (+72.06%)
Mutual labels:  word2vec, word-embeddings
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (+157.35%)
Mutual labels:  word2vec, word-embeddings
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+2432.35%)
Mutual labels:  word2vec, word-embeddings
Postgres Word2vec
utils to use word embedding like word2vec vectors in a postgres database
Stars: ✭ 96 (+41.18%)
Mutual labels:  word2vec, word-embeddings
Koan
A word2vec negative sampling implementation with correct CBOW update.
Stars: ✭ 232 (+241.18%)
Mutual labels:  word2vec, word-embeddings
Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (+41.18%)
Mutual labels:  word2vec, word-embeddings
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+18669.12%)
Mutual labels:  word2vec, word-embeddings
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+951.47%)
Mutual labels:  word2vec, word-embeddings
Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (+197.06%)
Mutual labels:  word2vec, word-embeddings
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+188.24%)
Mutual labels:  word2vec, word-embeddings
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (+33.82%)
Mutual labels:  word2vec, word-embeddings
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (+45.59%)
Mutual labels:  word2vec, word-embeddings
Glove As A Tensorflow Embedding Layer
Taking a pretrained GloVe model, and using it as a TensorFlow embedding weight layer **inside the GPU**. Therefore, you only need to send the index of the words through the GPU data transfer bus, reducing data transfer overhead.
Stars: ✭ 85 (+25%)
Mutual labels:  word2vec, word-embeddings
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+1950%)
Mutual labels:  word2vec, word-embeddings
Word2vec Win32
A word2vec port for Windows.
Stars: ✭ 41 (-39.71%)
Mutual labels:  word2vec, word-embeddings
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (+86.76%)
Mutual labels:  word2vec, word-embeddings
Wego
Word Embeddings (e.g. Word2Vec) in Go!
Stars: ✭ 336 (+394.12%)
Mutual labels:  word2vec, word-embeddings
Deep learning nlp
Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP
Stars: ✭ 407 (+498.53%)
Mutual labels:  word2vec, word-embeddings
Germanwordembeddings
Toolkit to obtain and preprocess german corpora, train models using word2vec (gensim) and evaluate them with generated testsets
Stars: ✭ 189 (+177.94%)
Mutual labels:  word2vec, word-embeddings
Word2Vec-on-Wikipedia-Corpus
Train a Word2vec model using Chinese and English Wikipedia corpora
Stars: ✭ 18 (-73.53%)
Mutual labels:  wikipedia, word2vec

word2vec-on-wikipedia

A pipeline for training word embeddings with word2vec on the Wikipedia corpus.

How to use

Just run sudo sh run.sh, which will (see the Python sketch after this list):

  • Download the latest English Wikipedia dump
  • Extract and clean text from the downloaded Wikipedia dump
  • Pre-process the Wikipedia corpus
  • Train a word2vec model on the processed corpus to produce word embeddings
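
For illustration, the same sequence can also be scripted directly. The following is a minimal Python sketch that simply shells out to the commands described in the sections below; it is not the actual run.sh, and it assumes that WikiExtractor.py, wiki-corpus-prepare.py and the compiled word2vec binary are in place, and that the CoreNLP server (see the pre-processing section) is already running:

import subprocess

# Commands as given in the sections below; adjust paths to your own layout.
# The training command uses ../ paths, so it is assumed to be run from inside
# the word2vec tool directory (here called "word2vec").
steps = [
    ('curl -L -O "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"', '.'),
    ('python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -b 1G -o extracted --no-template --processes 24', '.'),
    ('python wiki-corpus-prepare.py extracted/wiki processed/wiki', '.'),
    ('./word2vec -train ../processed/wiki -output ../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin '
     '-cbow 0 -size 300 -window 10 -negative 15 -hs 0 -sample 1e-5 -threads 24 -binary 1 -min-count 10', 'word2vec'),
]

for cmd, cwd in steps:
    print('Running:', cmd)
    subprocess.run(cmd, shell=True, check=True, cwd=cwd)  # abort on the first failure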

Details for each step are discussed below.

Wikipedia dump

The latest English Wikipedia can be downloaded as a Wikipedia database dump. Here I downloaded all the article pages:

curl -L -O "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

Wikipedia dump extraction

The downloadable Wikipedia dump is in XML format with a fairly complex structure, so we need an extractor tool to parse it. The one I used is from the wikiextractor repository. Only the file WikiExtractor.py is needed, and the parameter descriptions can be found in that repository's readme file. The output contains each article's id and title followed by its content in plain text.

python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -b 1G -o extracted --no-template --processes 24

Text pre-processing

Before the word2vec training, the corpus needs to be pre-processed, which basically includes: sentence splitting, sentence tokenization, removing sentences that contain fewer than 20 characters or fewer than 5 tokens, and converting all numerals to 0. For example, "1993" is converted into "0000".

python wiki-corpus-prepare.py extracted/wiki processed/wiki
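
The length filter and number normalization described above are simple to express. Below is a rough, illustrative sketch of those two rules (the function names are hypothetical, not the actual ones in wiki-corpus-prepare.py); whether spaces count toward the 20-character threshold is an implementation detail of the script:

import re

def normalize_numbers(token):
    # Convert every digit to 0, e.g. "1993" -> "0000".
    return re.sub(r'\d', '0', token)

def keep_sentence(tokens):
    # Drop sentences with fewer than 20 characters or fewer than 5 tokens.
    return len(' '.join(tokens)) >= 20 and len(tokens) >= 5

tokens = ['The', 'Berlin', 'Wall', 'fell', 'in', '1989', '.']
if keep_sentence(tokens):
    print(' '.join(normalize_numbers(t) for t in tokens))
    # -> The Berlin Wall fell in 0000 .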

Here I used the Stanford CoreNLP toolkit 3.8.0 for sentence splitting and tokenization. To use it, we need to start a local server from within the downloaded toolkit folder:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

In the script wiki-corpus-prepare.py, I used a Python wrapper for the Stanford CoreNLP server, so that the Java server can be driven from the Python script.
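
As an illustration, with the pycorenlp wrapper (one of several Python clients for the CoreNLP server; the wrapper actually used by the script may differ), sentence splitting and tokenization against the local server look roughly like this:

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')  # the server started above

text = 'Word2vec was introduced in 2013. It learns dense word vectors.'
ann = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit',  # sentence splitting and tokenization only
    'outputFormat': 'json',
})

for sentence in ann['sentences']:
    print(' '.join(tok['word'] for tok in sentence['tokens']))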

Word2vec training

Once the processed Wikipedia corpus is ready, we can start the word2vec training. Here I used the Google word2vec tool, which is standard and efficient. The tool is already included in this repository, but in case you want to download the original one, you can find it here.

./word2vec -train ../processed/wiki -output ../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin -cbow 0 -size 300 -window 10 -negative 15 -hs 0 -sample 1e-5 -threads 24 -binary 1 -min-count 10
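
In this command, -cbow 0 selects the skip-gram model, with 300-dimensional vectors, a 10-word context window, 15 negative samples, subsampling at 1e-5 and a minimum word count of 10. Once training finishes, the binary output can be inspected from Python, for example with gensim (not part of this pipeline, shown here only as a convenient way to sanity-check the vectors; adjust the path and the token casing to match your run):

from gensim.models import KeyedVectors

# Load the C-format binary file produced by the word2vec tool.
wv = KeyedVectors.load_word2vec_format(
    'results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin', binary=True)

print(wv.most_similar('king', topn=5))    # nearest neighbours by cosine similarity
print(wv.similarity('paris', 'london'))   # similarity between two words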

Evaluation of word embeddings

After the word embeddings are trained, we want to evaluate their quality. Here I used the word relation test set described in Efficient Estimation of Word Representations in Vector Space for the performance test.

./compute-accuracy ../results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin < questions-words.txt
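
If you prefer to stay in Python, gensim can run the same questions-words.txt analogy test on the loaded vectors. This is an alternative to the compute-accuracy tool shipped with word2vec, not what this repository uses, and the two may differ in vocabulary restriction and case handling, so the numbers are not guaranteed to match exactly:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    'results/enwiki.skip.size300.win10.neg15.sample1e-5.min10.bin', binary=True)

# evaluate_word_analogies restricts the vocabulary to the 300,000 most
# frequent words by default.
score, sections = wv.evaluate_word_analogies('questions-words.txt')
print(f'Overall analogy accuracy: {score:.2%}')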

In my experiments, the vocabulary of the obtained word embeddings contains 833,976 words and the corpus contains 2,333,367,969 tokens. I generated several word embedding files for different vector sizes: 50, 100, 200, 300 and 400. For each file, the download link and its word relation test performance are given in the following table:

Vector size    Word relation test performance (%)
50             47.33
100            54.94
200            69.41
300            71.29
400            71.80

As you can see, the vector size influences the word relation test performance: within a certain range, the larger the vector size, the better the performance.
