EternalFeather / Word2Vec-on-Wikipedia-Corpus

Licence: other

Train Word2vec models on the Chinese and English Wikipedia corpora.

Programming Languages

python

Word2Vec on Wikipedia

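Language plays an important role in human communication: understanding how language is encoded lets us understand what the other person means. Machines, unlike people, cannot quickly extract useful information from complex text, so they need an encoding unit that can represent written language: the vector. Training a Word2Vec model to compute the similarity between words therefore seems to have become one of the main approaches to the text-encoding problem.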

Requirements

Python3
Gensim >= 2.3.0 (lower versions have not been tested)
Opencc
jieba

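Training corpus

Wikipedia officially provides a good English corpus of roughly 11 GB: open data link.
It also provides a Chinese corpus of roughly 1.5 GB: open data link.

The dumps are mainly in .xml format.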

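Workflow

Data preprocessing

The first preprocessing stage converts the wiki data from .xml format to text format.

This is done by word2vec_process.py, whose basic arguments are:
- -data: the input Wikipedia dataset.
- -output: the path and name of the output file.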

python word2vec_process.py -data enwiki-latest-pages-articles.xml.bz2 -output wiki.en.text

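Tips:

The Chinese Wikipedia corpus mixes Traditional and Simplified characters. To unify the script before training, convert the text with Opencc:

opencc -i wiki.zh.text -o wiki.zh.text.jianti -c zht2zhs.ini

The Chinese Wikipedia data then needs word segmentation; the segmentation tool used here is jieba.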
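The segmentation step is described but no script is given for it, so the following is only a minimal sketch of how it could be done. It assumes the converted file holds one article per line; the function names and the output file name are hypothetical, and `jieba.cut` is the only jieba call used.

```python
import sys


def segment_lines(lines, cut):
    """Yield each line with its tokens separated by single spaces.

    `cut` is any callable that maps a string to an iterable of tokens,
    for example jieba.cut.
    """
    for line in lines:
        # Drop whitespace-only tokens so the output has exactly one space
        # between words.
        tokens = [tok for tok in cut(line.strip()) if tok.strip()]
        yield " ".join(tokens)


def segment_file(src_path, dst_path):
    """Segment a text file line by line with jieba (hypothetical helper)."""
    import jieba  # third-party: pip install jieba

    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for segmented in segment_lines(src, jieba.cut):
            dst.write(segmented + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python segment.py wiki.zh.text.jianti wiki.zh.text.jianti.seg
    segment_file(sys.argv[1], sys.argv[2])
```

Passing the tokenizer in as `cut` keeps the line-handling logic independent of jieba, so it can be checked with any simple splitter before running on the full corpus.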

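word2vec_process.py relies on gensim's WikiCorpus class for handling Wikipedia dumps: its get_texts function outputs each article as one line of text, with punctuation already removed. After it runs, you have the English Wikipedia text file wiki.en.text (the name can be set through the -output argument).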
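The source of word2vec_process.py is not reproduced here, so the following is a sketch of what such a conversion could look like, not the author's actual script. It assumes a recent gensim release in which `get_texts` yields lists of `str` tokens; `build_parser` and `convert` are hypothetical names.

```python
import argparse
import sys


def build_parser():
    """Command-line arguments matching the -data / -output flags above."""
    parser = argparse.ArgumentParser(
        description="Convert a Wikipedia XML dump to one article per line.")
    parser.add_argument("-data", required=True,
                        help="input Wikipedia dump (.xml.bz2)")
    parser.add_argument("-output", required=True,
                        help="path of the plain-text output file")
    return parser


def convert(dump_path, text_path):
    """Write every article as a single space-separated line; return the count."""
    from gensim.corpora import WikiCorpus  # third-party: pip install gensim

    # dictionary={} skips building a vocabulary, which is not needed here.
    wiki = WikiCorpus(dump_path, dictionary={})
    count = 0
    with open(text_path, "w", encoding="utf-8") as out:
        for tokens in wiki.get_texts():
            # Assumption: tokens are str (older gensim versions may yield bytes).
            out.write(" ".join(tokens) + "\n")
            count += 1
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    args = build_parser().parse_args()
    print("articles written:", convert(args.data, args.output))
```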

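Model training

Once the articles are available as a text dataset, either the word2vec binary version or gensim's word2vec can be used to train the model, though the latter runs faster.

The model is built by word2vec_model.py, whose basic arguments are:
- -text: the name of the input Wikipedia text file.
- -vector: the path and name of the output vector file (default: wiki.en.text.vector).
- -core: the number of CPUs used for multiprocessing (default: all).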

python word2vec_model.py -text wiki.en.text -vector wiki.en.text.vector -core 8
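As an illustration of the training step, here is a hedged sketch built on gensim's `Word2Vec` and `LineSentence`. It is not the contents of word2vec_model.py: the function names are hypothetical, the hyperparameters (300 dimensions, window 5, min_count 5) are assumed values, and the `vector_size` keyword assumes gensim 4.x (earlier releases call it `size`).

```python
import argparse
import multiprocessing
import sys


def build_parser():
    """Command-line arguments matching the -text / -vector / -core flags above."""
    parser = argparse.ArgumentParser(description="Train word2vec on a text corpus.")
    parser.add_argument("-text", required=True,
                        help="input text file, one article per line")
    parser.add_argument("-vector", default="wiki.en.text.vector",
                        help="output file for the word vectors")
    parser.add_argument("-core", type=int, default=multiprocessing.cpu_count(),
                        help="number of worker threads (default: all CPUs)")
    return parser


def train(text_path, vector_path, cores):
    """Train a model and save its vectors in the word2vec text format."""
    from gensim.models import Word2Vec  # third-party: pip install gensim
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(
        LineSentence(text_path),  # streams one whitespace-tokenized line at a time
        vector_size=300,          # assumed value; named `size` before gensim 4
        window=5,                 # assumed value
        min_count=5,              # assumed value: ignore words seen fewer than 5 times
        workers=cores,
    )
    model.wv.save_word2vec_format(vector_path, binary=False)
    return model


if __name__ == "__main__" and len(sys.argv) > 1:
    args = build_parser().parse_args()
    train(args.text, args.vector, args.core)
```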

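Model testing

When training finishes, gensim produces a model in the vector format of the original C version of word2vec, which can then be used to run some evaluations on words.

Loading the model and querying it is done by word2vec_eval.py, whose basic arguments are:
- -vector: the path and name of the model to load.
- -mode: the name of the function to run on the model (including similar* [predict related words], similarity [measure how similar two words are], and others).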

python word2vec_eval.py -vector wiki.en.text.vector -mode similarity
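To make the two evaluation modes concrete, the sketch below reads vectors in the word2vec text format (a `vocab_size dim` header followed by one `word v1 ... vdim` row per line) and scores words by cosine similarity, using only the standard library. It is an illustration, not the code of word2vec_eval.py, and all function names are hypothetical. In practice gensim's `KeyedVectors.load_word2vec_format`, `most_similar`, and `similarity` cover the same ground far more efficiently.

```python
import math


def load_vectors(lines):
    """Parse word2vec text-format lines (e.g. an open file) into {word: vector}."""
    lines = iter(lines)
    _vocab_size, dim = map(int, next(lines).split())  # header: "vocab_size dim"
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Return the `topn` words closest to `word`, best match first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
```

For a real vector file, pass the open file object: `with open("wiki.en.text.vector", encoding="utf-8") as f: vectors = load_vectors(f)`. This pure-Python loop is only practical for small vocabularies.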

Reference

我愛自然語言處理 ("I Love Natural Language Processing" blog)

KeyWords

Tags: `Word2Vec` `Embedding`
