chatopera / wikidata-corpus

Licence: other

Train Wikidata with word2vec for word embedding tasks

Programming Languages

python

139335 projects - #7 most used programming language

shell

77523 projects

Projects that are alternatives of or similar to wikidata-corpus

Germanwordembeddings

Toolkit to obtain and preprocess german corpora, train models using word2vec (gensim) and evaluate them with generated testsets

Stars: ✭ 189 (+73.39%)

Mutual labels: word2vec, word-embeddings

lda2vec

Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019

Stars: ✭ 27 (-75.23%)

Mutual labels: word2vec, word-embeddings

Shallowlearn

An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.

Stars: ✭ 196 (+79.82%)

Mutual labels: word2vec, word-embeddings

Fasttext.js

FastText for Node.js

Stars: ✭ 127 (+16.51%)

Mutual labels: word2vec, word-embeddings

word2vec-tsne

Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.

Stars: ✭ 59 (-45.87%)

Mutual labels: word2vec, word-embeddings

Gensim

Topic Modelling for Humans

Stars: ✭ 12,763 (+11609.17%)

Mutual labels: word2vec, word-embeddings

Koan

A word2vec negative sampling implementation with correct CBOW update.

Stars: ✭ 232 (+112.84%)

Mutual labels: word2vec, word-embeddings

Text Summarizer

Python Framework for Extractive Text Summarization

Stars: ✭ 96 (-11.93%)

Mutual labels: word2vec, word-embeddings

two-stream-cnn

A two-stream convolutional neural network for learning abitrary similarity functions over two sets of training data

Stars: ✭ 24 (-77.98%)

Mutual labels: word2vec, word-embeddings

Simple-Sentence-Similarity

Exploring the simple sentence similarity measurements using word embeddings

Stars: ✭ 99 (-9.17%)

Mutual labels: word2vec, word-embeddings

Scattertext

Beautiful visualizations of how language differs among document types.

Stars: ✭ 1,722 (+1479.82%)

Mutual labels: word2vec, word-embeddings

word-benchmarks

Benchmarks for intrinsic word embeddings evaluation.

Stars: ✭ 45 (-58.72%)

Mutual labels: word2vec, word-embeddings

Dna2vec

dna2vec: Consistent vector representations of variable-length k-mers

Stars: ✭ 117 (+7.34%)

Mutual labels: word2vec, word-embeddings

Debiaswe

Remove problematic gender bias from word embeddings.

Stars: ✭ 175 (+60.55%)

Mutual labels: word2vec, word-embeddings

Magnitude

A fast, efficient universal vector embedding utility package.

Stars: ✭ 1,394 (+1178.9%)

Mutual labels: word2vec, word-embeddings

Chameleon recsys

Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems

Stars: ✭ 202 (+85.32%)

Mutual labels: word2vec, word-embeddings

Dict2vec

Dict2vec is a framework to learn word embeddings using lexical dictionaries.

Stars: ✭ 91 (-16.51%)

Mutual labels: word2vec, word-embeddings

Postgres Word2vec

utils to use word embedding like word2vec vectors in a postgres database

Stars: ✭ 96 (-11.93%)

Mutual labels: word2vec, word-embeddings

Word2vec

訓練中文詞向量 Word2vec, Word2vec was created by a team of researchers led by Tomas Mikolov at Google.

Stars: ✭ 48 (-55.96%)

Mutual labels: wikidata, word2vec

word2vec-on-wikipedia

A pipeline for training word embeddings using word2vec on wikipedia corpus.

Stars: ✭ 68 (-37.61%)

Mutual labels: word2vec, word-embeddings

View All Similar Projects ➔

wikidata

wikidata.org

Download

STORE_PATH=data
DATA_URL=http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

cd $STORE_PATH
wget $DATA_URL

Extract articles

WikiExtractor.py -b 5000M \
    -o data/zhwiki-latest-pages-articles.extracted \
    data/zhwiki-latest-pages-articles.xml.bz2

繁体转简体

opencc -i data/zhwiki-latest-pages-articles.extracted/AA/wiki_00  \
    -o data/zhwiki-latest-pages-articles.0620.chs \
    -c t2s.json

Download t2s.json.

到此为止，已经完成了大部分繁简转换工作。

其他情况处理

维基百科使用的繁简转换方法是以词表为准，外加人工修正。人工修正之后的文字是这种格式，多数是为了解决各地术语名称不同的问题：

他的主要成就包括Emacs及後來的GNU Emacs，GNU C 編譯器及-{zh-hant:GNU 除錯器;zh-hans:GDB 调试器}-。

对付这种可以简单的使用正则表达式来解决。一般简体中文的限定词是zh-hans或zh-cn。

由于Wikipedia Extractor抽取正文时，会将有特殊标记的外文直接剔除，最后形成类似这样的正文：

西方语言中“数学”（；）一词源自于古希腊语的（）

虽然上面这句话是读不通的，但鉴于这种句子对我要处理的问题影响不大，就暂且忽略了。最后再将「」『』这些符号替换成引号，顺便删除空括号。

python2 fix_special_symbols.py data/zhwiki-latest-pages-articles.0620.chs

程序执行结束，输出: data/zhwiki-latest-pages-articles.0620.chs.normalized。

浏览文件

head data/zhwiki-latest-pages-articles.0620.chs.normalized

分词

执行脚本

export PYTHONIOENCODING="UTF-8"
python3 wordseg.py > data/zhwiki-latest-pages-articles.0620.chs.normalized.wordseg

word2vec

word2vec官方的实现。

./word2vec_c_format_train.sh

Usage of word2vec model

word2vec cli

distance, compute-accuracy, word-analogy

python

python3 word2vec_gensim_similarity.py

TF-IDF

plain code

train

python3 tfidf_plain.py

After running, dump words, weights and idf into pickle file.

adv version in sklearn

现在会有稀疏矩阵的问题，解决方案是使用限定的词汇表。

python3 tfidf_sklearn.py

关联项目

Synonyms

中文近义词库，Synonyms使用wikidata-corpus训练的词向量生成近义词表。

references

http://licstar.net/archives/328 http://licstar.net/archives/tag/wikipedia-extractor

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

chatopera / wikidata-corpus

Programming Languages

Labels

Projects that are alternatives of or similar to wikidata-corpus

wikidata

Download

Extract articles

繁体转简体

其他情况处理

浏览文件

分词

word2vec

Usage of word2vec model

TF-IDF

关联项目

Synonyms

references