All Projects → chatopera → wikidata-corpus

chatopera / wikidata-corpus

Licence: other
Train Wikidata with word2vec for word embedding tasks

Programming Languages

python
139335 projects - #7 most used programming language
shell
77523 projects

Projects that are alternatives of or similar to wikidata-corpus

Germanwordembeddings
Toolkit to obtain and preprocess german corpora, train models using word2vec (gensim) and evaluate them with generated testsets
Stars: ✭ 189 (+73.39%)
Mutual labels:  word2vec, word-embeddings
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-75.23%)
Mutual labels:  word2vec, word-embeddings
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+79.82%)
Mutual labels:  word2vec, word-embeddings
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (+16.51%)
Mutual labels:  word2vec, word-embeddings
word2vec-tsne
Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE.
Stars: ✭ 59 (-45.87%)
Mutual labels:  word2vec, word-embeddings
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+11609.17%)
Mutual labels:  word2vec, word-embeddings
Koan
A word2vec negative sampling implementation with correct CBOW update.
Stars: ✭ 232 (+112.84%)
Mutual labels:  word2vec, word-embeddings
Text Summarizer
Python Framework for Extractive Text Summarization
Stars: ✭ 96 (-11.93%)
Mutual labels:  word2vec, word-embeddings
two-stream-cnn
A two-stream convolutional neural network for learning abitrary similarity functions over two sets of training data
Stars: ✭ 24 (-77.98%)
Mutual labels:  word2vec, word-embeddings
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (-9.17%)
Mutual labels:  word2vec, word-embeddings
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+1479.82%)
Mutual labels:  word2vec, word-embeddings
word-benchmarks
Benchmarks for intrinsic word embeddings evaluation.
Stars: ✭ 45 (-58.72%)
Mutual labels:  word2vec, word-embeddings
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (+7.34%)
Mutual labels:  word2vec, word-embeddings
Debiaswe
Remove problematic gender bias from word embeddings.
Stars: ✭ 175 (+60.55%)
Mutual labels:  word2vec, word-embeddings
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+1178.9%)
Mutual labels:  word2vec, word-embeddings
Chameleon recsys
Source code of CHAMELEON - A Deep Learning Meta-Architecture for News Recommender Systems
Stars: ✭ 202 (+85.32%)
Mutual labels:  word2vec, word-embeddings
Dict2vec
Dict2vec is a framework to learn word embeddings using lexical dictionaries.
Stars: ✭ 91 (-16.51%)
Mutual labels:  word2vec, word-embeddings
Postgres Word2vec
utils to use word embedding like word2vec vectors in a postgres database
Stars: ✭ 96 (-11.93%)
Mutual labels:  word2vec, word-embeddings
Word2vec
訓練中文詞向量 Word2vec, Word2vec was created by a team of researchers led by Tomas Mikolov at Google.
Stars: ✭ 48 (-55.96%)
Mutual labels:  wikidata, word2vec
word2vec-on-wikipedia
A pipeline for training word embeddings using word2vec on wikipedia corpus.
Stars: ✭ 68 (-37.61%)
Mutual labels:  word2vec, word-embeddings

chatoper banner

wikidata

wikidata.org

Download

STORE_PATH=data
DATA_URL=http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

cd $STORE_PATH
wget $DATA_URL

Extract articles

WikiExtractor.py -b 5000M \
    -o data/zhwiki-latest-pages-articles.extracted \
    data/zhwiki-latest-pages-articles.xml.bz2

繁体转简体

opencc -i data/zhwiki-latest-pages-articles.extracted/AA/wiki_00  \
    -o data/zhwiki-latest-pages-articles.0620.chs \
    -c t2s.json

Download t2s.json.

到此为止,已经完成了大部分繁简转换工作。

其他情况处理

  1. 维基百科使用的繁简转换方法是以词表为准,外加人工修正。人工修正之后的文字是这种格式,多数是为了解决各地术语名称不同的问题:

他的主要成就包括Emacs及後來的GNU Emacs,GNU C 編譯器及-{zh-hant:GNU 除錯器;zh-hans:GDB 调试器}-。

对付这种可以简单的使用正则表达式来解决。一般简体中文的限定词是zh-hans或zh-cn。

  1. 由于Wikipedia Extractor抽取正文时,会将有特殊标记的外文直接剔除,最后形成类似这样的正文:

西方语言中“数学”(;)一词源自于古希腊语的()

虽然上面这句话是读不通的,但鉴于这种句子对我要处理的问题影响不大,就暂且忽略了。最后再将「」『』这些符号替换成引号,顺便删除空括号。

python2 fix_special_symbols.py data/zhwiki-latest-pages-articles.0620.chs

程序执行结束,输出: data/zhwiki-latest-pages-articles.0620.chs.normalized

浏览文件

head data/zhwiki-latest-pages-articles.0620.chs.normalized

分词

  • 执行脚本
export PYTHONIOENCODING="UTF-8"
python3 wordseg.py > data/zhwiki-latest-pages-articles.0620.chs.normalized.wordseg

word2vec

word2vec官方的实现。

./word2vec_c_format_train.sh

Usage of word2vec model

  • word2vec cli
distance, compute-accuracy, word-analogy
  • python
python3 word2vec_gensim_similarity.py

TF-IDF

  • plain code

train

python3 tfidf_plain.py

After running, dump words, weights and idf into pickle file.

现在会有稀疏矩阵的问题,解决方案是使用限定的词汇表。

python3 tfidf_sklearn.py

关联项目

Synonyms

中文近义词库,Synonyms使用wikidata-corpus训练的词向量生成近义词表。

references

http://licstar.net/archives/328 http://licstar.net/archives/tag/wikipedia-extractor

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].