zake7749 / Word2vec Tutorial

Licence: MIT
Chinese word vector training tutorial

Programming Languages

python

Projects that are alternatives of or similar to Word2vec Tutorial

Gemsec
The TensorFlow reference implementation of 'GEMSEC: Graph Embedding with Self Clustering' (ASONAM 2019).
Stars: ✭ 210 (-50.7%)
Mutual labels:  word2vec, gensim
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (-5.16%)
Mutual labels:  word2vec, gensim
Aravec
AraVec is a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models.
Stars: ✭ 239 (-43.9%)
Mutual labels:  word2vec, gensim
Splitter
A PyTorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).
Stars: ✭ 177 (-58.45%)
Mutual labels:  word2vec, gensim
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (-87.79%)
Mutual labels:  word2vec, gensim
Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (-55.63%)
Mutual labels:  word2vec, gensim
Word2VecAndTsne
Scripts demo-ing how to train a Word2Vec model and reduce its vector space
Stars: ✭ 45 (-89.44%)
Mutual labels:  word2vec, gensim
Wordembeddings Elmo Fasttext Word2vec
Using pre-trained word embeddings (Fasttext, Word2Vec)
Stars: ✭ 146 (-65.73%)
Mutual labels:  word2vec, gensim
walklets
A lightweight implementation of Walklets from "Don't Walk Skip! Online Learning of Multi-scale Network Embeddings" (ASONAM 2017).
Stars: ✭ 94 (-77.93%)
Mutual labels:  word2vec, gensim
biovec
ProtVec can be used in protein interaction predictions, structure prediction, and protein data visualization.
Stars: ✭ 23 (-94.6%)
Mutual labels:  word2vec, gensim
Log Anomaly Detector
Log Anomaly Detection - Machine learning to detect abnormal event logs
Stars: ✭ 169 (-60.33%)
Mutual labels:  word2vec, gensim
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-92.96%)
Mutual labels:  word2vec, gensim
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+2896.01%)
Mutual labels:  word2vec, gensim
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (-53.99%)
Mutual labels:  word2vec, gensim
Webvectors
Web-ify your word2vec: framework to serve distributional semantic models online
Stars: ✭ 154 (-63.85%)
Mutual labels:  word2vec, gensim
word-embeddings-from-scratch
Creating word embeddings from scratch and visualizing them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (-94.84%)
Mutual labels:  word2vec, gensim
Role2vec
A scalable Gensim implementation of "Learning Role-based Graph Embeddings" (IJCAI 2018).
Stars: ✭ 134 (-68.54%)
Mutual labels:  word2vec, gensim
Turkish Word2vec
Pre-trained Word2Vec Model for Turkish
Stars: ✭ 136 (-68.08%)
Mutual labels:  word2vec, gensim
doc2vec-api
document embedding and machine learning script for beginners
Stars: ✭ 92 (-78.4%)
Mutual labels:  word2vec, gensim
word2vec-pt-br
Implementation and model generated by training (trigram) on the pt-br Wikipedia
Stars: ✭ 34 (-92.02%)
Mutual labels:  word2vec, gensim

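Training Chinese word vectors with gensim

Tutorial document

Package requirements

  • jieba
pip3 install jieba
  • gensim
pip3 install -U gensim
  • OpenCC (can be replaced with any Traditional/Simplified Chinese conversion package)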
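
Before starting, it can help to confirm that the packages are importable. The following is a minimal sketch, not part of the project: the helper `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumptions (the OpenCC Python binding in particular may be installed under a different module name, and is not needed at all if you only use the `opencc` command-line tool).

```python
import importlib.util

# Import names assumed for the three packages listed above; the OpenCC
# binding may be installed under a different module name.
REQUIRED_MODULES = ["jieba", "gensim", "opencc"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(REQUIRED_MODULES)` returns an empty list when everything is installed, and otherwise the names that still need a `pip3 install`.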

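Training procedure

1. Obtain a Chinese Wikipedia dump. This experiment used the dump from 2016/8/20.

The August 20 dump has since been retired. Please go to the Wikipedia:Database download page (維基百科:資料庫下載) and pick a newer dump by date. (Choose the file whose name ends in pages-articles.xml.bz2.)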

2. Place the downloaded dump in the same directory as the project, then use wiki_to_txt.py to extract the Wikipedia articles from the XML.

python3 wiki_to_txt.py zhwiki-20160820-pages-articles.xml.bz2

If you are not using the August 20 dump, replace zhwiki-20160820-pages-articles.xml.bz2 with the filename of the dump you downloaded.
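
The contents of wiki_to_txt.py are not shown here, so the following is only a sketch of how such an extraction step could look, assuming gensim's `WikiCorpus` is used and that its `get_texts()` yields lists of string tokens (as in recent gensim releases). The function names `write_articles` and `extract_wiki` are hypothetical; the default output name `wiki_texts.txt` is taken from the input of step 3.

```python
def write_articles(articles, out_path):
    """Write each article (an iterable of string tokens) as one space-joined line.

    Returns the number of articles written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for tokens in articles:
            out.write(" ".join(tokens) + "\n")
            count += 1
    return count


def extract_wiki(dump_path, out_path="wiki_texts.txt"):
    """Extract plain-text articles from a pages-articles.xml.bz2 dump."""
    # Imported lazily so write_articles can be used without gensim installed.
    from gensim.corpora import WikiCorpus

    # dictionary={} skips building a vocabulary, which this step does not need.
    corpus = WikiCorpus(dump_path, dictionary={})
    return write_articles(corpus.get_texts(), out_path)
```

Under these assumptions, `extract_wiki("zhwiki-20160820-pages-articles.xml.bz2")` would write one article per line to `wiki_texts.txt`. Expect this to take a long time on a full dump.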

3. Use OpenCC to convert all of the Wikipedia articles to Traditional Chinese.

opencc -i wiki_texts.txt -o wiki_zh_tw.txt -c s2tw.json
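
If you prefer to run the conversion from Python rather than the command line, a sketch along these lines may work. It assumes an `opencc` Python binding whose `OpenCC(config).convert(text)` accepts the same `s2tw.json` config as the command above; some bindings expect `s2tw` without the suffix. The helper names `convert_file` and `make_opencc_converter` are hypothetical.

```python
def convert_file(in_path, out_path, convert):
    """Stream in_path line by line through `convert` and write to out_path.

    Returns the number of lines processed.
    """
    lines = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(convert(line))
            lines += 1
    return lines


def make_opencc_converter(config="s2tw.json"):
    """Build a Simplified-to-Traditional (Taiwan) conversion function."""
    # Assumption: the `opencc` Python binding is installed and accepts this
    # config name; some bindings expect "s2tw" without the .json suffix.
    import opencc

    return opencc.OpenCC(config).convert
```

A call such as `convert_file("wiki_texts.txt", "wiki_zh_tw.txt", make_opencc_converter())` would mirror the command above. Because `convert_file` takes the converter as an argument, any other Traditional/Simplified conversion package can be dropped in without changing the file handling.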

4. Use jieba to segment the text into words and remove stop words.

python3 segment.py
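
The source of segment.py is not reproduced here. As a rough illustration of the step, the sketch below segments each line with `jieba.cut` and filters tokens against a stop-word set. The function names, the one-word-per-line stop-word file format, and any output filename you pass in are assumptions made for this example.

```python
def load_stopwords(path):
    """Read one stop word per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def segment_line(line, stopwords, cut=None):
    """Segment one line of text and drop stop words and whitespace tokens."""
    if cut is None:
        # Imported lazily; jieba.cut returns a generator of string tokens.
        import jieba
        cut = jieba.cut
    return [tok for tok in cut(line.strip())
            if tok.strip() and tok not in stopwords]


def segment_file(in_path, out_path, stopwords, cut=None):
    """Write each input line to out_path as space-separated tokens."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(segment_line(line, stopwords, cut)) + "\n")
```

For instance, `segment_file("wiki_zh_tw.txt", "wiki_seg.txt", load_stopwords("stopwords.txt"))` would produce a space-separated corpus, where `wiki_seg.txt` and `stopwords.txt` are placeholder names. Keeping stop words in a set makes each membership check constant time, which matters on a corpus of this size.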

5. Train the model with gensim's word2vec.

python3 train.py
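
What train.py does internally is not shown, so treat the following as a sketch under stated assumptions rather than the project's script. It assumes gensim 4.x (where the dimensionality parameter is `vector_size`; 3.x called it `size`) and a corpus file with one space-separated sentence per line. The class `SegmentedCorpus`, the function `train`, and the default of 100 dimensions are choices made for this example.

```python
class SegmentedCorpus:
    """Restartable iterable over a file of space-separated tokens, one sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every pass lets Word2Vec iterate several times.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:
                    yield tokens


def train(corpus_path, model_path, vector_size=100):
    """Train a word2vec model on a segmented corpus and save it."""
    # Imported lazily so SegmentedCorpus works without gensim installed.
    from gensim.models import Word2Vec

    # gensim 4.x names the dimensionality `vector_size`; 3.x called it `size`.
    model = Word2Vec(sentences=SegmentedCorpus(corpus_path),
                     vector_size=vector_size)
    model.save(model_path)
    return model
```

Word2Vec makes more than one pass over the data, which is why the corpus is a class with `__iter__` rather than a one-shot generator. gensim's own `LineSentence` class reads the same file format and can be used in place of `SegmentedCorpus`.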

6. Test the trained model.

python3 demo.py
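
As an illustration of what testing a trained model can involve, the sketch below loads a saved model and asks for the nearest neighbours of a word. It assumes gensim 4.x and a model saved with `model.save`; the function names and the model path you pass in are hypothetical. The small `cosine_similarity` helper shows the measure that word2vec similarity queries are based on.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def query(model_path, word, topn=10):
    """Load a saved model and return the `topn` nearest words to `word`."""
    # Imported lazily so cosine_similarity works without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec.load(model_path)
    # most_similar raises KeyError if `word` is not in the vocabulary.
    return model.wv.most_similar(word, topn=topn)
```

`query` returns a list of `(word, similarity)` pairs. Words that never survived segmentation or fell below the model's minimum frequency will not be in the vocabulary, so wrap the call in a `try`/`except KeyError` when taking input from users.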