zake7749 / Word2vec Tutorial
Licence: MIT
A tutorial on training Chinese word vectors
Stars: ✭ 426
Programming Languages
python
Training Chinese word vectors with gensim
Tutorial document
Requirements
- jieba
pip3 install jieba
- gensim
pip3 install -U gensim
- OpenCC (can be swapped for any Simplified/Traditional Chinese conversion package)
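Training procedure
1. Download a Chinese Wikipedia dump. This experiment used the dump from 2016/8/20.
The August 20 dump is no longer available. Go to Wikipedia:Database download (維基百科:資料庫下載) and pick a more recent dump by date (choose the file whose name ends in pages-articles.xml.bz2).
2. Place the downloaded dump in the same directory as the project, then run wiki_to_txt.py to extract the Wikipedia articles from the XML: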
python3 wiki_to_txt.py zhwiki-20160820-pages-articles.xml.bz2
If you are not using the August 20 dump, replace zhwiki-20160820-pages-articles.xml.bz2 with the file name of the dump you downloaded.
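wiki_to_txt.py itself is not reproduced in this document. As a rough illustration of what extracting articles from the dump involves, the stdlib-only sketch below streams the compressed XML and yields each page's title and raw wikitext. The function name iter_wiki_pages is hypothetical, and unlike a full extractor (gensim's WikiCorpus class is one commonly used option) it does not strip wiki markup.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_wiki_pages(dump_path):
    """Yield (title, raw_wikitext) for every <page> in a .xml.bz2 MediaWiki dump."""
    title, text = None, ""
    with bz2.open(dump_path, "rb") as dump:
        # iterparse reads the file incrementally instead of loading it all at once.
        for _event, elem in ET.iterparse(dump, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # release the finished page's children
```

Calling iter_wiki_pages("zhwiki-20160820-pages-articles.xml.bz2") would then let a script write one article per line to a text file; the markup cleaning that a real extractor performs is left out here.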
3. Use OpenCC to convert all the Wikipedia articles to Traditional Chinese:
opencc -i wiki_texts.txt -o wiki_zh_tw.txt -c s2tw.json
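If the conversion has to run from a Python script rather than the shell, the same command can be wrapped with subprocess. This is only a sketch: it assumes the opencc command-line tool is installed and on PATH, and both helper names are hypothetical.

```python
import subprocess


def opencc_command(src, dst, config="s2tw.json"):
    """Build the argument list for the opencc command-line call."""
    # s2tw.json: Simplified Chinese to Traditional Chinese (Taiwan standard).
    return ["opencc", "-i", src, "-o", dst, "-c", config]


def convert_to_traditional(src="wiki_texts.txt", dst="wiki_zh_tw.txt"):
    """Run opencc; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(opencc_command(src, dst), check=True)
```

Keeping the command construction in its own function makes it easy to swap in another configuration file or another conversion tool.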
4. Use jieba to segment the text into words and remove stop words:
python3 segment.py
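segment.py is not shown here, so the following is only a minimal sketch of the step it describes: cut each line with jieba and drop stop words. The function names and the output file name wiki_seg.txt used in the note below are assumptions for illustration. The tokenizer is passed in as a parameter so that jieba can be replaced by another segmenter.

```python
def segment_line(line, stopwords, cut=None):
    """Split one line into words, dropping stop words and blank tokens."""
    if cut is None:
        import jieba  # imported lazily; any callable that returns tokens works

        cut = jieba.cut
    return [w for w in cut(line.strip()) if w.strip() and w not in stopwords]


def segment_file(src, dst, stopwords, cut=None):
    """Write one space-separated, segmented line to dst per line of src."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(segment_line(line, stopwords, cut)) + "\n")
```

A call such as segment_file("wiki_zh_tw.txt", "wiki_seg.txt", stopwords), with stopwords loaded into a set from a stop-word list of your choice, would produce the space-separated corpus that word2vec training expects.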
5. Train the word2vec model with gensim:
python3 train.py
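train.py is not reproduced here. A minimal training sketch, assuming gensim 4.x (where the dimensionality parameter is named vector_size) and a segmented corpus with one space-separated sentence per line, might look like this. The function name, file names and hyperparameter values are illustrative assumptions, not the author's settings.

```python
def train_word2vec(corpus_path, model_path, vector_size=100, min_count=5):
    """Train word2vec on a file with one space-separated sentence per line."""
    # Imported lazily so this file still loads where gensim is not installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence(corpus_path)  # streams the corpus line by line
    model = Word2Vec(sentences, vector_size=vector_size, min_count=min_count)
    model.save(model_path)  # reload later with Word2Vec.load(model_path)
    return model
```

Words that occur fewer than min_count times are dropped from the vocabulary, so very small test corpora need min_count=1.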
6. Test the trained model:
python3 demo.py
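demo.py is not shown in this document. A common way to sanity-check trained vectors is to look up a word's nearest neighbours, which gensim ranks by cosine similarity. The sketch below assumes gensim 4.x and a model saved with Word2Vec.save; both function names are hypothetical, and cosine_similarity is included only to show what the score measures.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(model_path, word, topn=10):
    """Return (word, similarity) pairs, or [] if the word is out of vocabulary."""
    from gensim.models import Word2Vec  # lazy import; assumes gensim 4.x

    wv = Word2Vec.load(model_path).wv
    if word not in wv.key_to_index:
        return []
    return wv.most_similar(word, topn=topn)
```

Checking the vocabulary first avoids the KeyError that gensim raises for words that were filtered out during training.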