
Alex-CHUN-YU / Word2vec

Train Chinese word vectors with Word2vec. Word2vec was created by a team of researchers led by Tomas Mikolov at Google.

Projects that are alternatives of or similar to Word2vec

Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (+293.75%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Log Anomaly Detector
Log Anomaly Detection - machine learning to detect abnormal event logs
Stars: ✭ 169 (+252.08%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Aravec
AraVec is a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models.
Stars: ✭ 239 (+397.92%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Twitter sentiment analysis word2vec convnet
Twitter Sentiment Analysis with Gensim Word2Vec and Keras Convolutional Network
Stars: ✭ 24 (-50%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+1545.83%)
Mutual labels:  jupyter-notebook, word2vec, gensim
wikidata-corpus
Train Wikidata with word2vec for word embedding tasks
Stars: ✭ 109 (+127.08%)
Mutual labels:  wikidata, word2vec
word2vec-pt-br
Implementation and model generated by (trigram) training on the Portuguese (pt-br) Wikipedia
Stars: ✭ 34 (-29.17%)
Mutual labels:  word2vec, gensim
wordfish-python
extract relationships from standardized terms from corpus of interest with deep learning 🐟
Stars: ✭ 19 (-60.42%)
Mutual labels:  word2vec, gensim
Deep learning nlp
Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP
Stars: ✭ 407 (+747.92%)
Mutual labels:  jupyter-notebook, word2vec
doc2vec-api
document embedding and machine learning script for beginners
Stars: ✭ 92 (+91.67%)
Mutual labels:  word2vec, gensim
Text summurization abstractive methods
Multiple implementations of abstractive text summarization, using Google Colab
Stars: ✭ 359 (+647.92%)
Mutual labels:  jupyter-notebook, word2vec
Word2vec Tutorial
Chinese word vector training tutorial
Stars: ✭ 426 (+787.5%)
Mutual labels:  word2vec, gensim
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (+8.33%)
Mutual labels:  word2vec, gensim
walklets
A lightweight implementation of Walklets from "Don't Walk, Skip! Online Learning of Multi-scale Network Embeddings" (ASONAM 2017).
Stars: ✭ 94 (+95.83%)
Mutual labels:  word2vec, gensim
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-37.5%)
Mutual labels:  word2vec, gensim
biovec
ProtVec can be used in protein interaction predictions, structure prediction, and protein data visualization.
Stars: ✭ 23 (-52.08%)
Mutual labels:  word2vec, gensim
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+741.67%)
Mutual labels:  word2vec, gensim
Crime Analysis
Association Rule Mining from Spatial Data for Crime Analysis
Stars: ✭ 20 (-58.33%)
Mutual labels:  jupyter-notebook, gensim
Servenet
Service Classification based on Service Description
Stars: ✭ 21 (-56.25%)
Mutual labels:  jupyter-notebook, word2vec
word-embeddings-from-scratch
Creating word embeddings from scratch and visualize them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (-54.17%)
Mutual labels:  word2vec, gensim

Word2Vec

Word2vec is trained in an unsupervised fashion: the larger and more comprehensive the training corpus, the better the resulting vectors tend to be. Of course, garbage input will still produce garbage output. Because the dataset used here is large, please be patient at every step. (Note: if every distinct character were treated as its own dimension, a vocabulary of 4,000 characters would require 4,000-dimensional one-hot vectors; word2vec can therefore be seen as a way to reduce this dimensionality.)

Wikipedia summary: Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

By the way: skip-gram works well with small amounts of training data and represents even rare words or phrases well; CBOW is several times faster to train than skip-gram and has slightly better accuracy for frequent words.

Usage

Input:

1. Download the wiki dump (see the Dataset section below)
2. Enter the Word2Vec directory
3. Run python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2 (converts the wiki XML dump to plain text)
4. Run python segmentation.py (converts Simplified to Traditional Chinese, then segments the text and filters stop words; this takes a while because the file is large)
5. Run python train.py (trains and saves the model; this also takes a while)
6. Run python main.py (loads the model and accepts word queries)
Note: if you run into encoding problems when running Python in the Windows cmd, run chcp 65001 first to switch to UTF-8.

Output:

1. Enter one word to get its five most similar words
2. Enter two words to compute the similarity between them
3. Enter three words to solve an analogy, e.g. 爸爸 is to 老公 as 媽媽 is to 老婆

Input format (e.g. 爸爸,媽媽,... note: at most three words)
老師
Top 5 most similar words:
班導,0.6360481977462769
班導師,0.6360464096069336
代課,0.6358826160430908
級任,0.6271134614944458
班主任,0.6270170211791992

Input format (e.g. 爸爸,媽媽,... note: at most three words)
爸爸,媽媽
Cosine similarity between the two words:
0.780765200371

Input format (e.g. 爸爸,媽媽,... note: at most three words)
爸爸,老公,媽媽
爸爸 is to 老公 as 媽媽 is to:
老婆,0.5401346683502197
蠢萌,0.5245970487594604
夠秤,0.5059393048286438
駁命,0.4888317286968231
孔爵,0.4857243597507477

Dataset (wiki data)

Use a dump file ending in pages-articles.xml.bz2; here zhwiki-latest-pages-articles.xml.bz2 is used.

Development Environment

Python 3.5.2
pip install gensim
pip install jieba
pip install hanziconv
Note: if gensim fails to install, check your OS and install the matching gensim version.
