
Alex-CHUN-YU / Word2vec

Train Chinese word vectors with Word2vec. Word2vec was created by a team of researchers led by Tomas Mikolov at Google.

Projects that are alternatives of or similar to Word2vec

Germanwordembeddings
Toolkit to obtain and preprocess German corpora, train models using word2vec (gensim) and evaluate them with generated test sets
Stars: ✭ 189 (+293.75%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Log Anomaly Detector
Log Anomaly Detection - machine learning to detect abnormal event logs
Stars: ✭ 169 (+252.08%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Aravec
AraVec is a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models.
Stars: ✭ 239 (+397.92%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Twitter sentiment analysis word2vec convnet
Twitter Sentiment Analysis with Gensim Word2Vec and Keras Convolutional Network
Stars: ✭ 24 (-50%)
Mutual labels:  jupyter-notebook, word2vec, gensim
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+1545.83%)
Mutual labels:  jupyter-notebook, word2vec, gensim
wikidata-corpus
Train Wikidata with word2vec for word embedding tasks
Stars: ✭ 109 (+127.08%)
Mutual labels:  wikidata, word2vec
word2vec-pt-br
Implementation and model generated by (trigram) training on the Portuguese (pt-br) Wikipedia
Stars: ✭ 34 (-29.17%)
Mutual labels:  word2vec, gensim
wordfish-python
extract relationships from standardized terms from corpus of interest with deep learning 🐟
Stars: ✭ 19 (-60.42%)
Mutual labels:  word2vec, gensim
Deep learning nlp
Keras, PyTorch, and NumPy Implementations of Deep Learning Architectures for NLP
Stars: ✭ 407 (+747.92%)
Mutual labels:  jupyter-notebook, word2vec
doc2vec-api
document embedding and machine learning script for beginners
Stars: ✭ 92 (+91.67%)
Mutual labels:  word2vec, gensim
Text summurization abstractive methods
Multiple implementations of abstractive text summarization, using Google Colab
Stars: ✭ 359 (+647.92%)
Mutual labels:  jupyter-notebook, word2vec
Word2vec Tutorial
Chinese word vector training tutorial
Stars: ✭ 426 (+787.5%)
Mutual labels:  word2vec, gensim
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (+8.33%)
Mutual labels:  word2vec, gensim
walklets
A lightweight implementation of Walklets from "Don't Walk, Skip! Online Learning of Multi-scale Network Embeddings" (ASONAM 2017).
Stars: ✭ 94 (+95.83%)
Mutual labels:  word2vec, gensim
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-37.5%)
Mutual labels:  word2vec, gensim
biovec
ProtVec can be used in protein interaction predictions, structure prediction, and protein data visualization.
Stars: ✭ 23 (-52.08%)
Mutual labels:  word2vec, gensim
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+741.67%)
Mutual labels:  word2vec, gensim
Crime Analysis
Association Rule Mining from Spatial Data for Crime Analysis
Stars: ✭ 20 (-58.33%)
Mutual labels:  jupyter-notebook, gensim
Servenet
Service Classification based on Service Description
Stars: ✭ 21 (-56.25%)
Mutual labels:  jupyter-notebook, word2vec
word-embeddings-from-scratch
Creating word embeddings from scratch and visualize them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (-54.17%)
Mutual labels:  word2vec, gensim

Word2Vec

Word2vec is trained in an unsupervised fashion: the larger and more comprehensive the training corpus, the better the resulting vectors tend to be. Of course, garbage input will still produce garbage output. Because the dataset used here is large, please be patient at every step. (Note: if every distinct character were treated as its own dimension, a vocabulary of 4,000 characters would require 4,000-dimensional one-hot vectors; word2vec can therefore be seen as a way to reduce this dimensionality.)

Wikipedia summary: Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

By the way: skip-gram works well with small amounts of training data and represents even rare words or phrases well; CBOW is several times faster to train than skip-gram and has slightly better accuracy for frequent words.

Usage

Input:

1. Download the wiki dump (see the Dataset section below)
2. Enter the Word2Vec directory
3. Run python wiki_to_txt.py zhwiki-latest-pages-articles.xml.bz2 (converts the wiki XML dump to plain text)
4. Run python segmentation.py (converts Simplified to Traditional Chinese, then segments the text and filters stop words; this takes a while because the file is large)
5. Run python train.py (trains and saves the model; this also takes a while)
6. Run python main.py (loads the model and accepts word queries)
Note: if you run into encoding problems when running Python in the Windows cmd, run chcp 65001 first to switch to UTF-8.

Output:

1. Enter one word to get its five most similar words
2. Enter two words to compute the similarity between them
3. Enter three words to solve an analogy, e.g. 爸爸 is to 老公 as 媽媽 is to 老婆

Input format (e.g. 爸爸,媽媽,... note: at most three words)
老師
Top 5 most similar words:
班導,0.6360481977462769
班導師,0.6360464096069336
代課,0.6358826160430908
級任,0.6271134614944458
班主任,0.6270170211791992

Input format (e.g. 爸爸,媽媽,... note: at most three words)
爸爸,媽媽
Cosine similarity between the two words:
0.780765200371

Input format (e.g. 爸爸,媽媽,... note: at most three words)
爸爸,老公,媽媽
爸爸 is to 老公 as 媽媽 is to:
老婆,0.5401346683502197
蠢萌,0.5245970487594604
夠秤,0.5059393048286438
駁命,0.4888317286968231
孔爵,0.4857243597507477

Dataset (wiki data)

Use a dump file ending in pages-articles.xml.bz2; here zhwiki-latest-pages-articles.xml.bz2 is used.

Development Environment

Python 3.5.2
pip install gensim
pip install jieba
pip install hanziconv
Note: if gensim fails to install, check your OS and install the matching gensim version.
