All Projects → akoksal → Turkish Word2vec

akoksal / Turkish Word2vec

Licence: mit
Pre-trained Word2Vec Model for Turkish

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Turkish Word2vec

wordfish-python
extract relationships from standardized terms from corpus of interest with deep learning 🐟
Stars: ✭ 19 (-86.03%)
Mutual labels:  word2vec, gensim
Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (-6.62%)
Mutual labels:  word2vec, gensim
Lmdb Embeddings
Fast word vectors with little memory usage in Python
Stars: ✭ 404 (+197.06%)
Mutual labels:  word2vec, gensim
RolX
An alternative implementation of Recursive Feature and Role Extraction (KDD11 & KDD12)
Stars: ✭ 52 (-61.76%)
Mutual labels:  word2vec, gensim
Sense2vec
🦆 Contextually-keyed word vectors
Stars: ✭ 1,184 (+770.59%)
Mutual labels:  word2vec, gensim
word2vec-pt-br
Implementação e modelo gerado com o treinamento (trigram) da wikipedia em pt-br
Stars: ✭ 34 (-75%)
Mutual labels:  word2vec, gensim
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+480.88%)
Mutual labels:  word2vec, gensim
doc2vec-api
document embedding and machine learning script for beginners
Stars: ✭ 92 (-32.35%)
Mutual labels:  word2vec, gensim
Word2vec
訓練中文詞向量 Word2vec, Word2vec was created by a team of researchers led by Tomas Mikolov at Google.
Stars: ✭ 48 (-64.71%)
Mutual labels:  word2vec, gensim
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-68.38%)
Mutual labels:  word2vec, gensim
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+925%)
Mutual labels:  word2vec, gensim
Nlp Journey
Documents, papers and codes related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classificatin, Text Generation, Text Similarity, Machine Translation),etc. All codes are implemented intensorflow 2.0.
Stars: ✭ 1,290 (+848.53%)
Mutual labels:  word2vec, gensim
walklets
A lightweight implementation of Walklets from "Don't Walk Skip! Online Learning of Multi-scale Network Embeddings" (ASONAM 2017).
Stars: ✭ 94 (-30.88%)
Mutual labels:  word2vec, gensim
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-77.94%)
Mutual labels:  word2vec, gensim
biovec
ProtVec can be used in protein interaction predictions, structure prediction, and protein data visualization.
Stars: ✭ 23 (-83.09%)
Mutual labels:  word2vec, gensim
Word2vec Tutorial
中文詞向量訓練教學
Stars: ✭ 426 (+213.24%)
Mutual labels:  word2vec, gensim
word-embeddings-from-scratch
Creating word embeddings from scratch and visualize them on TensorBoard. Using trained embeddings in Keras.
Stars: ✭ 22 (-83.82%)
Mutual labels:  word2vec, gensim
Word2VecAndTsne
Scripts demo-ing how to train a Word2Vec model and reduce its vector space
Stars: ✭ 45 (-66.91%)
Mutual labels:  word2vec, gensim
Twitter sentiment analysis word2vec convnet
Twitter Sentiment Analysis with Gensim Word2Vec and Keras Convolutional Network
Stars: ✭ 24 (-82.35%)
Mutual labels:  word2vec, gensim
Musae
The reference implementation of "Multi-scale Attributed Node Embedding".
Stars: ✭ 75 (-44.85%)
Mutual labels:  word2vec, gensim

Turkish Pre-trained Word2Vec Model

(Turkish version is below. / Türkçe için aşağıya bakın.)

This tutorial introduces how to train word2vec model for Turkish language from Wikipedia dump. This code is written in Python 3 by using gensim library. Turkish is an agglutinative language and there are many words with the same lemma and different suffixes in the wikipedia corpus. I will write Turkish lemmatizer to increase quality of the model.

You can checkout wiki-page for more details. If you just want to download the pretrained model you can use this link and you can look for examples in 5. Using Word2Vec Model and Examples page in github wiki. Some of them are below:

word_vectors.most_similar(positive=["kral","kadın"],negative=["erkek"])

This is a classic example for word2vec. The most similar word vector for king+woman-man is queen as expected. Second one is "of king(kralı)", third one is "king's(kralın)". If the model was trained with lemmatization tool for Turkish language, the results would be more clear.



word_vectors.most_similar(positive=["geliyor","gitmek"],negative=["gelmek"])

Turkish is an aggluginative language. I have investigated this property. I analyzed most similar vector for +geliyor(he/she/it is coming)-gelmek(to come)+gitmek(to go). Most similar vector is gidiyor(he/she/it is going) as expected. Second one is "I am going". Third one is "lets go". So, we can see effects of tense and possesive suffixes in word2vec models.

Eğitilmiş Türkçe Word2Vec Modeli

Bu çalışma Wikipedia'daki Türkçe makalelerden Türkçe word2vec modelinin nasıl çıkarılabileceğini anlatmak için yapılmıştır. Kod gensim kütüphanesi kullanılarak Python 3 ile yazılmıştır. Gelecek zamanlarda, Türkçe "lemmatization" algoritmasıyla aynı kök ve yapım ekleri fakat farklı çekim eklerine sahip kelimelerin aynı kelimeye işaret etmesi sağlanarak modelin kalitesi arttırılacaktır.

Ayrıntılar için github wiki sayfasını ziyaret edebilirsiniz. Eğer sadece eğitilmiş modeli kullanmak isterseniz buradan indirebilirsiniz. Aynı zamanda örneklere bakmak için github wikisinde bulunan 5. Word2Vec Modelini Kullanmak/Örnekler sayfasına bakabilirsiniz. Bazı örnekler aşağıda mevcuttur:

word_vectors.most_similar(positive=["kral","kadın"],negative=["erkek"])

Bu word2vec için klasik bir örnektir. Kral kelime vektöründen erkek kelime vektörü çıkarılıp kadın eklendiğinde en yakın kelime vektörü kraliçe oluyor. Benzerlerin bir çoğu da kral ve kraliçenin ek almış halleri oluyor. Türkçe sondan eklemeli bir dil olduğu için bazı sonuçlar beklenildiği gibi çıkmayabiliyor. Eğer word2vec'i kelimelerin lemmalarını bularak eğitebilseydik, çok daha temiz sonuçlar elde edebilirdik.



word_vectors.most_similar(positive=["geliyor","gitmek"],negative=["gelmek"])

Bu örnekte ise filler için zaman eklerinin etkisini inceledik. En benzer kelime vektörleri beklenen sonuç ile alakalı çıktı.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].