Kyubyong / Wordvectors

License: MIT

Pre-trained word vectors of 30+ languages

Programming Languages

python

shell

Projects that are alternatives of or similar to Wordvectors

Gensim

Topic Modelling for Humans

Stars: ✭ 12,763 (+524.72%)

Mutual labels: word2vec, fasttext

Vectorsinsearch

Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015

Stars: ✭ 71 (-96.52%)

Mutual labels: vector, word2vec

Lmdb Embeddings

Fast word vectors with little memory usage in Python

Stars: ✭ 404 (-80.23%)

Mutual labels: word2vec, fasttext

Persian-Sentiment-Analyzer

Persian sentiment analysis

Stars: ✭ 30 (-98.53%)

Mutual labels: word2vec, fasttext

Fasttext.js

FastText for Node.js

Stars: ✭ 127 (-93.78%)

Mutual labels: word2vec, fasttext

Vectorhub

Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)

Stars: ✭ 317 (-84.48%)

Mutual labels: vector, word2vec

Finalfusion Rust

finalfusion embeddings in Rust

Stars: ✭ 35 (-98.29%)

Mutual labels: word2vec, fasttext

Simple-Sentence-Similarity

Exploring the simple sentence similarity measurements using word embeddings

Stars: ✭ 99 (-95.15%)

Mutual labels: word2vec, fasttext

Nlp

By Dou Ge (兜哥): an open-source introductory book on NLP

Stars: ✭ 1,677 (-17.91%)

Mutual labels: word2vec, fasttext

Magnitude

A fast, efficient universal vector embedding utility package.

Stars: ✭ 1,394 (-31.77%)

Mutual labels: word2vec, fasttext

word embedding

Sample code for training Word2Vec and FastText on a wiki corpus, plus their pretrained word embeddings.

Stars: ✭ 21 (-98.97%)

Mutual labels: word2vec, fasttext

Wordembeddings Elmo Fasttext Word2vec

Using pre-trained word embeddings (Fasttext, Word2Vec)

Stars: ✭ 146 (-92.85%)

Mutual labels: word2vec, fasttext

Embedding

Code and study notes for embedding models

Stars: ✭ 25 (-98.78%)

Mutual labels: word2vec, fasttext

Embedding As Service

One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques

Stars: ✭ 151 (-92.61%)

Mutual labels: word2vec, fasttext

NLP-paper

🎨 NLP (natural language processing) tutorial 🎨 https://dataxujing.github.io/NLP-paper/

Stars: ✭ 23 (-98.87%)

Mutual labels: word2vec, fasttext

Neural Networks

All about Neural Networks!

Stars: ✭ 34 (-98.34%)

Mutual labels: word2vec, fasttext

Shallowlearn

An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.

Stars: ✭ 196 (-90.41%)

Mutual labels: word2vec, fasttext

Cw2vec

cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

Stars: ✭ 224 (-89.04%)

Mutual labels: word2vec, fasttext

Nlp Journey

Documents, papers and code related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All code is implemented in TensorFlow 2.0.

Stars: ✭ 1,290 (-36.86%)

Mutual labels: word2vec, fasttext

Nlp research

NLP research: a TensorFlow-based NLP deep learning project supporting four tasks: text classification, sentence matching, sequence labeling, and text generation

Stars: ✭ 141 (-93.1%)

Mutual labels: word2vec, fasttext

Pre-trained word vectors of 30+ languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has received far more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors without effort. I think it's time to turn our eyes to a multilingual version of this.

Near the end of the work, I learned that a similar project named polyglot already exists. How embarrassing! I strongly encourage you to check out this great project. Nevertheless, I decided to release this one. You will find that it has its own flavor, after all.

Requirements

nltk >= 1.11.1
regex >= 2016.6.24
lxml >= 3.3.3
numpy >= 1.11.2
konlpy >= 0.4.4 (Only for Korean)
mecab (Only for Japanese)
pythai >= 0.1.3 (Only for Thai)
pyvi >= 0.0.7.2 (Only for Vietnamese)
jieba >= 0.38 (Only for Chinese)
gensim >= 0.13.1 (for Word2Vec)
fastText (for fasttext)
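
To see which of the Python packages listed above are present before running the scripts, something like the following may help. It is a minimal sketch using only the standard library; the helper names (installed_version, report) are made up for illustration and are not part of this project. mecab and fastText are left out on the assumption that they are installed as external tools rather than as Python distributions.

```python
from importlib import metadata

# Distribution names copied from the requirements list above.
CORE = ["nltk", "regex", "lxml", "numpy", "gensim"]
# Packages needed only for one language.
OPTIONAL = {
    "konlpy": "Korean",
    "pythai": "Thai",
    "pyvi": "Vietnamese",
    "jieba": "Chinese",
}


def installed_version(name):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def report():
    """Build one status line per package in the lists above."""
    lines = []
    for name in CORE:
        lines.append(f"{name}: {installed_version(name) or 'MISSING'}")
    for name, language in OPTIONAL.items():
        status = installed_version(name) or "not installed"
        lines.append(f"{name} ({language} only): {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The sketch only reports what is installed; comparing the reported versions against the minimums above is left to the reader.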

Background / References

Check this to know what word embedding is.
Check this to quickly get a picture of Word2vec.
Check this to install fastText.
Watch this to really understand what's happening under the hood of Word2vec.
Go get various English word vectors here if needed.

Work Flow

STEP 1. Download the Wikipedia database backup dump of the language you want.
STEP 2. Extract the running text to the data/ folder.
STEP 3. Run build_corpus.py.
STEP 4-1. Run make_wordvector.sh to get Word2Vec word vectors.
STEP 4-2. Run fasttext.sh to get fastText word vectors.
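
Steps 3 and 4-1 can be pictured roughly as follows. This is a simplified sketch, not the actual build_corpus.py or make_wordvector.sh: the function names, the cleaning rules, and the default parameters are all assumptions made for illustration. It splits on whitespace only, so languages such as Chinese, Japanese, Korean, Thai, and Vietnamese would first need the segmenters listed under Requirements. The training call assumes gensim 4.x, where the dimension parameter is named vector_size (older releases call it size).

```python
import re


def clean_line(line):
    """Lowercase a line and replace punctuation with spaces."""
    line = line.lower()
    line = re.sub(r"[^\w\s]", " ", line)  # \w is Unicode-aware for str patterns
    line = line.replace("_", " ")         # underscore counts as \w, drop it too
    return " ".join(line.split())         # collapse runs of whitespace


def build_corpus(in_path, out_path, min_tokens=2):
    """Write one cleaned sentence per line; return how many lines were kept."""
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = clean_line(line)
            if len(cleaned.split()) >= min_tokens:
                dst.write(cleaned + "\n")
                kept += 1
    return kept


def train_word2vec(corpus_path, model_path, vector_size=300, min_count=5):
    """Train and save a Word2Vec model (assumes gensim 4.x is installed)."""
    from gensim.models import Word2Vec  # imported lazily; only needed here

    model = Word2Vec(
        corpus_file=corpus_path,  # one whitespace-tokenized sentence per line
        vector_size=vector_size,
        min_count=min_count,
        workers=4,
    )
    model.save(model_path)
    return model
```

A call such as build_corpus("data/extracted.txt", "data/corpus.txt") followed by train_word2vec("data/corpus.txt", "data/model.bin") would mirror the two steps; both file names are hypothetical.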

Pre-trained models

Two types of pre-trained models are provided. In the table, (w) marks the word2vec model and (f) marks the fastText model.

Language	ISO 639-1	Vector Size	Corpus Size	Vocabulary Size
Bengali (w) | Bengali (f)	bn	300	147M	10059
Catalan (w) | Catalan (f)	ca	300	967M	50013
Chinese (w) | Chinese (f)	zh	300	1G	50101
Danish (w) | Danish (f)	da	300	295M	30134
Dutch (w) | Dutch (f)	nl	300	1G	50160
Esperanto (w) | Esperanto (f)	eo	300	1G	50597
Finnish (w) | Finnish (f)	fi	300	467M	30029
French (w) | French (f)	fr	300	1G	50130
German (w) | German (f)	de	300	1G	50006
Hindi (w) | Hindi (f)	hi	300	323M	30393
Hungarian (w) | Hungarian (f)	hu	300	692M	40122
Indonesian (w) | Indonesian (f)	id	300	402M	30048
Italian (w) | Italian (f)	it	300	1G	50031
Japanese (w) | Japanese (f)	ja	300	1G	50108
Javanese (w) | Javanese (f)	jv	100	31M	10019
Korean (w) | Korean (f)	ko	200	339M	30185
Malay (w) | Malay (f)	ms	100	173M	10010
Norwegian (w) | Norwegian (f)	no	300	1G	50209
Norwegian Nynorsk (w) | Norwegian Nynorsk (f)	nn	100	114M	10036
Polish (w) | Polish (f)	pl	300	1G	50035
Portuguese (w) | Portuguese (f)	pt	300	1G	50246
Russian (w) | Russian (f)	ru	300	1G	50102
Spanish (w) | Spanish (f)	es	300	1G	50003
Swahili (w) | Swahili (f)	sw	100	24M	10222
Swedish (w) | Swedish (f)	sv	300	1G	50052
Tagalog (w) | Tagalog (f)	tl	100	38M	10068
Thai (w) | Thai (f)	th	300	696M	30225
Turkish (w) | Turkish (f)	tr	200	370M	30036
Vietnamese (w) | Vietnamese (f)	vi	100	74M	10087
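
Once a model is downloaded, a nearest-neighbour lookup is a quick sanity check. The sketch below assumes the vectors are available in the common word2vec text layout (a header line with vocabulary size and dimension, then one word and its values per line); whether a given download uses that layout is an assumption, and the function names are illustrative only. It uses plain Python lists, so it is meant for understanding rather than speed.

```python
import math


def load_vec(lines):
    """Parse word2vec text format from an iterable of lines.

    First line: "<vocab_size> <dim>". Each later line: a word, then dim floats.
    """
    it = iter(lines)
    _vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=3):
    """Return the topn (word, similarity) pairs closest to the given word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


if __name__ == "__main__":
    # Toy data standing in for a real file.
    toy = ["3 2", "king 1.0 0.0", "queen 0.9 0.1", "apple 0.0 1.0"]
    print(most_similar(load_vec(toy), "king", topn=2))
```

With a real file, the call would look like load_vec(open("ko.vec", encoding="utf-8")), where ko.vec is a hypothetical file name.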
