Kyubyong / Wordvectors

Licence: MIT

Pre-trained word vectors of 30+ languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has received much more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors without effort. I think it's time to turn our eyes to a multilingual version of this.

Nearing the end of the work, I found out that a similar project named polyglot already exists. I strongly encourage you to check out this great project. How embarrassing! Nevertheless, I decided to release this project. You will find that my work has its own flavor, after all.

Requirements

  • nltk >= 1.11.1
  • regex >= 2016.6.24
  • lxml >= 3.3.3
  • numpy >= 1.11.2
  • konlpy >= 0.4.4 (Only for Korean)
  • mecab (Only for Japanese)
  • pythai >= 0.1.3 (Only for Thai)
  • pyvi >= 0.0.7.2 (Only for Vietnamese)
  • jieba >= 0.38 (Only for Chinese)
  • gensim >= 0.13.1 (for Word2Vec)
  • fastText (for fasttext)
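
As a rough illustration of how the list above splits into core, language-specific, and model-specific dependencies, here is a minimal sketch. The helper `required_packages` and the grouping into three tables are hypothetical; only the package names and the language notes come from the list itself, and treating the first four packages as needed for every language is an assumption.

```python
# Package names are copied from the requirements list above.
# The grouping and the helper function are illustrative assumptions.
CORE = ["nltk", "regex", "lxml", "numpy"]
TOKENIZERS = {
    "ko": "konlpy",  # only for Korean
    "ja": "mecab",   # only for Japanese
    "th": "pythai",  # only for Thai
    "vi": "pyvi",    # only for Vietnamese
    "zh": "jieba",   # only for Chinese
}
TRAINERS = {"word2vec": "gensim", "fasttext": "fastText"}


def required_packages(lang, model):
    """Return the packages needed for one ISO 639-1 language code and model type."""
    packages = list(CORE)
    if lang in TOKENIZERS:
        packages.append(TOKENIZERS[lang])
    packages.append(TRAINERS[model])
    return packages


print(required_packages("ko", "word2vec"))
print(required_packages("fr", "fasttext"))
```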

Background / References

  • Check this to know what word embedding is.
  • Check this to quickly get a picture of Word2vec.
  • Check this to install fastText.
  • Watch this to really understand what's happening under the hood of Word2vec.
  • Go get various English word vectors here if needed.

Work Flow

  • STEP 1. Download the Wikipedia database backup dump of the language you want.
  • STEP 2. Extract the running text into the data/ folder.
  • STEP 3. Run build_corpus.py.
  • STEP 4-1. Run make_wordvector.sh to get Word2Vec word vectors.
  • STEP 4-2. Run fasttext.sh to get fastText word vectors.
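
To give a feel for what the corpus-building step involves, here is a minimal sketch that turns extracted running text into one tokenized sentence per line, the usual input layout for Word2Vec and fastText training. It is not the project's build_corpus.py: the function `build_corpus_lines`, the `min_tokens` threshold, and the naive regex sentence splitter are all assumptions made for illustration. Languages such as Chinese, Japanese, Korean, Thai, and Vietnamese would need the dedicated tokenizers listed in the requirements instead of the `\w+` pattern used here.

```python
import re


def build_corpus_lines(raw_text, min_tokens=3):
    """Return lowercased, space-joined token sequences, one per sentence."""
    lines = []
    for paragraph in raw_text.splitlines():
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            # \w is Unicode-aware for str patterns, so accented letters survive.
            tokens = re.findall(r"\w+", sentence.lower())
            # Drop very short fragments such as headings or list markers.
            if len(tokens) >= min_tokens:
                lines.append(" ".join(tokens))
    return lines


sample = "Wikipedia is a free encyclopedia. It is written by volunteers!\nShort one."
for line in build_corpus_lines(sample):
    print(line)
```

Writing the returned lines to a text file gives a corpus that a training script can read line by line.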

Pre-trained models

Two types of pre-trained models are provided. w and f represent word2vec and fastText respectively.

| Language (models) | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size |
|---|---|---|---|---|
| Bengali (w) / Bengali (f) | bn | 300 | 147M | 10059 |
| Catalan (w) / Catalan (f) | ca | 300 | 967M | 50013 |
| Chinese (w) / Chinese (f) | zh | 300 | 1G | 50101 |
| Danish (w) / Danish (f) | da | 300 | 295M | 30134 |
| Dutch (w) / Dutch (f) | nl | 300 | 1G | 50160 |
| Esperanto (w) / Esperanto (f) | eo | 300 | 1G | 50597 |
| Finnish (w) / Finnish (f) | fi | 300 | 467M | 30029 |
| French (w) / French (f) | fr | 300 | 1G | 50130 |
| German (w) / German (f) | de | 300 | 1G | 50006 |
| Hindi (w) / Hindi (f) | hi | 300 | 323M | 30393 |
| Hungarian (w) / Hungarian (f) | hu | 300 | 692M | 40122 |
| Indonesian (w) / Indonesian (f) | id | 300 | 402M | 30048 |
| Italian (w) / Italian (f) | it | 300 | 1G | 50031 |
| Japanese (w) / Japanese (f) | ja | 300 | 1G | 50108 |
| Javanese (w) / Javanese (f) | jv | 100 | 31M | 10019 |
| Korean (w) / Korean (f) | ko | 200 | 339M | 30185 |
| Malay (w) / Malay (f) | ms | 100 | 173M | 10010 |
| Norwegian (w) / Norwegian (f) | no | 300 | 1G | 50209 |
| Norwegian Nynorsk (w) / Norwegian Nynorsk (f) | nn | 100 | 114M | 10036 |
| Polish (w) / Polish (f) | pl | 300 | 1G | 50035 |
| Portuguese (w) / Portuguese (f) | pt | 300 | 1G | 50246 |
| Russian (w) / Russian (f) | ru | 300 | 1G | 50102 |
| Spanish (w) / Spanish (f) | es | 300 | 1G | 50003 |
| Swahili (w) / Swahili (f) | sw | 100 | 24M | 10222 |
| Swedish (w) / Swedish (f) | sv | 300 | 1G | 50052 |
| Tagalog (w) / Tagalog (f) | tl | 100 | 38M | 10068 |
| Thai (w) / Thai (f) | th | 300 | 696M | 30225 |
| Turkish (w) / Turkish (f) | tr | 200 | 370M | 30036 |
| Vietnamese (w) / Vietnamese (f) | vi | 100 | 74M | 10087 |
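
Once a model is downloaded, a quick sanity check is to confirm that its vector size and vocabulary size match the table. The sketch below parses the plain-text vector layout that fastText writes to its .vec output (a header line with the word count and dimension, then one word and its values per line) and compares two words by cosine similarity. Whether a given download uses this layout is an assumption; binary or gensim-native files would have to be loaded with the corresponding library instead. The names `load_text_vectors` and `cosine` and the toy data are hypothetical.

```python
import io
import math


def load_text_vectors(handle):
    """Parse a 'count dim' header followed by 'word v1 ... v_dim' lines."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for a real file; a real call would pass an open file handle.
toy = io.StringIO("3 2\nking 1.0 0.0\nqueen 0.8 0.6\napple 0.0 1.0\n")
count, dim, vectors = load_text_vectors(toy)
print(count, dim, len(vectors))
print(cosine(vectors["king"], vectors["queen"]))
```

For a real model, the `dim` value read from the header should equal the Vector Size column for that language.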