dongjun-Lee / Kor2vec
License: MIT
Library for Korean morpheme and word vector representation
Stars: ✭ 64
Programming Languages
python
Projects that are alternatives of or similar to Kor2vec
Practical 1
Oxford Deep NLP 2017 course - Practical 1: word2vec
Stars: ✭ 220 (+243.75%)
Mutual labels: natural-language-processing, word2vec
Languagecrunch
LanguageCrunch NLP server docker image
Stars: ✭ 281 (+339.06%)
Mutual labels: natural-language-processing, word2vec
Hunspell Dict Ko
Korean spellchecking dictionary for Hunspell
Stars: ✭ 187 (+192.19%)
Mutual labels: korean, natural-language-processing
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+19842.19%)
Mutual labels: natural-language-processing, word2vec
Repo 2017
Python codes in Machine Learning, NLP, Deep Learning and Reinforcement Learning with Keras and Theano
Stars: ✭ 1,123 (+1654.69%)
Mutual labels: natural-language-processing, word2vec
Deep Math Machine Learning.ai
A blog about machine learning and deep learning algorithms and the math behind them, with machine learning algorithms written from scratch.
Stars: ✭ 173 (+170.31%)
Mutual labels: natural-language-processing, word2vec
Char Rnn Tensorflow
Multi-layer Recurrent Neural Networks for character-level language models, implemented in TensorFlow
Stars: ✭ 58 (-9.37%)
Mutual labels: korean, natural-language-processing
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (+2078.13%)
Mutual labels: natural-language-processing, word2vec
Cs224n
CS224n: Natural Language Processing with Deep Learning Assignments Winter, 2017
Stars: ✭ 656 (+925%)
Mutual labels: natural-language-processing, word2vec
Open Korean Text
Open Korean Text Processor - An Open-source Korean Text Processor
Stars: ✭ 438 (+584.38%)
Mutual labels: korean, natural-language-processing
Scattertext Pydata
Notebooks for the Seattle PyData 2017 talk on Scattertext
Stars: ✭ 132 (+106.25%)
Mutual labels: natural-language-processing, word2vec
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+1134.38%)
Mutual labels: natural-language-processing, word2vec
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+2590.63%)
Mutual labels: natural-language-processing, word2vec
Germanwordembeddings
Toolkit to obtain and preprocess german corpora, train models using word2vec (gensim) and evaluate them with generated testsets
Stars: ✭ 189 (+195.31%)
Mutual labels: natural-language-processing, word2vec
Awesome Embedding Models
A curated list of awesome embedding models tutorials, projects and communities.
Stars: ✭ 1,486 (+2221.88%)
Mutual labels: natural-language-processing, word2vec
Pytorch Bert Crf Ner
Korean named entity recognizer built with KoBERT and CRF (BERT+CRF based Named Entity Recognition model for Korean)
Stars: ✭ 236 (+268.75%)
Mutual labels: korean, natural-language-processing
Ja.text8
Japanese text8 corpus for word embedding.
Stars: ✭ 79 (+23.44%)
Mutual labels: natural-language-processing, word2vec
Repo 2016
R, Python and Mathematica Codes in Machine Learning, Deep Learning, Artificial Intelligence, NLP and Geolocation
Stars: ✭ 103 (+60.94%)
Mutual labels: natural-language-processing, word2vec
Natural Language Processing
Programming Assignments and Lectures for Stanford's CS 224: Natural Language Processing with Deep Learning
Stars: ✭ 377 (+489.06%)
Mutual labels: natural-language-processing, word2vec
Text2vec
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
Stars: ✭ 715 (+1017.19%)
Mutual labels: natural-language-processing, word2vec
kor2vec
Library for Korean morpheme and word vector representation.
Requirements
For training:
- Python 3
- TensorFlow
- numpy, scipy
- KoNLPy (Twitter)
For testing and visualization:
- gensim
- scikit-learn
- matplotlib
Model
We define each word as the set of its morphemes, and a word vector is represented by the sum of the vectors of its morphemes.
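Written as a formula, the definition above reads as follows. The symbols are notation introduced here for illustration, not taken from the project: $M(w)$ is the set of morphemes of word $w$, and $\mathbf{v}_m$ is the learned vector of morpheme $m$.

```latex
\mathbf{v}_w = \sum_{m \in M(w)} \mathbf{v}_m
```

For example, if the tagger splits 대통령이 into the morphemes 대통령 (Noun) and 이 (Josa), the word vector is the sum of those two morpheme vectors.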
Train Vectors
To learn morpheme vectors, run:
$ python3 train.py <input_corpus>
The <input_corpus> file must contain one sentence per line.
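If your raw text is not yet in the one-sentence-per-line layout, a small preprocessing step can produce it. The sketch below is only an illustration: the helper `to_one_sentence_per_line` and its punctuation-based splitting rule are assumptions made here, not part of kor2vec, and real corpora may need a proper sentence splitter.

```python
import re

# Hypothetical helper, not part of kor2vec: split on whitespace that
# follows sentence-final punctuation (., ! or ?).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_one_sentence_per_line(raw_text):
    # Collapse newlines and repeated spaces so that a sentence wrapped
    # across several lines is kept together.
    flat = " ".join(raw_text.split())
    sentences = [s.strip() for s in SENTENCE_END.split(flat) if s.strip()]
    return "\n".join(sentences)


raw = "대통령이 연설을 했다. 국민들이\n박수를 쳤다! 끝났나요?"
corpus = to_one_sentence_per_line(raw)
print(corpus)
```

The resulting string can be written to a UTF-8 text file and passed to train.py as the input corpus.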
Change Hyperparameters
$ python3 train.py -h
usage: train.py [-h] [--embedding_size EMBEDDING_SIZE]
                [--window_size WINDOW_SIZE] [--min_count MIN_COUNT]
                [--num_sampled NUM_SAMPLED] [--learning_rate LEARNING_RATE]
                [--sampling_rate SAMPLING_RATE] [--epochs EPOCHS]
                [--batch_size BATCH_SIZE]
                input

positional arguments:
  input                 input text file for training: one sentence per line

optional arguments:
  -h, --help            show this help message and exit
  --embedding_size EMBEDDING_SIZE
                        embedding vector size (default=150)
  --window_size WINDOW_SIZE
                        window size (default=5)
  --min_count MIN_COUNT
                        minimal number of word occurrences (default=5)
  --num_sampled NUM_SAMPLED
                        number of negatives sampled (default=50)
  --learning_rate LEARNING_RATE
                        learning rate (default=1.0)
  --sampling_rate SAMPLING_RATE
                        rate for subsampling frequent words (default=0.0001)
  --epochs EPOCHS       number of epochs (default=3)
  --batch_size BATCH_SIZE
                        batch size (default=150)
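To see how these flags override the defaults, here is a minimal argparse sketch that mirrors the help text above. It is a reconstruction for illustration only: the argument types (`int`, `float`) and the function name `build_parser` are assumptions, and the actual train.py may define its parser differently.

```python
import argparse


def build_parser():
    # Sketch only: flag names and defaults are copied from the help text;
    # the types are assumed.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("input", help="input text file: one sentence per line")
    parser.add_argument("--embedding_size", type=int, default=150)
    parser.add_argument("--window_size", type=int, default=5)
    parser.add_argument("--min_count", type=int, default=5)
    parser.add_argument("--num_sampled", type=int, default=50)
    parser.add_argument("--learning_rate", type=float, default=1.0)
    parser.add_argument("--sampling_rate", type=float, default=0.0001)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=150)
    return parser


# Equivalent to: python3 train.py corpus.txt --embedding_size 200 --epochs 5
args = build_parser().parse_args(
    ["corpus.txt", "--embedding_size", "200", "--epochs", "5"]
)
print(args.embedding_size, args.window_size, args.epochs)
```

Any flag that is not given on the command line keeps the default listed in the help output.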
Load Trained Morpheme Vectors
$ python3
>>> from gensim.models.keyedvectors import KeyedVectors
>>> pos_vectors = KeyedVectors.load_word2vec_format('pos.vec', binary=False)
>>> pos_vectors.most_similar("('대통령','Noun')")
Generate Word Vectors
A word vector is defined as the sum of its morphemes' vectors.
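$ python3
>>> from konlpy.tag import Twitter
>>> import numpy as np
>>> twitter = Twitter()
>>> word = "대통령이"
>>> # Split the word into (morpheme, tag) pairs
>>> pos_list = twitter.pos(word, norm=True)
>>> # Build each key, e.g. "('대통령','Noun')", and sum the vectors
>>> # (pos_vectors is the KeyedVectors object loaded in the previous section)
>>> word_vector = np.sum([pos_vectors[str(pos).replace(" ", "")] for pos in pos_list], axis=0)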
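The one-liner above raises a `KeyError` when a morpheme is missing from the vocabulary (for example, one that occurred fewer than `min_count` times). The sketch below shows one possible way to handle that by skipping unknown morphemes. The helper names `morpheme_key` and `word_vector` are hypothetical, and a plain dict of toy vectors stands in for the gensim `KeyedVectors` object, which supports the same `in` and `[]` operations.

```python
import numpy as np


def morpheme_key(pos):
    # Same key format as the vector file: "('대통령','Noun')",
    # i.e. the tuple repr with spaces removed.
    return str(pos).replace(" ", "")


def word_vector(pos_list, vectors, dim):
    # Sum the vectors of all known morphemes; unknown ones are skipped.
    keys = [morpheme_key(p) for p in pos_list]
    found = [vectors[k] for k in keys if k in vectors]
    if not found:
        # No known morpheme at all: fall back to a zero vector.
        return np.zeros(dim)
    return np.sum(found, axis=0)


# Toy 2-dimensional vectors, made up for this example.
toy_vectors = {
    "('대통령','Noun')": np.array([1.0, 2.0]),
    "('이','Josa')": np.array([0.5, -1.0]),
}
pos_list = [("대통령", "Noun"), ("이", "Josa"), ("없는말", "Noun")]
result = word_vector(pos_list, toy_vectors, dim=2)
print(result)
```

Skipping unknown morphemes is only one choice; depending on the task, raising an error or returning `None` may be preferable.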
Test Dataset
- Word Similarity Test: the WordSim 353 dataset translated into Korean. Words with ambiguous translations were excluded.
  - WordSim 353 Dataset: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html
- Word Analogy Test: 420 semantic pair questions and 840 syntactic pair questions, created for this project.
Test Morpheme Vectors
Similarity Test
Word similarity test using kor_ws353.csv.
$ python3 test/similarity_test.py pos.vec
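WordSim-353-style evaluations are usually scored by comparing the model's cosine similarities with the human ratings through a rank correlation. The sketch below illustrates that idea with numpy only. Whether test/similarity_test.py uses exactly this metric is an assumption, and the toy vectors and scores are made up rather than taken from kor_ws353.csv.

```python
import numpy as np


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def ranks(values):
    # Rank positions 0..n-1; ties are not averaged in this simplified version.
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(len(values))
    return result


def spearman(xs, ys):
    # Spearman correlation = Pearson correlation of the ranks.
    return float(np.corrcoef(ranks(xs), ranks(ys))[0, 1])


toy_vectors = {
    "호랑이": np.array([1.0, 0.0]),
    "고양이": np.array([0.9, 0.3]),
    "컴퓨터": np.array([0.0, 1.0]),
    "키보드": np.array([0.2, 0.9]),
}
# (word1, word2, human score): illustrative rows only.
pairs = [
    ("호랑이", "고양이", 7.0),
    ("컴퓨터", "키보드", 7.5),
    ("호랑이", "컴퓨터", 1.0),
    ("고양이", "키보드", 2.0),
]
model_scores = [cosine(toy_vectors[a], toy_vectors[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho = spearman(model_scores, human_scores)
print(rho)
```

A correlation close to 1 means the model orders the word pairs the same way the human raters do.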
Analogy Test (Semantic)
Word analogy test using kor_analogy_semantic.txt.
$ python3 test/analogy_test.py pos.vec
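A common way to answer an analogy question "a is to b as c is to ?" is to look for the vocabulary item closest to b - a + c by cosine similarity, excluding the three input words. The sketch below shows that approach on toy vectors. Whether test/analogy_test.py does exactly this is an assumption, the function name `solve_analogy` is hypothetical, and plain words are used as keys here instead of the `('word','Tag')` keys of the real vector file.

```python
import numpy as np


def unit(v):
    return v / np.linalg.norm(v)


def solve_analogy(a, b, c, vectors):
    # a : b = c : ?  ->  key nearest to (b - a + c) by cosine similarity,
    # with the three question words excluded from the candidates.
    target = unit(unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c]))
    candidates = [k for k in vectors if k not in (a, b, c)]
    return max(candidates, key=lambda k: float(np.dot(unit(vectors[k]), target)))


# Toy 3-dimensional vectors, made up for this example.
toy_vectors = {
    "일본": np.array([1.0, 0.0, 1.0]),
    "도쿄": np.array([1.0, 1.0, 1.0]),
    "프랑스": np.array([0.0, 0.0, 2.0]),
    "파리": np.array([0.0, 1.0, 2.0]),
    "밥": np.array([-1.0, 0.2, 0.0]),
}
answer = solve_analogy("일본", "도쿄", "프랑스", toy_vectors)
print(answer)
```

The test accuracy is then the fraction of questions for which the returned item matches the expected answer.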
Visualization
Visualize the learned embeddings in two-dimensional space using PCA.
$ python3 test/visualization.py pos.vec --words 밥 밥을 물 물을
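To illustrate what the PCA projection does, the sketch below reduces a few 150-dimensional vectors to 2D coordinates using numpy's SVD. This is a numpy-only stand-in, not the project's visualization script, and the embeddings are random placeholders rather than trained word vectors.

```python
import numpy as np


def pca_2d(matrix):
    # Centre the rows, then project them onto the top two principal
    # directions obtained from the SVD of the centred matrix.
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


words = ["밥", "밥을", "물", "물을"]
rng = np.random.default_rng(0)
# Random stand-ins for real word vectors (one row per word).
embeddings = rng.normal(size=(len(words), 150))
coords = pca_2d(embeddings)
for word, (x, y) in zip(words, coords):
    print(word, round(float(x), 3), round(float(y), 3))
```

The resulting coordinates can be drawn as a scatter plot with one label per word. When plotting Korean labels with matplotlib, a font that includes Hangul glyphs is typically needed.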
Download Pre-trained Morpheme Vectors
The pre-trained morpheme vectors were trained with our model on a Naver news corpus (218M tokens). You can download them here: http://mmlab.snu.ac.kr/~djlee/pos.vec
Load Vectors using Gensim Library
$ python3
>>> from gensim.models.keyedvectors import KeyedVectors
>>> pos_vectors = KeyedVectors.load_word2vec_format('pos.vec', binary=False)
>>> pos_vectors.most_similar("('대통령','Noun')")
>>> pos_vectors.most_similar(positive=["('도쿄','Noun')", "('프랑스','Noun')"], negative=["('일본','Noun')"])