
dongjun-Lee / Kor2vec

Licence: mit
Library for Korean morpheme and word vector representation

Programming Languages

python


kor2vec

Library for Korean morpheme and word vector representation.

Requirements

For training,

  • Python 3
  • Tensorflow
  • numpy, scipy
  • Konlpy (Twitter)

For testing and visualization,

  • gensim
  • sklearn
  • matplotlib

Model

We define each word as the set of its morphemes, and represent a word vector as the sum of its morphemes' vectors.
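As a minimal sketch of this definition, the snippet below sums made-up 3-dimensional morpheme vectors for a word analysed as a noun plus a particle. The vector values, the `morpheme_vectors` dict, and the `compose_word_vector` helper are illustrative assumptions and not part of this library; plain Python lists are used so that it runs without any dependencies.

```python
# Toy morpheme vectors keyed by (surface form, POS tag); the values are made up.
morpheme_vectors = {
    ("대통령", "Noun"): [0.5, -1.0, 2.0],
    ("이", "Josa"): [0.25, 0.5, -1.0],
}


def compose_word_vector(morphemes, vectors):
    """Return the element-wise sum of the vectors of the given morphemes."""
    rows = [vectors[m] for m in morphemes]
    return [sum(column) for column in zip(*rows)]


# "대통령이" analysed as noun + particle (a hypothetical analysis for illustration).
word_vector = compose_word_vector(
    [("대통령", "Noun"), ("이", "Josa")], morpheme_vectors
)
print(word_vector)
```

Because the word vector is just a sum, two inflected forms that share a stem morpheme share most of their representation.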

Train Vectors

To train morpheme vectors, run:

$ python3 train.py <input_corpus>

<input_corpus> format: one sentence per line.
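If your raw text is not yet in this shape, something along the following lines could produce it. This is only a sketch: the regex-based sentence splitter and the `to_corpus_lines` name are assumptions made for illustration, and the project does not prescribe any particular preprocessing.

```python
import re


def to_corpus_lines(text):
    """Split raw text into sentences and return one stripped sentence per item."""
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s.strip() for s in sentences if s.strip()]


raw = "나는 밥을 먹었다. 물을 마셨다!\n날씨가 좋다."
lines = to_corpus_lines(raw)

# One sentence per line, ready to be written to the corpus file.
corpus_text = "\n".join(lines) + "\n"
print(corpus_text, end="")
```

Writing `corpus_text` to a text file (UTF-8 is assumed here) gives a file that can be passed as `<input_corpus>`.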

Change Hyperparameters

$ python3 train.py -h
usage: train.py [-h] [--embedding_size EMBEDDING_SIZE]
                [--window_size WINDOW_SIZE] [--min_count MIN_COUNT]
                [--num_sampled NUM_SAMPLED] [--learning_rate LEARNING_RATE]
                [--sampling_rate SAMPLING_RATE] [--epochs EPOCHS]
                [--batch_size BATCH_SIZE]
                input

positional arguments:
  input                 input text file for training: one sentence per line

optional arguments:
  -h, --help            show this help message and exit
  --embedding_size EMBEDDING_SIZE
                        embedding vector size (default=150)
  --window_size WINDOW_SIZE
                        window size (default=5)
  --min_count MIN_COUNT
                        minimal number of word occurences (default=5)
  --num_sampled NUM_SAMPLED
                        number of negatives sampled (default=50)
  --learning_rate LEARNING_RATE
                        learning rate (default=1.0)
  --sampling_rate SAMPLING_RATE
                        rate for subsampling frequent words (default=0.0001)
  --epochs EPOCHS       number of epochs (default=3)
  --batch_size BATCH_SIZE
                        batch size (default=150)

Load Trained Morpheme Vectors

$ python3
>>> from gensim.models.keyedvectors import KeyedVectors
>>> pos_vectors = KeyedVectors.load_word2vec_format('pos.vec', binary=False)
>>> pos_vectors.most_similar("('대통령','Noun')")
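For orientation, `load_word2vec_format(..., binary=False)` reads the plain-text word2vec format: a header line with the vocabulary size and dimensionality, then one `key v1 v2 ...` row per entry. The sketch below parses that layout by hand. The sample rows are made up (the real contents of `pos.vec` are not shown here), `parse_word2vec_text` is a hypothetical helper, and it assumes keys contain no spaces, which matches the `('대통령','Noun')` key style used above.

```python
def parse_word2vec_text(lines):
    """Parse word2vec text-format lines into a {key: vector} dict."""
    rows = iter(lines)
    # Header: "<vocab_size> <dimension>"
    vocab_size, dim = (int(x) for x in next(rows).split())
    vectors = {}
    for line in rows:
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.rstrip().split(" ")
        key, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {key!r}, got {len(values)}")
        vectors[key] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} entries, found {len(vectors)}")
    return vectors


# Made-up sample in the same layout; not taken from the real pos.vec.
sample = [
    "2 3",
    "('대통령','Noun') 0.1 0.2 0.3",
    "('이','Josa') -0.1 0.0 0.5",
]
vectors = parse_word2vec_text(sample)
print(sorted(vectors))
```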

Generate Word Vectors

A word vector is defined as the sum of its morphemes' vectors.

$ python3
>>> from konlpy.tag import Twitter
>>> import numpy as np
>>> twitter = Twitter()
>>> word = "대통령이"
>>> pos_list = twitter.pos(word, norm=True)
>>> word_vector = np.sum([pos_vectors.word_vec(str(pos).replace(" ", "")) for pos in pos_list], axis=0)
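The lookup key above is the string form of each `(morpheme, tag)` tuple with spaces removed, and a lookup for a morpheme that is not in the vocabulary (for example, one that fell below `--min_count`) raises `KeyError`. The sketch below shows the key construction and one possible way to separate known from unknown morphemes before summing. The `morpheme_key` and `split_known` helpers, the hard-coded `pos_list`, and the toy `vocabulary` set are assumptions for illustration; they stand in for the tagger output and the loaded vectors.

```python
def morpheme_key(pos):
    """Format a (morpheme, tag) tuple as str(tuple) with spaces removed."""
    return str(pos).replace(" ", "")


def split_known(pos_list, vocabulary):
    """Return (known_keys, missing_keys) for a list of (morpheme, tag) tuples."""
    keys = [morpheme_key(pos) for pos in pos_list]
    known = [k for k in keys if k in vocabulary]
    missing = [k for k in keys if k not in vocabulary]
    return known, missing


# Hypothetical tagger output for "대통령이" and a toy vocabulary.
pos_list = [("대통령", "Noun"), ("이", "Josa")]
vocabulary = {"('대통령','Noun')"}

known, missing = split_known(pos_list, vocabulary)
print(known, missing)
```

Whether to skip missing morphemes or to treat the whole word as unknown is a design choice left to the caller.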

Test Dataset

Test Morpheme Vectors

Similarity Test

Word similarity test using kor_ws353.csv.

$ python3 test/similarity_test.py pos.vec
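The internals of `test/similarity_test.py` and the columns of `kor_ws353.csv` are not shown here. As a rough sketch of how WordSim-353-style evaluations are usually scored, the code below computes the cosine similarity for each word pair and then the Spearman rank correlation against human ratings. All vectors and ratings are toy values, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)  # undefined if either list is constant


# Toy data: (vector 1, vector 2, human rating); all values are made up.
pairs = [
    ([1.0, 0.0], [1.0, 0.1], 9.0),
    ([1.0, 0.0], [1.0, 1.0], 5.0),
    ([1.0, 0.0], [0.0, 1.0], 1.0),
]
model_scores = [cosine(u, v) for u, v, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho = spearman(model_scores, human_scores)
print(rho)
```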

Analogy Test (Semantic)

Word analogy test using kor_analogy_semantic.txt.

$ python3 test/analogy_test.py pos.vec
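How `test/analogy_test.py` scores each question is not shown here. A common approach, sketched below under that caveat, answers "a is to b as c is to ?" by taking the vocabulary entry closest in cosine similarity to `vec(b) - vec(a) + vec(c)`, excluding the three query words. The toy vectors and the `solve_analogy` helper are assumptions for illustration; gensim's `most_similar` additionally normalizes the vectors, which this sketch omits.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Return the key nearest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    candidates = (k for k in vectors if k not in (a, b, c))
    return max(candidates, key=lambda k: cosine(vectors[k], target))


# Toy 3-dimensional vectors; the values are made up so the analogy holds exactly.
toy_vectors = {
    "('일본','Noun')": [1.0, 0.0, 0.0],
    "('도쿄','Noun')": [1.0, 1.0, 0.0],
    "('프랑스','Noun')": [0.0, 0.0, 1.0],
    "('파리','Noun')": [0.0, 1.0, 1.0],
    "('밥','Noun')": [1.0, 0.0, 1.0],
}

# 일본 : 도쿄 = 프랑스 : ?
answer = solve_analogy(
    "('일본','Noun')", "('도쿄','Noun')", "('프랑스','Noun')", toy_vectors
)
print(answer)
```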

Visualization

Visualize the learned embeddings in two-dimensional space using PCA.

$ python3 test/visualization.py pos.vec --words 밥 밥을 물 물을

Download Pre-trained Morpheme Vectors

The morpheme vectors were trained on a Naver news corpus (218M tokens) using our model. You can download the pre-trained morpheme vectors here: http://mmlab.snu.ac.kr/~djlee/pos.vec

Load Vectors using Gensim Library

$ python3
>>> from gensim.models.keyedvectors import KeyedVectors
>>> pos_vectors = KeyedVectors.load_word2vec_format('pos.vec', binary=False)
>>> pos_vectors.most_similar("('대통령','Noun')")
>>> pos_vectors.most_similar(positive=["('도쿄','Noun')", "('프랑스','Noun')"], negative=["('일본','Noun')"])