KoSpacing
R package for automatic Korean word spacing.
A Python version can be found here.
Introduction
Word spacing is an important preprocessing step in Korean text
analysis, and its accuracy strongly affects the accuracy of subsequent
analysis. KoSpacing provides fairly accurate automatic word spacing and
works especially well on online text originating from SNS or SMS.
For example, “아버지가방에들어가신다.” can be spaced in either of the following ways:
- “아버지가 방에 들어가신다.” means “My father enters the room.”
- “아버지 가방에 들어가신다.” means “My father goes into the bag.”
By common sense, the first is the correct answer.
KoSpacing is based on a deep learning model trained on a large corpus
(more than 100 million news articles from Chan-Yub Park).
Performance
| Test Set | Accuracy |
|---|---|
| Sejong (colloquial style) Corpus (1M) | 97.1% |
| OOOO (literary style) Corpus (3M) | 94.3% |
- Accuracy = (number of correctly spaced characters) / (number of
characters in the test data).
- Performance might improve further if compound words are normalized.
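The accuracy definition above can be illustrated with a minimal sketch. It assumes that a character counts as "correctly spaced" when the presence or absence of a space right after it matches the reference; this is one plausible reading, not necessarily the exact procedure used for the table. The helpers `space_labels()` and `spacing_accuracy()` are hypothetical and are not part of KoSpacing.

```r
# Hypothetical helper (not part of KoSpacing): for every non-space character,
# record whether a space directly follows it.
space_labels <- function(text) {
  chars <- strsplit(text, "")[[1]]
  is_space <- chars == " "
  # TRUE at position i means the character after position i is a space.
  followed_by_space <- c(is_space[-1], FALSE)
  followed_by_space[!is_space]
}

# Assumed metric: share of non-space characters whose label matches the reference.
spacing_accuracy <- function(predicted, reference) {
  pred <- space_labels(predicted)
  ref <- space_labels(reference)
  # Both strings must contain the same number of non-space characters.
  stopifnot(length(pred) == length(ref))
  mean(pred == ref)
}

# The two readings of the example sentence differ at two of twelve characters.
spacing_accuracy("아버지 가방에 들어가신다.", "아버지가 방에 들어가신다.")
```

Under this assumption, the wrong reading of the example sentence still scores 10/12 (about 0.83), which shows why character-level accuracy figures tend to look high even when a sentence contains a spacing error.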
Install
To install from GitHub, use
install.packages('remotes')
remotes::install_github('haven-jeon/KoSpacing')
library(KoSpacing)
set_env()
Example
library(KoSpacing)
#> If you install package first fime,
#> Please set_env() run before using spacing()
spacing("๊นํํธ์ํ์์ฅ๋ถ์๊ฐ๋'1987'์๋ค์ด๋ฒ์ํ์ ๋ณด๋คํฐ์ฆ10์ ํ์์์ธ๊ธ๋๋จ์ด๋ค์์ง๋ํด12์27์ผ๋ถํฐ์ฌํด1์10์ผ๊น์งํต๊ณํ๋ก๊ทธ๋จR๊ณผKoNLPํจํค์ง๋กํ
์คํธ๋ง์ด๋ํ์ฌ๋ถ์ํ๋ค.")
#> loaded KoSpacing model!
#> [1] "๊นํํธ ์ํ์์ฅ ๋ถ์๊ฐ๋ '1987'์ ๋ค์ด๋ฒ ์ํ ์ ๋ณด ๋คํฐ์ฆ 10์ ํ์์ ์ธ๊ธ๋ ๋จ์ด๋ค์ ์ง๋ํด 12์ 27์ผ๋ถํฐ ์ฌํด 1์ 10์ผ๊น์ง ํต๊ณ ํ๋ก๊ทธ๋จ R๊ณผ KoNLP ํจํค์ง๋ก ํ
์คํธ๋ง์ด๋ํ์ฌ ๋ถ์ํ๋ค."
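When several sentences need spacing, one option is to call the spacing function once per sentence, as in the example above. The sketch below assumes only that the function takes a single string and returns a single string. The wrapper `space_all()` and its `spacer` argument are hypothetical names introduced here for illustration, not part of the package.

```r
# Hypothetical wrapper (not part of KoSpacing): apply a spacing function to
# each sentence in turn, leaving empty strings and NA values untouched.
space_all <- function(sentences, spacer) {
  vapply(sentences, function(s) {
    if (is.na(s) || !nzchar(s)) {
      return(s)
    }
    spacer(s)
  }, character(1), USE.NAMES = FALSE)
}

# Usage with the package loaded (assumes set_env() has already been run):
# space_all(c("아버지가방에들어가신다.", ""), KoSpacing::spacing)
```

Passing the spacing function as an argument keeps the wrapper easy to try with a stand-in function when the model is not installed.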
Model Architecture
Citation
@misc{heewon2018,
author = {Heewon Jeon},
title = {KoSpacing: Automatic Korean word spacing},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/KoSpacing}}