
haven-jeon / KoSpacing

Licence: other
Automatic Korean word spacing with R


Projects that are alternatives of or similar to KoSpacing

BERT-embedding
A simple wrapper class for extracting features (embeddings) and comparing them using BERT in TensorFlow
Stars: ✭ 24 (-68.42%)
Mutual labels:  korean, korean-nlp
hangul-search-js
🇰🇷 Simple Korean text search module
Stars: ✭ 22 (-71.05%)
Mutual labels:  korean, korean-nlp
PyKOMORAN
(Beta) PyKOMORAN is a Python wrapper for KOMORAN using Py4J.
Stars: ✭ 38 (-50%)
Mutual labels:  korean, korean-nlp
kss
Kss: A Toolkit for Korean sentence segmentation
Stars: ✭ 198 (+160.53%)
Mutual labels:  korean, korean-nlp
KoEDA
Korean Easy Data Augmentation
Stars: ✭ 62 (-18.42%)
Mutual labels:  korean, korean-nlp
detox
Korean Hate Speech Detection Model
Stars: ✭ 38 (-50%)
Mutual labels:  korean, korean-nlp
g2pK
g2pK: g2p module for Korean
Stars: ✭ 137 (+80.26%)
Mutual labels:  korean, korean-nlp
KLUE
📖 Korean NLU Benchmark
Stars: ✭ 420 (+452.63%)
Mutual labels:  korean, korean-nlp
Hangulize
Hangulize transcribes non-Korean words into Hangul
Stars: ✭ 152 (+100%)
Mutual labels:  korean
Cs Univ Wiki
A university life guide for computer science students
Stars: ✭ 202 (+165.79%)
Mutual labels:  korean
Koalanlp
KoalaNLP = Korean + Scala + NLP. A collection of Korean morphological analyzers and syntactic parsers.
Stars: ✭ 146 (+92.11%)
Mutual labels:  korean
Tossi
Chooses correct Korean particle morphs for arbitrary words.
Stars: ✭ 160 (+110.53%)
Mutual labels:  korean
Kime
Korean IME
Stars: ✭ 208 (+173.68%)
Mutual labels:  korean
Pytorch Tutorials Kr
🇰🇷 Repository for translating the tutorials provided by PyTorch into Korean
Stars: ✭ 148 (+94.74%)
Mutual labels:  korean
Golang News
Golang tech news newsletter
Stars: ✭ 233 (+206.58%)
Mutual labels:  korean
Inko
🇰🇷 Open-source JavaScript library that converts text typed on an English keyboard layout into Hangul, and text typed on a Korean layout into English
Stars: ✭ 143 (+88.16%)
Mutual labels:  korean
The Road To Learn React Korean
🇰🇷 Korean edition of The Road to learn React (2018) [Deprecated]
Stars: ✭ 140 (+84.21%)
Mutual labels:  korean
Nodejs Ko
Node.js Korean community
Stars: ✭ 240 (+215.79%)
Mutual labels:  korean
Neodgm
Modern TrueType font based on an old-but-good Korean bitmap font.
Stars: ✭ 230 (+202.63%)
Mutual labels:  korean
Hangulize
Korean Alphabet Transcription
Stars: ✭ 184 (+142.11%)
Mutual labels:  korean

KoSpacing

License: GPL v3

R package for automatic Korean word spacing.

A Python version is also available.

Introduction

Word spacing is an important preprocessing step in Korean text analysis, and its accuracy directly affects the accuracy of every subsequent analysis step. KoSpacing spaces text fairly accurately and works especially well on online text from social media (SNS) or SMS.

For example, the unspaced sentence

“아버지가방에들어가신다.” can be spaced in either of two ways:

  1. “아버지가 방에 들어가신다.” means “My father enters the room.”
  2. “아버지 가방에 들어가신다.” means “My father goes into the bag.”

Common sense says the first is the right answer.
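
To make the ambiguity concrete, here is a minimal base-R sketch (it does not use KoSpacing). It checks that both candidate spacings collapse to the same unspaced input and differ only in where the first space falls. The variable names are illustrative, and the code assumes a UTF-8 locale so that `strsplit()` splits the text into Hangul syllables.

```r
# Two valid spacings of the same unspaced sentence.
candidates <- c(
  "아버지가 방에 들어가신다.",  # "My father enters the room."
  "아버지 가방에 들어가신다."   # "My father goes into the bag."
)

# Removing the spaces yields the same input string for both candidates.
unspaced <- gsub(" ", "", candidates, fixed = TRUE)
same_input <- identical(unspaced[1], unspaced[2])

# Character offsets of the spaces in each candidate (assumes a UTF-8 locale).
space_positions <- lapply(
  strsplit(candidates, "", fixed = TRUE),
  function(chars) which(chars == " ")
)

print(same_input)
print(space_positions)
```

Because the unspaced characters alone do not determine the answer, a spacing model has to rely on context learned from data to choose between such candidates.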

KoSpacing is based on a deep learning model trained on a large corpus (more than 100 million news articles provided by Chan-Yub Park).

Performance

Test Set                                  Accuracy
Sejong (colloquial style) Corpus (1M)     97.1%
OOOO (literary style) Corpus (3M)         94.3%

  • Accuracy = (# correctly spaced characters) / (# characters in the test data).
    • Performance might improve if compound words were normalized.
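
The accuracy definition above can be sketched in base R as follows. This is an illustration under stated assumptions, not the evaluation script behind the reported numbers: the source does not say whether the denominator counts every character or only non-space characters, so the sketch counts non-space characters and labels each one by whether a space follows it. The function names `space_labels` and `spacing_accuracy` are hypothetical.

```r
# Label each non-space character TRUE if it is followed by a space.
# Assumption: only non-space characters are counted.
space_labels <- function(text) {
  chars <- strsplit(text, "", fixed = TRUE)[[1]]
  is_space <- chars == " "
  # For every character, is the next character a space? (the last has none)
  followed_by_space <- c(is_space[-1], FALSE)
  followed_by_space[!is_space]
}

# Character-level accuracy: share of characters whose "space follows"
# decision matches the reference spacing.
spacing_accuracy <- function(predicted, reference) {
  pred <- space_labels(predicted)
  ref <- space_labels(reference)
  # Both strings must contain the same number of non-space characters.
  stopifnot(length(pred) == length(ref))
  mean(pred == ref)
}

# Only the decision after "a" differs, so 3 of 4 characters are correct.
print(spacing_accuracy("ab cd", "a b cd"))  # 0.75
```

In practice, `predicted` would be the output of `spacing()` for a test sentence stripped of its spaces, and `reference` would be the original, correctly spaced sentence.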

Install

To install from GitHub, run:

install.packages('remotes')
remotes::install_github('haven-jeon/KoSpacing')
library(KoSpacing)
set_env()

Example

library(KoSpacing)
#> If you install package first fime,
#> Please set_env() run before using spacing()
spacing("๊น€ํ˜•ํ˜ธ์˜ํ™”์‹œ์žฅ๋ถ„์„๊ฐ€๋Š”'1987'์˜๋„ค์ด๋ฒ„์˜ํ™”์ •๋ณด๋„คํ‹ฐ์ฆŒ10์ ํ‰์—์„œ์–ธ๊ธ‰๋œ๋‹จ์–ด๋“ค์„์ง€๋‚œํ•ด12์›”27์ผ๋ถ€ํ„ฐ์˜ฌํ•ด1์›”10์ผ๊นŒ์ง€ํ†ต๊ณ„ํ”„๋กœ๊ทธ๋žจR๊ณผKoNLPํŒจํ‚ค์ง€๋กœํ…์ŠคํŠธ๋งˆ์ด๋‹ํ•˜์—ฌ๋ถ„์„ํ–ˆ๋‹ค.")
#> loaded KoSpacing model!
#> [1] "๊น€ํ˜•ํ˜ธ ์˜ํ™”์‹œ์žฅ ๋ถ„์„๊ฐ€๋Š” '1987'์˜ ๋„ค์ด๋ฒ„ ์˜ํ™” ์ •๋ณด ๋„คํ‹ฐ์ฆŒ 10์  ํ‰์—์„œ ์–ธ๊ธ‰๋œ ๋‹จ์–ด๋“ค์„ ์ง€๋‚œํ•ด 12์›” 27์ผ๋ถ€ํ„ฐ ์˜ฌํ•ด 1์›” 10์ผ๊นŒ์ง€ ํ†ต๊ณ„ ํ”„๋กœ๊ทธ๋žจ R๊ณผ KoNLP ํŒจํ‚ค์ง€๋กœ ํ…์ŠคํŠธ๋งˆ์ด๋‹ํ•˜์—ฌ ๋ถ„์„ํ–ˆ๋‹ค."

Model Architecture

Citation

@misc{heewon2018,
author = {Heewon Jeon},
title = {KoSpacing: Automatic Korean word spacing},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/haven-jeon/KoSpacing}}
}