open-korean-text / Open Korean Text

Licence: apache-2.0

Open Korean Text Processor - An Open-source Korean Text Processor

Programming Languages

scala

Labels

natural-language-processing korean text-processing tokenizer

Projects that are alternatives of or similar to Open Korean Text

Ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).

Stars: ✭ 433 (-1.14%)

Mutual labels: tokenizer, text-processing

Kor2vec

Library for Korean morpheme and word vector representation

Stars: ✭ 64 (-85.39%)

Mutual labels: korean, natural-language-processing

Udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

Stars: ✭ 160 (-63.47%)

Mutual labels: tokenizer, natural-language-processing

Thot

Thot toolkit for statistical machine translation

Stars: ✭ 53 (-87.9%)

Mutual labels: tokenizer, natural-language-processing

python-mecab

A repository to bind mecab for Python 3.5+. Not using swig nor pybind. (Not Maintained Now)

Stars: ✭ 27 (-93.84%)

Mutual labels: tokenizer, text-processing

Kadot

Kadot, the unsupervised natural language processing library.

Stars: ✭ 108 (-75.34%)

Mutual labels: tokenizer, natural-language-processing

Char Rnn Tensorflow

Multi-layer Recurrent Neural Networks for character-level language models, implemented in TensorFlow

Stars: ✭ 58 (-86.76%)

Mutual labels: korean, natural-language-processing

Fastnlp

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

Stars: ✭ 2,441 (+457.31%)

Mutual labels: natural-language-processing, text-processing

Text-Classification-LSTMs-PyTorch

The aim of this repository is to show a baseline model for text classification by implementing an LSTM-based model in PyTorch. To make the model easier to understand, it is applied to a Tweets dataset provided by Kaggle.

Stars: ✭ 45 (-89.73%)

Mutual labels: tokenizer, text-processing

Pytorch Bert Crf Ner

Korean named entity recognizer built with KoBERT and CRF (BERT+CRF based Named Entity Recognition model for Korean)

Stars: ✭ 236 (-46.12%)

Mutual labels: korean, natural-language-processing

Greynir

The greynir.is natural language processing website for Icelandic

Stars: ✭ 47 (-89.27%)

Mutual labels: tokenizer, natural-language-processing

ArabicProcessingCog

A Python package that does stemming, tokenization, sentence breaking, segmentation, normalization, and POS tagging for the Arabic language.

Stars: ✭ 19 (-95.66%)

Mutual labels: tokenizer, text-processing

Py Nltools

A collection of basic python modules for spoken natural language processing

Stars: ✭ 46 (-89.5%)

Mutual labels: tokenizer, natural-language-processing

Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Stars: ✭ 132 (-69.86%)

Mutual labels: tokenizer, natural-language-processing

Stringi

THE String Processing Package for R (with ICU)

Stars: ✭ 204 (-53.42%)

Mutual labels: natural-language-processing, text-processing

Kagome

Self-contained Japanese Morphological Analyzer written in pure Go

Stars: ✭ 554 (+26.48%)

Mutual labels: korean, tokenizer

Nlpre

Python library for Natural Language Preprocessing (NLPre)

Stars: ✭ 158 (-63.93%)

Mutual labels: natural-language-processing, text-processing

Textvec

Text vectorization tool to outperform TFIDF for classification tasks

Stars: ✭ 167 (-61.87%)

Mutual labels: natural-language-processing, text-processing

Hunspell Dict Ko

Korean spellchecking dictionary for Hunspell

Stars: ✭ 187 (-57.31%)

Mutual labels: korean, natural-language-processing

hama-py

🦛 Python library for Korean text processing. Python Korean Morphological Analyzer

Stars: ✭ 16 (-96.35%)

Mutual labels: korean, text-processing

open-korean-text

Open-source Korean Text Processor (Official Fork of twitter-korean-text)

Scala/Java library to process Korean text, with a Java wrapper. open-korean-text currently provides Korean normalization, tokenization, and stemming. The text processor is not limited to short tweets; it handles long texts as well.

To take part in development, please join our community at Google Forum. Everyone is welcome, from beginners who want to learn how to use the library to those who want to contribute code.

Detailed guide to installing and modifying the project

The goal of open-korean-text is to extract index terms from big data and similar sources through simple Korean text processing. It does not aim to be a complete morphological analyzer.

open-korean-text supports four features: normalization, tokenization, stemming, and phrase extraction.

Normalization (입니닼ㅋㅋ -> 입니다 ㅋㅋ, 샤릉해 -> 사랑해)

한국어를 처리하는 예시입니닼ㅋㅋㅋㅋㅋ -> 한국어를 처리하는 예시입니다 ㅋㅋ

Tokenization

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하는Verb, 예시Noun, 입니다Adjective(이다), ㅋㅋKoreanParticle

Stemming (입니다 -> 이다)

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어Noun, 를Josa, 처리Noun, 하다Verb, 예시Noun, 이다Adjective, ㅋㅋKoreanParticle

Phrase extraction

한국어를 처리하는 예시입니다 ㅋㅋ -> 한국어, 처리, 예시, 처리하는 예시

Introductory Presentation: Google Slides

Web API Service

open-korean-text-api
This API service is hosted on Heroku (Domain: https://open-korean-text.herokuapp.com/) and currently provides normalization, tokenization, stemming, and phrase extraction.

Each service and its usage are listed below.
normalize, tokenize, stem, and extractPhrases are the Action for each service, and the query parameter is text.

Service	Usage
Normalization	https://open-korean-text-api.herokuapp.com/normalize?text=오픈코리안텍스트
Tokenization	https://open-korean-text-api.herokuapp.com/tokenize?text=오픈코리안텍스트
Stemming	https://open-korean-text-api.herokuapp.com/stem?text=오픈코리안텍스트
Phrase extraction	https://open-korean-text-api.herokuapp.com/extractPhrases?text=오픈코리안텍스트
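
As a rough illustration of the URL pattern above, the following Python sketch builds a request URL for each action and percent-encodes the Korean text as UTF-8. The base URL is copied from the usage table; the names `BASE_URL`, `ACTIONS`, and `build_url` are hypothetical helpers introduced here, not part of the project. The sketch only constructs URLs and sends no request, since the response format and the availability of the service are not described here.

```python
from urllib.parse import quote, urlencode

# Base URL copied from the usage table (assumed to be the service root).
BASE_URL = "https://open-korean-text-api.herokuapp.com"

# The four actions named in the text; each one is a path segment.
ACTIONS = ("normalize", "tokenize", "stem", "extractPhrases")


def build_url(action: str, text: str) -> str:
    """Return the request URL for one action (hypothetical helper).

    The text is passed in the `text` query parameter and percent-encoded
    as UTF-8, with spaces encoded as %20 rather than '+'.
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    query = urlencode({"text": text}, quote_via=quote)
    return f"{BASE_URL}/{action}?{query}"


if __name__ == "__main__":
    # Print one URL per action for the sample text used in the table.
    for action in ACTIONS:
        print(build_url(action, "오픈코리안텍스트"))
```

Encoding the query explicitly keeps the URL valid for clients that do not accept raw Hangul in a query string.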

Semantic Versioning

1.0.2 (Major.Minor.Patch)

Major: API change
Minor: Processor behavior change
Patch: Bug fixes without a behavior change
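
One way to read the scheme above is to compare two version strings and report which kind of change the upgrade implies. The Python sketch below does that under the assumption that versions are always three dot-separated integers; `parse_version` and `change_kind` are hypothetical helper names, not part of the project.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'Major.Minor.Patch' into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def change_kind(old: str, new: str) -> str:
    """Describe what moving from `old` to `new` implies under the scheme above."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    if new_parts[0] != old_parts[0]:
        return "API change"
    if new_parts[1] != old_parts[1]:
        return "processor behavior change"
    if new_parts[2] != old_parts[2]:
        return "bug fixes without a behavior change"
    return "no change"


if __name__ == "__main__":
    print(change_kind("1.0.2", "2.1.0"))  # major component differs
    print(change_kind("2.0.0", "2.1.0"))  # only minor differs
    print(change_kind("2.1.0", "2.1.1"))  # only patch differs
```

The most significant differing component decides the result, so a jump that changes both major and minor is reported as an API change.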

API

Maven

To include this in your Maven-based JVM project, add the following lines to your pom.xml:

  <dependency>
    <groupId>org.openkoreantext</groupId>
    <artifactId>open-korean-text</artifactId>
    <version>2.1.0</version>
  </dependency>

Maven Repository: http://mvnrepository.com/artifact/org.openkoreantext/open-korean-text

Support for other languages.

Type	Language	Contributor
Wrapper	.net/C#	modamoda
Wrapper	Node JS	Ch0p
Wrapper	Node JS	Youngrok Kim
Wrapper	Python	Jaepil Jeong
Wrapper	Clojure	Seonho Kim
Wrapper	Ruby for Java Version	jun85664396
Wrapper	Ruby for Scala Version	Jaehyun Shin
Porting	Python	Baeg-il Kim
Package	Python Korean NLP	KoNLPy
Package	Elastic Search	socurites
Package	Elastic Search	Jaehyun Shin

Get the source

Clone the git repo and build using Maven.

git clone https://github.com/open-korean-text/open-korean-text.git
cd open-korean-text
mvn compile

Open 'pom.xml' from your favorite IDE.

Basic Usage

You can find usage examples in the examples folder.

Running Tests

Run mvn test to execute all unit tests.

Contribution

Refer to the general contribution guide. We will add this project-specific contribution guide later.

Detailed guide to installing and modifying the project

Performance

Tested on Intel i7 2.3 GHz

Initial loading time: 2~4 sec

Average time to parse one chunk (eojeol): 0.12 ms

Tweets (Avg length ~50 chars)

Tweets	100K	200K	300K	400K	500K	600K	700K	800K	900K	1M
Time in Seconds	57.59	112.09	165.05	218.11	270.54	328.52	381.09	439.71	492.94	542.12

Average per tweet: 0.54212 ms
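
The per-tweet average above follows from the last column of the table: cumulative seconds divided by the number of tweets, expressed in milliseconds. The short Python sketch below repeats that arithmetic for every column; the timings are copied from the table, and `TIMINGS` and `ms_per_tweet` are hypothetical names used only for illustration.

```python
# Cumulative timings copied from the table: (number of tweets, seconds).
TIMINGS = [
    (100_000, 57.59),
    (200_000, 112.09),
    (300_000, 165.05),
    (400_000, 218.11),
    (500_000, 270.54),
    (600_000, 328.52),
    (700_000, 381.09),
    (800_000, 439.71),
    (900_000, 492.94),
    (1_000_000, 542.12),
]


def ms_per_tweet(tweets: int, seconds: float) -> float:
    """Convert a cumulative timing to average milliseconds per tweet."""
    return seconds * 1000.0 / tweets


if __name__ == "__main__":
    for tweets, seconds in TIMINGS:
        print(f"{tweets:>9,} tweets: {ms_per_tweet(tweets, seconds):.5f} ms per tweet")
```

The final row gives 542.12 s over 1M tweets, which is the 0.54212 ms figure quoted above.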

Benchmark test by KoNLPy

From http://konlpy.org/ko/v0.4.3/morph/#pos-tagging-with-konlpy

Author

Will Hohyon Ryu (유호현): https://github.com/nlpenguin | https://twitter.com/NLPenguin

Admin Staff

Mingyu Kim (김민규): https://github.com/MechanicKim

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Stars: ✭ 438