ks-shim / klay

License: Apache-2.0
KLAY - Korean Language AnalYzer (a Korean morphological analyzer)

1. KLAY

Korean Language AnalYzer using KOMORAN's dictionaries.

  • Korean morphological analysis
  • KLAY is a Korean morphological analyzer.
  • Goals
    • Faster analysis speed
    • More idiomatic Java ...
    • Maintain analysis quality (quality improvements are planned for later)
  • Development started : 2019. 02 ~
    • version : 0.1 (2019.02.26)
    • version : 0.3 (2019.03.18)
    • version : 0.3.2 (2022.01.18)
    • version : 0.3.6 (2022.09.07) <-- current
  • Analysis is based on KOMORAN's dictionaries, but the data structures and the analysis method differ from KOMORAN's.
  • Data Structure : a Lucene Trie, modified to fit KLAY's analysis method.
  • KLAY is a thread-safe analyzer. (Use in multi-threaded environments is recommended.)
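
Because the analyzer is described as thread-safe, a single instance can in principle be shared by a pool of worker threads. The following is a minimal sketch under that assumption. `SharedAnalyzerSketch` and `analyzeAll` are hypothetical names, and the `Function` parameter is a stand-in for a call such as `klay.doKlay(text)` so that the example runs without the KLAY libraries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch: one shared, thread-safe analyzer used by a fixed thread pool.
// The Function is a stand-in (assumption) for a call such as klay.doKlay(text).
class SharedAnalyzerSketch {
    static <R> List<R> analyzeAll(List<String> lines, Function<String, R> analyzer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per input line; all tasks share the same analyzer.
            List<Future<R>> futures = new ArrayList<>();
            for (String line : lines) {
                Callable<R> task = () -> analyzer.apply(line);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order, so output order matches input order.
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // String::length is a placeholder; replace it with the real analyzer call.
        Function<String, Integer> standIn = String::length;
        System.out.println(analyzeAll(List.of("a", "bb", "ccc"), standIn, 4));
    }
}
```

Collecting the futures in submission order keeps results aligned with the input lines even though the tasks may finish out of order.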

2. Architecture

KLAY was designed with both performance and extensibility in mind, and with a strong focus on readability. As a result, the design aims to be more idiomatic Java.

2-1. Tokenization

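Tokenization is implemented with the Chain of Responsibility pattern. New rules can easily be added by implementing the ChainedTokenizationRule interface. Currently the following rules are applied in order.

  • UserDictionaryMatchRule : matches tokens against the user dictionary
  • CharacterTypeAndLengthLimitRule : splits by character type and enforces a length limit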

tokenization_diagram
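
The rule chain described above can be illustrated with a small, self-contained sketch. All names here (`ChainDemo`, `TokenRule`, `DictionaryRule`, `CharTypeRule`) are hypothetical and are not KLAY's actual `ChainedTokenizationRule` API. The sketch only shows the pattern: each rule either handles the current position or delegates to the next rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a Chain of Responsibility tokenizer; not KLAY's real API.
class ChainDemo {
    /** A rule returns the length of the token starting at pos, or delegates to the next rule. */
    interface TokenRule {
        int apply(String text, int pos);
    }

    /** First rule: longest match against a user dictionary, otherwise delegate. */
    static final class DictionaryRule implements TokenRule {
        private final Set<String> dictionary;
        private final TokenRule next;

        DictionaryRule(Set<String> dictionary, TokenRule next) {
            this.dictionary = dictionary;
            this.next = next;
        }

        @Override
        public int apply(String text, int pos) {
            for (int end = text.length(); end > pos; end--) {
                if (dictionary.contains(text.substring(pos, end))) {
                    return end - pos;
                }
            }
            return next.apply(text, pos);
        }
    }

    /** Last rule: group consecutive characters of the same type, up to a maximum length. */
    static final class CharTypeRule implements TokenRule {
        private final int maxLength;

        CharTypeRule(int maxLength) {
            this.maxLength = maxLength;
        }

        @Override
        public int apply(String text, int pos) {
            int type = Character.getType(text.charAt(pos));
            int end = pos + 1;
            while (end < text.length() && end - pos < maxLength
                    && Character.getType(text.charAt(end)) == type) {
                end++;
            }
            return end - pos;
        }
    }

    /** Walks the text, asking the head of the chain for the next token length each time. */
    static List<String> tokenize(String text, TokenRule chain) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int length = chain.apply(text, pos);
            tokens.add(text.substring(pos, pos + length));
            pos += length;
        }
        return tokens;
    }

    public static void main(String[] args) {
        TokenRule chain = new DictionaryRule(Set.of("klay"), new CharTypeRule(3));
        System.out.println(tokenize("klay2019....", chain));
    }
}
```

Adding a rule in this sketch means writing one more `TokenRule` and linking it into the chain; the `tokenize` loop does not change.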

2-2. Analysis

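Analysis is likewise implemented with the Chain of Responsibility pattern. New rules can easily be added by implementing the ChainedAnalysisRule interface. Currently the following rules are applied in order.

  • CanSkipRule : handles input that can be skipped without analysis
  • FWDRule : matches the input fully against the pre-analyzed dictionary
  • AllPossibleCandidateRule : estimates unregistered (out-of-vocabulary) words
  • NARule : handles input that cannot be analyzed

The HMM (Viterbi) computation is performed by the MorphSequence class.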

analysis_diagram
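
For readers unfamiliar with HMM decoding, here is a textbook Viterbi implementation in log space. It is a generic sketch and does not reproduce KLAY's `MorphSequence` class. The class name `ViterbiSketch`, the two-state model, and every probability in `main` are invented for illustration.

```java
import java.util.Arrays;

// Textbook Viterbi decoding in log space; a generic sketch, not KLAY's MorphSequence.
class ViterbiSketch {
    /**
     * @param start      start[s]         = log P(state s at position 0)
     * @param transition transition[p][s] = log P(state s | previous state p)
     * @param emission   emission[t][s]   = log P(observation at position t | state s)
     * @return the most probable state sequence
     */
    static int[] decode(double[] start, double[][] transition, double[][] emission) {
        int steps = emission.length;
        int states = start.length;
        if (steps == 0) {
            return new int[0];
        }
        double[][] score = new double[steps][states];
        int[][] back = new int[steps][states];

        for (int s = 0; s < states; s++) {
            score[0][s] = start[s] + emission[0][s];
        }

        for (int t = 1; t < steps; t++) {
            for (int s = 0; s < states; s++) {
                int best = 0;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < states; p++) {
                    double candidate = score[t - 1][p] + transition[p][s];
                    if (candidate > bestScore) {
                        bestScore = candidate;
                        best = p;
                    }
                }
                score[t][s] = bestScore + emission[t][s];
                back[t][s] = best;
            }
        }

        // Pick the best final state, then follow the back-pointers to the start.
        int last = 0;
        for (int s = 1; s < states; s++) {
            if (score[steps - 1][s] > score[steps - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[steps];
        path[steps - 1] = last;
        for (int t = steps - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy model with two states (0 and 1); all numbers are made up.
        double[] start = {Math.log(0.9), Math.log(0.1)};
        double[][] transition = {
            {Math.log(0.3), Math.log(0.7)},
            {Math.log(0.8), Math.log(0.2)}
        };
        double[][] emission = {
            {Math.log(0.6), Math.log(0.1)},
            {Math.log(0.2), Math.log(0.7)},
            {Math.log(0.5), Math.log(0.1)}
        };
        System.out.println(Arrays.toString(decode(start, transition, emission)));
    }
}
```

In a morphological analyzer the states would correspond to part-of-speech tags, with emission and transition scores looked up from the dictionaries; working in log space avoids numeric underflow on long inputs.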

2-3. Dictionary

The dictionary uses a modified version of Lucene's Trie.

dictionary_diagram
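
To show why a trie suits dictionary lookup during analysis, here is a plain character trie with a common-prefix search. This is a generic sketch, not Lucene's Trie and not KLAY's modified variant. The class `TrieSketch`, its methods, and the sample entries with their tags are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain character trie with common-prefix search; a generic sketch, not KLAY's Lucene-based trie.
class TrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String value; // non-null when a dictionary entry ends at this node
    }

    private final Node root = new Node();

    /** Inserts a key with its associated value, creating nodes as needed. */
    void put(String key, String value) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.value = value;
    }

    /** Returns the values of all entries that are prefixes of text starting at offset. */
    List<String> commonPrefixSearch(String text, int offset) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (int i = offset; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.value != null) {
                hits.add(node.value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        // Entries and tags are illustrative only.
        trie.put("재미", "재미/NNG");
        trie.put("재미있", "재미있/VA");
        System.out.println(trie.commonPrefixSearch("재미있게", 0));
    }
}
```

A single pass over the input from a given offset yields every dictionary entry that starts there, which is the set of candidates a morphological analyzer needs at each position.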

3. Example

    //***********************************************************************
    // 1. configuration and creating Klay object ...
    //***********************************************************************
    Klay klay = new Klay(Paths.get("data/configuration/klay.conf"));

    //***********************************************************************
    // 2. start morphological analysis.
    //***********************************************************************
    String text = "너무기대안하고갔나....................재밌게봤다";
    Morphs morphs = klay.doKlay(text);

    //***********************************************************************
    // 3. print result.
    //***********************************************************************
    Iterator<Morph> iter = morphs.iterator();
    while(iter.hasNext()) {
        System.out.println(iter.next());
    }

4. Performance

4-1. Specifications and data

  • Processor : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, 4008 MHz, 4 cores, 8 logical processors
  • Memory : 32.0 GB
  • Test data location : data/performance/test.txt
  • Number of test records : 199,992

4-2. Results and code

  • Dictionary loading : 0.284 (s)
  • Analysis time : 16.815 (s)

    String src = "data/performance/test.txt";
    Klay klay = new Klay(Paths.get("data/configuration/klay.conf"));

    StopWatch watch = new StopWatch();
    watch.start();
    int count = 0;
    try (BufferedReader in = new BufferedReader(new FileReader(src))) {
        String line = null;
        while((line = in.readLine()) != null) {
            line = line.trim();
            if(line.isEmpty()) continue;

            klay.doKlay(line);
            System.out.print("\r" + ++count);
        }
    }
    watch.stop();
    System.out.println("Analysis Time : " + watch.getTime(TimeUnit.MILLISECONDS) / 1000.0 + " (s)");
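
The reported figures (199,992 records analyzed in 16.815 s) can be turned into throughput and mean per-record latency by simple division. The sketch below does only that arithmetic; the class and method names are hypothetical. Note that the measured loop above also prints a counter for every record, so the reported time includes that console output.

```java
// Derives throughput and mean per-record latency from the reported benchmark figures.
class ThroughputSketch {
    static double recordsPerSecond(long records, double seconds) {
        return records / seconds;
    }

    static double microsPerRecord(long records, double seconds) {
        return seconds * 1_000_000.0 / records;
    }

    public static void main(String[] args) {
        long records = 199_992;   // number of test records reported above
        double seconds = 16.815;  // analysis time reported above
        System.out.printf("%.0f records/s, %.1f us/record%n",
                recordsPerSecond(records, seconds), microsPerRecord(records, seconds));
    }
}
```

This works out to roughly 11,900 records per second, or about 84 microseconds per record, on a single thread.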

5. Elasticsearch Plugin Download

6. Resources Download

7. KLAY for python

8. Maven

<dependency>
  <groupId>io.github.ks-shim.klay</groupId>
  <artifactId>klay-common</artifactId>
  <version>0.3.8</version>
</dependency>
<dependency>
  <groupId>io.github.ks-shim.klay</groupId>
  <artifactId>klay-dictionary</artifactId>
  <version>0.3.8</version>
</dependency>
<dependency>
  <groupId>io.github.ks-shim.klay</groupId>
  <artifactId>klay-core</artifactId>
  <version>0.3.8</version>
</dependency>
<repositories>
  <repository>
      <id>oos</id>
      <url>https://s01.oss.sonatype.org/content/groups/public/</url>
  </repository>
</repositories>

9. Dictionary build

  • dictionary-build module : run klay.dictionary.build.DictionaryBuilder
public static void main(String[] args) throws Exception {

    // 1. Before running, update the raw dictionary settings in the configuration file.
    Properties config = new Properties();
    config.load(Files.newInputStream(Paths.get("data/configuration/klay.conf")));

    // 2. Load the pos-frequency information used for the emission/transition probabilities.
    DictionaryTextSource posFreqSource = new DictionaryTextSource(Paths.get(config.getProperty("dictionary.grammar.path")));

    // 3. Create the source/target information for the emission-probability dictionary.
    DictionaryTextSource[] emissionSources = {
            // *** must build DIC_WORD first !!
            new DictionaryTextSource(
                    Paths.get(config.getProperty("dictionary.word.path")), DictionaryTextSource.DictionaryType.DIC_WORD),
            new DictionaryTextSource(
                    Paths.get(config.getProperty("dictionary.irregular.path")), DictionaryTextSource.DictionaryType.DIC_IRREGULAR)
    };
    DictionaryBinaryTarget emissionTarget =
            new DictionaryBinaryTarget(Paths.get(config.getProperty("dictionary.emission.path")));

    // 4. Create the source/target information for the transition-probability dictionary.
    DictionaryTextSource transitionSource =
            new DictionaryTextSource(
                    Paths.get(config.getProperty("dictionary.grammar.path")), DictionaryTextSource.DictionaryType.GRAMMAR);
    DictionaryBinaryTarget transitionTarget =
            new DictionaryBinaryTarget(Paths.get(config.getProperty("dictionary.transition.path")));

    // 5. Create the builder and start building.
    DictionaryBuilder builder = new DictionaryBuilder.Builder()
            .posFreqSource(posFreqSource)
            .emissionSourcesAndTarget(emissionSources, emissionTarget)
            .transitionSourceAndTarget(transitionSource, transitionTarget)
            .build();

    builder.buildAll();
}
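
The build code above reads five path properties from klay.conf. A small pre-flight check, sketched here, could report missing keys before the build starts. The property keys are the ones used in the code above; the class `ConfigCheckSketch`, its method, and the placeholder path value are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check for the property keys read by the dictionary build.
class ConfigCheckSketch {
    static final String[] REQUIRED = {
        "dictionary.grammar.path",
        "dictionary.word.path",
        "dictionary.irregular.path",
        "dictionary.emission.path",
        "dictionary.transition.path"
    };

    /** Returns the required keys that are absent or blank in the given configuration. */
    static List<String> missingKeys(Properties config) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = config.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        Properties config = new Properties();
        // Placeholder content; a real run would load data/configuration/klay.conf instead.
        config.load(new StringReader("dictionary.word.path=some/placeholder/path\n"));
        System.out.println("missing: " + missingKeys(config));
    }
}
```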