
WorksApplications / elasticsearch-sudachi

License: Apache-2.0
The Japanese analysis plugin for Elasticsearch

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives to or similar to elasticsearch-sudachi

Elastiknn
Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
Stars: ✭ 139 (+7.75%)
Mutual labels:  elasticsearch-plugin
RKOMORAN
RKOMORAN is a KOMORAN wrapper for R users
Stars: ✭ 15 (-88.37%)
Mutual labels:  morphological-analyser
GrammarEngine
Grammatical Dictionary of the Russian Language (+ English, Japanese, etc.)
Stars: ✭ 68 (-47.29%)
Mutual labels:  morphological-analyser
Graph Aided Search
Elasticsearch plugin offering Neo4j integration for Personalized Search
Stars: ✭ 153 (+18.6%)
Mutual labels:  elasticsearch-plugin
sinling
A collection of NLP tools for Sinhalese (සිංහල).
Stars: ✭ 38 (-70.54%)
Mutual labels:  morphological-analyser
docker-curator
Docker images for Elasticsearch Curator
Stars: ✭ 23 (-82.17%)
Mutual labels:  elasticsearch-plugin
Performance Analyzer
📈 OpenDistro for Elasticsearch Performance Analyzer
Stars: ✭ 128 (-0.78%)
Mutual labels:  elasticsearch-plugin
elasticsearch plugin
Nodeos plugin for archiving blockchain data into Elasticsearch.
Stars: ✭ 57 (-55.81%)
Mutual labels:  elasticsearch-plugin
Elasticsearch
Elasticsearch is a real-time distributed search and analytics engine.
Stars: ✭ 23 (-82.17%)
Mutual labels:  elasticsearch-plugin
PyKOMORAN
(Beta) PyKOMORAN is KOMORAN wrapped in Python, using Py4J.
Stars: ✭ 38 (-70.54%)
Mutual labels:  morphological-analyser
Mirage
🎨 GUI for simplifying Elasticsearch Query DSL
Stars: ✭ 2,143 (+1561.24%)
Mutual labels:  elasticsearch-plugin
Elastik Nearest Neighbors
Go to: https://github.com/alexklibisz/elastiknn
Stars: ✭ 249 (+93.02%)
Mutual labels:  elasticsearch-plugin
Morse.jl
Paper: Morphological Analysis Using a Sequence Decoder
Stars: ✭ 14 (-89.15%)
Mutual labels:  morphological-analyser
Esparser
Write SQL in PHP and convert it to Elasticsearch query DSL
Stars: ✭ 142 (+10.08%)
Mutual labels:  elasticsearch-plugin
frog
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
Stars: ✭ 70 (-45.74%)
Mutual labels:  morphological-analyser
Elasticsearch Dataformat
Excel/CSV/BulkJSON downloads on Elasticsearch.
Stars: ✭ 135 (+4.65%)
Mutual labels:  elasticsearch-plugin
elasticsearch-langfield
This plugin provides a useful feature for multi-language
Stars: ✭ 13 (-89.92%)
Mutual labels:  elasticsearch-plugin
rosette-elasticsearch-plugin
Document Enrichment plugin for Elasticsearch
Stars: ✭ 25 (-80.62%)
Mutual labels:  elasticsearch-plugin
elasticsearch-report-engine
An Elasticsearch plugin to return query results as PDF, HTML, or CSV.
Stars: ✭ 49 (-62.02%)
Mutual labels:  elasticsearch-plugin
NMeCab
Japanese morphological analyzer on .NET
Stars: ✭ 65 (-49.61%)
Mutual labels:  morphological-analyser

analysis-sudachi

analysis-sudachi is an Elasticsearch plugin for tokenization of Japanese text using Sudachi, the Japanese morphological analyzer.


What's new?

  • version 2.1.0

    • Added a new property additional_settings for writing Sudachi settings directly in the config
    • Added support for specifying Elasticsearch version at build time
  • version 2.0.3

    • Fix duplicated tokens for OOVs with sudachi_split filter's extended mode
  • version 2.0.2

    • Upgrade Sudachi to 0.4.3
      • Fix overrun with surrogate pairs
  • version 2.0.1

    • Upgrade Sudachi to 0.4.2
      • Fix buffer overrun with character normalization
  • version 2.0.0

    • A new mode split_mode was added
    • A new filter sudachi_split was added to replace mode
    • mode was deprecated
    • Upgrade Sudachi morphological analyzer to 0.4.1
    • Words containing periods are no longer split
    • Fix a bug causing wrong offsets with icu_normalizer
  • version 1.3.2

    • Upgrade Sudachi morphological analyzer to 0.3.1
  • version 1.3.1

    • Upgrade Sudachi morphological analyzer to 0.3.0
    • Minor bug fix
  • version 1.3.0

    • Upgrade Sudachi morphological analyzer to 0.2.0
    • Import Sudachi from maven central repository
    • Minor bug fix
  • version 1.2.0

    • Upgrade Sudachi morphological analyzer to 0.2.0-SNAPSHOT
    • New filter sudachi_normalizedform was added; see sudachi_normalizedform
    • Default normalization behavior was changed; neither the baseform filter nor the normalizedform filter is applied by default
    • The sudachi_readingform filter was changed, with new romaji mappings based on MS-IME
  • version 1.1.0

  • version 1.0.0

    • first release

Build (if necessary)

  1. Check out the branch for your target Elasticsearch version:
   • develop branch: for Elasticsearch 7.8 or later
   • es7.4-7.7 branch: for Elasticsearch 7.4, 7.5, 7.6, 7.7
   • es7.0-7.3 branch: for Elasticsearch 7.0, 7.1, 7.2, 7.3
   • es6.8 branch: for Elasticsearch 6.8
   • es5.6 branch: for Elasticsearch 5.6
  2. Build analysis-sudachi, specifying the target Elasticsearch version:
   $ ./gradlew -PelasticsearchVersion=7.10.1 build
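For example, to build for Elasticsearch 6.8, check out the matching branch first (a sketch; it assumes the older branches accept the same -PelasticsearchVersion property, and the version value should match your cluster):

   $ git clone https://github.com/WorksApplications/elasticsearch-sudachi.git
   $ cd elasticsearch-sudachi
   $ git checkout es6.8
   $ ./gradlew -PelasticsearchVersion=6.8.0 build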

Installation

  1. Change the current directory to $ES_HOME

  2. Install the Plugin

    a. Using the release package

    $ bin/elasticsearch-plugin install https://github.com/WorksApplications/elasticsearch-sudachi/releases/download/v2.1.0/analysis-sudachi-7.10.1-2.1.0.zip
    

    b. Using a self-built package

    $ bin/elasticsearch-plugin install file:///path/to/analysis-sudachi-7.10.1-2.1.1-SNAPSHOT.zip
    

    (Specify the absolute path in URI format)

  3. Download the Sudachi dictionary archive from https://github.com/WorksApplications/SudachiDict

  4. Extract the .dic file and place it at config/sudachi/system_core.dic (you must install system_core.dic at this location if you use Elasticsearch 7.6 or later); see the sketch after this list

  5. Execute "bin/elasticsearch"
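A minimal sketch of steps 3 and 4, assuming the downloaded archive unpacks to a directory containing system_core.dic (file and directory names are illustrative; use the actual release you downloaded):

   $ unzip sudachi-dictionary-<version>-core.zip
   $ mkdir -p config/sudachi
   $ cp sudachi-dictionary-<version>-core/system_core.dic config/sudachi/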

Update Sudachi

If you want to update the Sudachi version bundled with a plugin you have installed, do the following:

  1. Download the latest version of Sudachi from the release page.
  2. Extract the Sudachi JAR file from the zip.
  3. Delete the Sudachi JAR file in $ES_HOME/plugins/analysis-sudachi and replace it with the JAR file you extracted in step 2, as sketched below.
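A minimal sketch of the swap, with illustrative version numbers in the JAR file names (use the actual file names from your installation and the release zip):

   $ rm $ES_HOME/plugins/analysis-sudachi/sudachi-<old-version>.jar
   $ cp sudachi-<new-version>.jar $ES_HOME/plugins/analysis-sudachi/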

Configuration

  • split_mode: Select the splitting mode of Sudachi (A, B, C). (string, default: C)
    • C: Extracts named entities
      • Ex) 選挙管理委員会
    • B: Splits into units between A and C
      • Ex) 選挙, 管理, 委員会
    • A: Splits into the shortest units, equivalent to the UniDic short unit
      • Ex) 選挙, 管理, 委員, 会
  • discard_punctuation: Whether to discard punctuation. (bool, default: true)
  • settings_path: Sudachi settings file path. The path may be absolute or relative; relative paths are resolved against the Elasticsearch config directory. (string, default: null)
  • resources_path: Sudachi dictionary path. The path may be absolute or relative; relative paths are resolved against the Elasticsearch config directory. (string, default: null)
  • additional_settings: A JSON string of Sudachi settings. This JSON string is merged into the default configuration. If this property is set, settings_path is ignored.

Example

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "C",
            "discard_punctuation": true,
            "resources_path": "/etc/elasticsearch/sudachi"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
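To see how split_mode changes the output, the following sketch defines an analyzer with split_mode "A" and analyzes the named-entity example from the list above; per the A-mode description, the expected tokens are 選挙, 管理, 委員, 会 (the index name and tokenizer/analyzer names are illustrative):

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_a_tokenizer": {
            "type": "sudachi_tokenizer",
            "split_mode": "A"
          }
        },
        "analyzer": {
          "sudachi_a_analyzer": {
            "filter": [],
            "tokenizer": "sudachi_a_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_a_analyzer",
  "text": "選挙管理委員会"
}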

Dictionary

You can specify the dictionary either in the file specified by settings_path or by additional_settings.

Example

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "additional_settings": "{\"systemDict\":\"system_full.dic\",\"userDict\":[\"user.dic\"]}"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}
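Equivalently, a sketch of the settings_path approach, assuming the same keys are placed in a Sudachi settings file at config/sudachi/sudachi.json (the file name is illustrative; relative paths are resolved against the Elasticsearch config directory):

{
  "systemDict": "system_full.dic",
  "userDict": ["user.dic"]
}

The tokenizer then references that file:

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer",
            "settings_path": "sudachi/sudachi.json"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}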

Filters

sudachi_split

This filter works like mode of kuromoji.

  • search: Additional segmentation useful for search. (Uses the C and A modes)
    • Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ
  • extended: Similar to search mode, but also unigrams unknown words.
    • Ex) 関西国際空港, 関西, 国際, 空港 / アバラカダブラ, ア, バ, ラ, カ, ダ, ブ, ラ

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": ["my_searchfilter" ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_searchfilter": {
            "type": "sudachi_split",
            "mode": "search"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
    "analyzer": "sudachi_analyzer",
    "text": "関西国際空港"
}

Which responds with:

{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}

sudachi_part_of_speech

The sudachi_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

stoptags is an array of part-of-speech and/or inflection tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analysis-sudachi.jar.

Sudachi POS information is a CSV list consisting of six items:

  • 1-4 part-of-speech hierarchy (品詞階層)
  • 5 inflectional type (活用型)
  • 6 inflectional form (活用形)

With stoptags, you can filter out results using any of these forward-matching forms:

  • 1 - e.g., 名詞
  • 1,2 - e.g., 名詞,固有名詞
  • 1,2,3 - e.g., 名詞,固有名詞,地名
  • 1,2,3,4 - e.g., 名詞,固有名詞,地名,一般
  • 5 - e.g., 五段-カ行
  • 6 - e.g., 終止形-一般
  • 5,6 - e.g., 五段-カ行,終止形-一般

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [ "my_posfilter" ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_posfilter":{
            "type":"sudachi_part_of_speech",
            "stoptags":[
              "助詞",
              "助動詞",
              "補助記号,句点",
              "補助記号,読点"
            ]
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "寿司がおいしいね"
}

Which responds with:

{
  "tokens": [
    {
      "token": "寿司",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "美味しい",
      "start_offset": 3,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}

sudachi_ja_stop

The sudachi_ja_stop token filter filters out Japanese stopwords (_japanese_), as well as any custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, use the stop token filter instead.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [ "my_stopfilter" ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        },
        "filter":{
          "my_stopfilter":{
            "type":"sudachi_ja_stop",
            "stopwords":[
              "_japanese_",
              "",
              "です"
            ]
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "私は宇宙人です。"
}

Which responds with:

{
  "tokens": [
    {
      "token": "",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "宇宙",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 3
    }
  ]
}

sudachi_baseform

The sudachi_baseform token filter replaces terms with their SudachiBaseFormAttribute. This acts as a lemmatizer for verbs and adjectives.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [ "sudachi_baseform" ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "飲み"
}

Which responds with:

{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}

sudachi_normalizedform

The sudachi_normalizedform token filter replaces terms with their SudachiNormalizedFormAttribute. This acts as a normalizer for spelling variants.

This filter also lemmatizes verbs and adjectives, so you do not need to use the sudachi_baseform filter together with it.

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "sudachi_analyzer": {
            "filter": [ "sudachi_normalizedform" ],
            "tokenizer": "sudachi_tokenizer",
            "type": "custom"
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "sudachi_analyzer",
  "text": "呑み"
}

Which responds with:

{
  "tokens": [
    {
      "token": "飲む",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    }
  ]
}

sudachi_readingform

The sudachi_readingform token filter replaces the token with its reading form, in either katakana or romaji. It accepts the following setting:

use_romaji

Whether romaji reading form should be output instead of katakana. Defaults to false.

When using the pre-defined sudachi_readingform filter, use_romaji is set to true. The default when defining a custom sudachi_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:

PUT sudachi_sample

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "romaji_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": true
          },
          "katakana_readingform": {
            "type": "sudachi_readingform",
            "use_romaji": false
          }
        },
        "tokenizer": {
          "sudachi_tokenizer": {
            "type": "sudachi_tokenizer"
          }
        },
        "analyzer": {
          "romaji_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": [ "romaji_readingform" ]
          },
          "katakana_analyzer": {
            "tokenizer": "sudachi_tokenizer",
            "filter": [ "katakana_readingform" ]
          }
        }
      }
    }
  }
}

POST sudachi_sample/_analyze

{
  "analyzer": "katakana_analyzer",
  "text": "寿司"
}

Returns スシ.

POST sudachi_sample/_analyze

{
  "analyzer": "romaji_analyzer",
  "text": "寿司"
}

Returns susi.

License

Copyright (c) 2017-2020 Works Applications Co., Ltd.

Originally under elasticsearch, https://www.elastic.co/jp/products/elasticsearch
Originally under lucene, https://lucene.apache.org/
