himkt / Konoha

License: MIT
🌿 An easy-to-use Japanese text-processing tool that makes it possible to switch tokenizers with minimal changes to your code.

Programming Languages

Python

Projects that are alternatives of or similar to Konoha

Textvec
Text vectorization tool to outperform TFIDF for classification tasks
Stars: ✭ 167 (+28.46%)
Mutual labels:  natural-language-processing, text-processing
Pykakasi
NLP: Convert Japanese Kana-kanji sentences into Kana-Roman in simple algorithm.
Stars: ✭ 238 (+83.08%)
Mutual labels:  japanese, natural-language-processing
Fastnlp
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
Stars: ✭ 2,441 (+1777.69%)
Mutual labels:  natural-language-processing, text-processing
Prenlp
Preprocessing Library for Natural Language Processing
Stars: ✭ 130 (+0%)
Mutual labels:  natural-language-processing, text-processing
Lingua Franca
Mycroft's multilingual text parsing and formatting library
Stars: ✭ 51 (-60.77%)
Mutual labels:  natural-language-processing, text-processing
Nlpre
Python library for Natural Language Preprocessing (NLPre)
Stars: ✭ 158 (+21.54%)
Mutual labels:  natural-language-processing, text-processing
Japanese.js
Util collection for Japanese text processing. Hiraganize, Katakanize, and Romanize.
Stars: ✭ 150 (+15.38%)
Mutual labels:  japanese, text-processing
Stanza Old
Stanford NLP group's shared Python tools.
Stars: ✭ 142 (+9.23%)
Mutual labels:  natural-language-processing, text-processing
Nagisa Tutorial Pycon2019
Code for the PyCon JP 2019 talk "Python による日本語自然言語処理 〜系列ラベリングによる実世界テキスト分析〜" (Japanese NLP with Python: real-world text analysis with sequence labeling)
Stars: ✭ 46 (-64.62%)
Mutual labels:  japanese, natural-language-processing
Open Korean Text
Open Korean Text Processor - An Open-source Korean Text Processor
Stars: ✭ 438 (+236.92%)
Mutual labels:  natural-language-processing, text-processing
Stringi
THE String Processing Package for R (with ICU)
Stars: ✭ 204 (+56.92%)
Mutual labels:  natural-language-processing, text-processing
Toiro
A comparison tool of Japanese tokenizers
Stars: ✭ 95 (-26.92%)
Mutual labels:  japanese, natural-language-processing
Pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build a simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
Stars: ✭ 426 (+227.69%)
Mutual labels:  natural-language-processing, text-processing
Awesome Bert Japanese
📝 A list of pre-trained BERT models for Japanese with word/subword tokenization + vocabulary construction algorithm information
Stars: ✭ 76 (-41.54%)
Mutual labels:  japanese, natural-language-processing
Cogcomp Nlpy
CogComp's light-weight Python NLP annotators
Stars: ✭ 115 (-11.54%)
Mutual labels:  natural-language-processing, text-processing
Spacy Dev Resources
💫 Scripts, tools and resources for developing spaCy
Stars: ✭ 123 (-5.38%)
Mutual labels:  natural-language-processing
Nadesiko3
Japanese Programming Language Nadesiko v3 (JavaScript)
Stars: ✭ 125 (-3.85%)
Mutual labels:  japanese
Fnc 1 Baseline
A baseline implementation for FNC-1
Stars: ✭ 123 (-5.38%)
Mutual labels:  natural-language-processing
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-6.92%)
Mutual labels:  natural-language-processing
Rasa Chatbot Templates
RASA chatbot use case boilerplate
Stars: ✭ 127 (-2.31%)
Mutual labels:  natural-language-processing

🌿 Konoha: Simple wrapper of Japanese Tokenizers


Konoha is a Python library that provides an easy-to-use, unified interface to various Japanese tokenizers, which enables you to switch tokenizers and streamline your pre-processing.

Supported tokenizers

Konoha supports multiple tokenizers, including MeCab, Janome, Sudachi, and SentencePiece. It also provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
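
A minimal sketch of the rule-based tokenizers (assuming they are selected by name through WordTokenizer, in the same way as the tokenizers shown in the Example section below):

from konoha import WordTokenizer

# rule-based tokenizers need no external dictionaries or models
print(WordTokenizer("whitespace").tokenize("natural language processing"))
# => [natural, language, processing]
print(WordTokenizer("character").tokenize("ペン"))
# => [ペ, ン]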

Quick Start with Docker

Simply run the following on your computer:

docker run --rm -p 8000:8000 -t himkt/konoha  # from DockerHub

Or you can build the image on your machine:

git clone https://github.com/himkt/konoha  # download konoha
cd konoha && docker-compose up --build  # build and launch container

Tokenization is done by posting a JSON object to localhost:8000/api/v1/tokenize. You can also batch-tokenize by passing texts: ["１つ目の入力", "２つ目の入力"] to the server.

(API documentation is available at localhost:8000/redoc; you can view it in your web browser.)

Send a request using curl from your terminal. Note that the endpoint path changed in v4.6.4; please check the release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).

$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
    -d '{"tokenizer": "mecab", "text": "これはペンです"}'

{
  "tokens": [
    [
      {
        "surface": "これ",
        "part_of_speech": "名詞"
      },
      {
        "surface": "は",
        "part_of_speech": "助詞"
      },
      {
        "surface": "ペン",
        "part_of_speech": "名詞"
      },
      {
        "surface": "です",
        "part_of_speech": "助動詞"
      }
    ]
  ]
}
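
For batch tokenization from Python, here is a minimal sketch (it assumes the server from the quick start above is running on localhost:8000 and that the requests package is installed; the texts field is the batch counterpart of text):

import requests

# one request tokenizes several texts at once
payload = {"tokenizer": "mecab", "texts": ["１つ目の入力", "２つ目の入力"]}
response = requests.post("http://localhost:8000/api/v1/tokenize", json=payload)
print(response.json())  # token lists for each input text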

Installation

I recommend installing konoha with pip install 'konoha[all]' or pip install 'konoha[all_with_integrations]' (all_with_integrations additionally installs AllenNLP).

  • Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'.
  • Install konoha with a specific tokenizer and AllenNLP integration: pip install 'konoha[(tokenizer_name),allennlp]'.
  • Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'.

If you want to use a specific tokenizer, install konoha with that tokenizer's extra (e.g. konoha[mecab], konoha[sudachi], etc.) or install the tokenizer separately.
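
For example, the patterns above expand to concrete commands like these (only extras named in this README are shown):

pip install 'konoha[mecab]'             # konoha + MeCab
pip install 'konoha[sudachi,allennlp]'  # konoha + Sudachi + AllenNLP integration
pip install 'konoha[mecab,remote]'      # konoha + MeCab + remote file (S3) support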

Example

Word level tokenization

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [自然, 言語, 処理, を, 勉強, し, て, い, ます]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [▁, 自然, 言語, 処理, を, 勉強, し, ています]
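
Note that tokenize returns token objects rather than plain strings. As a sketch (assuming the token attributes mirror the fields shown in the API response above, which may differ across konoha versions), individual fields can be read like this:

for token in tokenizer.tokenize(sentence):
    # surface form of each token; tokenizer-dependent fields such as
    # part of speech are also available (see the API response above)
    print(token.surface)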

For more detail, please see the example/ directory.

Remote files

Konoha supports dictionaries and models stored on cloud storage (currently Amazon S3). This requires installing konoha with the remote option; see Installation.

# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))

Sentence level tokenization

from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが，「かわいい。それで十分だろう」。"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが，「かわいい。それで十分だろう」。']

AllenNLP integration

Konoha provides an AllenNLP integration that lets users specify the konoha tokenizer in a Jsonnet config file. By running allennlp train with --include-package konoha, you can train a model using the konoha tokenizer!

For example, the konoha tokenizer is specified in xxx.jsonnet as follows:

{
  "dataset_reader": {
    "lazy": false,
    "type": "text_classification_json",
    "tokenizer": {
      "type": "konoha",  // <-- konoha here!!!
      "tokenizer_name": "janome",
    },
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true,
      },
    },
  },
  ...
  "model": {
  ...
  },
  "trainer": {
  ...
  }
}

After filling in the other sections (e.g. model config, trainer config, etc.), allennlp train config/xxx.jsonnet --include-package konoha --serialization-dir yyy works! (Remember to include konoha with --include-package konoha.)
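
Spelled out as a command:

allennlp train config/xxx.jsonnet \
  --include-package konoha \
  --serialization-dir yyy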

For more detail, please refer to my blog article (in Japanese, sorry).

Test

python -m pytest

Article

Acknowledgement

The Sentencepiece model used in tests is provided by @yoheikikuta. Thanks!
