himkt / Konoha
Programming Languages
Projects that are alternatives of or similar to Konoha
๐ฟ Konoha: Simple wrapper of Japanese Tokenizers
Konoha
is a Python library for providing easy-to-use integrated interface of various Japanese tokenizers,
which enables you to switch a tokenizer and boost your pre-processing.
Supported tokenizers
Also, konoha
provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
Quick Start with Docker
Simply run followings on your computer:
docker run --rm -p 8000:8000 -t himkt/konoha # from DockerHub
Or you can build image on your machine:
git clone https://github.com/himkt/konoha # download konoha
cd konoha && docker-compose up --build # build and launch container
Tokenization is done by posting a json object to localhost:8000/api/v1/tokenize
.
You can also batch tokenize by passing texts: ["๏ผใค็ฎใฎๅ
ฅๅ", "๏ผใค็ฎใฎๅ
ฅๅ"]
to the server.
(API documentation is available on localhost:8000/redoc
, you can check it using your web browser)
Send a request using curl
on your terminal.
Note that a path to an endpoint is changed in v4.6.4.
Please check our release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).
$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "text": "ใใใฏใใณใงใ"}'
{
"tokens": [
[
{
"surface": "ใใ",
"part_of_speech": "ๅ่ฉ"
},
{
"surface": "ใฏ",
"part_of_speech": "ๅฉ่ฉ"
},
{
"surface": "ใใณ",
"part_of_speech": "ๅ่ฉ"
},
{
"surface": "ใงใ",
"part_of_speech": "ๅฉๅ่ฉ"
}
]
]
}
Installation
I recommend you to install konoha by pip install 'konoha[all]'
or pip install 'konoha[all_with_integrations]'
.
(all_with_integrations
will install AllenNLP
)
- Install konoha with a specific tokenizer:
pip install 'konoha[(tokenizer_name)]
. - Install konoha with a specific tokenizer and AllenNLP integration:
pip install 'konoha[(tokenizer_name),allennlp]
. - Install konoha with a specific tokenizer and remote file support:
pip install 'konoha[(tokenizer_name),remote]'
If you want to install konoha with a tokenizer, please install konoha with a specific tokenizer
(e.g. konoha[mecab]
, konoha[sudachi]
, ...etc) or install tokenizers individually.
Example
Word level tokenization
from konoha import WordTokenizer
sentence = '่ช็ถ่จ่ชๅฆ็ใๅๅผทใใฆใใพใ'
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [่ช็ถ, ่จ่ช, ๅฆ็, ใ, ๅๅผท, ใ, ใฆ, ใ, ใพใ]
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [โ, ่ช็ถ, ่จ่ช, ๅฆ็, ใ, ๅๅผท, ใ, ใฆใใพใ]
For more detail, please see the example/
directory.
Remote files
Konoha supports dictionary and model on cloud storage (currently supports Amazon S3).
It requires installing konoha with the remote
option, see Installation.
# download user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))
# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))
# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))
Sentence level tokenization
from konoha import SentenceTokenizer
sentence = "็งใฏ็ซใ ใๅๅใชใใฆใใฎใฏใชใใใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ"
tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็ซใ ใ', 'ๅๅใชใใฆใใฎใฏใชใใ', 'ใ ใ๏ผใใใใใใใใใงๅๅใ ใใใใ']
AllenNLP integration
Konoha provides AllenNLP integration, it enables users to specify konoha tokenizer in a Jsonnet config file.
By running allennlp train
with --include-package konoha
, you can train a model using konoha tokenizer!
For example, konoha tokenizer is specified in xxx.jsonnet
like following:
{
"dataset_reader": {
"lazy": false,
"type": "text_classification_json",
"tokenizer": {
"type": "konoha", // <-- konoha here!!!
"tokenizer_name": "janome",
},
"token_indexers": {
"tokens": {
"type": "single_id",
"lowercase_tokens": true,
},
},
},
...
"model": {
...
},
"trainer": {
...
}
}
After finishing other sections (e.g. model config, trainer config, ...etc), allennlp train config/xxx.jsonnet --include-package konoha --serialization-dir yyy
works!
(remember to include konoha by --include-package konoha
)
For more detail, please refer my blog article (in Japanese, sorry).
Test
python -m pytest
Article
- Introducing Konoha (in Japanese): ใใผใฏใใคใถใใใๆใใซๅใๆฟใใใฉใคใใฉใช konoha ใไฝใฃใ
- Implementing AllenNLP integration (in Japanese): ๆฅๆฌ่ช่งฃๆใใผใซ Konoha ใซ AllenNLP ้ฃๆบๆฉ่ฝใๅฎ่ฃ ใใ
Acknowledgement
Sentencepiece model used in test is provided by @yoheikikuta. Thanks!