megagonlabs / bunkai

License: Apache-2.0
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai

Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

Quick Start

Install

$ pip install -U bunkai

Disambiguation without Models

$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
  • Each input line is one document. Feed a document as a single line, using ▁ (U+2581) in place of its line breaks.
  • The output marks sentence boundaries with │ (U+2502).
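
The input and output convention above can be sketched in Python as follows. The helper names `to_bunkai_line` and `split_bunkai_output` are hypothetical and not part of Bunkai; they only illustrate how a multi-line document might be flattened before being piped to the command, and how an output line might be split back into sentences.

```python
from typing import List

LINE_BREAK = "\u2581"         # ▁ stands in for a line break inside a document
SENTENCE_BOUNDARY = "\u2502"  # │ marks a sentence boundary in the output


def to_bunkai_line(document: str) -> str:
    """Flatten a multi-line document into the one-line input form (hypothetical helper)."""
    return document.replace("\r\n", "\n").replace("\n", LINE_BREAK)


def split_bunkai_output(line: str) -> List[str]:
    """Split one output line into sentences and restore the line breaks (hypothetical helper)."""
    return [part.replace(LINE_BREAK, "\n") for part in line.split(SENTENCE_BOUNDARY)]


print(to_bunkai_line("2文書目の先頭行です。\n改行はU+2581で表現します。"))
print(split_bunkai_output("2文書目の先頭行です。▁│改行はU+2581で表現します。"))
```

A line break that precedes a boundary stays attached to the end of the preceding sentence, mirroring where ▁ appears in the output above.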

Disambiguation for Line Breaks with a Model

To disambiguate sentence boundaries at line breaks as well, add the --model option with the path to the model.

First, install the extras required by the --model option.

$ pip install -U 'bunkai[lb]'

Second, set up a model. This takes some time.

$ bunkai --model bunkai-model-directory --setup

Then run Bunkai with the model directory specified.

$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。

Morphological Analysis Result

You can get morphological analysis results with the --ma option.

$ echo -e '形態素解析し▁ます。結果を 表示します!' | bunkai --ma
形態素	名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ

EOS
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
結果	名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
 	記号,空白,*,*,*,*, ,*,*
表示	名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
!	記号,一般,*,*,*,*,!,!,!
EOS
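
For downstream processing, output of this shape can be parsed with a few lines of Python. This is a minimal sketch that assumes each token line has the form `surface<TAB>comma-separated features` and that `EOS` closes a chunk, as in the sample above. The function name `parse_ma_output` is hypothetical and not part of Bunkai.

```python
from typing import List, Tuple

Token = Tuple[str, List[str]]  # (surface form, feature fields)


def parse_ma_output(text: str) -> List[List[Token]]:
    """Group token lines into chunks, closing a chunk at every EOS line."""
    chunks: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if line == "EOS":
            chunks.append(current)
            current = []
        elif "\t" in line:
            # Split on the first tab only, so a surface made of spaces survives.
            surface, features = line.split("\t", 1)
            current.append((surface, features.split(",")))
        # Lines without a tab (for example blank lines) are skipped.
    return chunks


sample = (
    "し\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ\n"
    "EOS\n"
    "ます\t助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス\n"
    "。\t記号,句点,*,*,*,*,。,。,。\n"
    "EOS\n"
)
for chunk in parse_ma_output(sample):
    print([surface for surface, _ in chunk])
```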

Python Library

You can also use Bunkai as a Python library.

from bunkai import Bunkai
bunkai = Bunkai()
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます!"):
    print(sentence)
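
Since a `Bunkai()` instance is called with a text and iterated over for sentences, it can be passed around like any other callable. The sketch below applies such a splitter to several documents. `split_documents` and `naive_splitter` are hypothetical names, and the stand-in splitter is only there so the example runs without Bunkai installed; it does not reproduce Bunkai's behaviour.

```python
from typing import Callable, Iterable, Iterator, List


def split_documents(
    splitter: Callable[[str], Iterable[str]],
    documents: Iterable[str],
) -> List[List[str]]:
    """Apply a sentence splitter to each document and collect the sentences."""
    return [list(splitter(document)) for document in documents]


# With Bunkai installed, the splitter would be the instance shown above:
#     from bunkai import Bunkai
#     results = split_documents(Bunkai(), ["はい。このように使えます!"])


def naive_splitter(text: str) -> Iterator[str]:
    """Stand-in that cuts after every 。 (illustration only, not Bunkai's logic)."""
    start = 0
    for index, char in enumerate(text):
        if char == "。":
            yield text[start:index + 1]
            start = index + 1
    if start < len(text):
        yield text[start:]


print(split_documents(naive_splitter, ["はい。そうです。", "了解"]))
```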

For more information, see examples.

Documents

References

  • Yuta Hayashibe and Kensuke Mitsuzawa. Sentence Boundary Detection on Line Breaks in Japanese. Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp. 71-75. November 2020.

License

Apache License 2.0
