ikawaha / Kagome
Licence: MIT
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554
Programming Languages
go
31211 projects - #10 most used programming language
Projects that are alternatives of or similar to Kagome
Sudachipy
Python version of Sudachi, a Japanese tokenizer.
Stars: ✭ 207 (-62.64%)
Mutual labels: morphological-analysis, segmentation, nlp-library, pos-tagging
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (-54.15%)
Mutual labels: japanese, tokenizer, morphological-analysis, pos-tagging
Sudachi
A Japanese Tokenizer for Business
Stars: ✭ 496 (-10.47%)
Mutual labels: morphological-analysis, segmentation, nlp-library, pos-tagging
Sudachidict
A lexicon for Sudachi
Stars: ✭ 127 (-77.08%)
Mutual labels: morphological-analysis, segmentation, pos-tagging
Nagisa
A Japanese tokenizer based on recurrent neural networks
Stars: ✭ 260 (-53.07%)
Mutual labels: japanese, nlp-library, pos-tagging
retinal-exudates-detection
exudates detection using hybrid approach (Image Morphology & Machine Learning)
Stars: ✭ 53 (-90.43%)
Mutual labels: segmentation, morphological-analysis
udar
UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.
Stars: ✭ 15 (-97.29%)
Mutual labels: pos-tagging, morphological-analysis
japanese-pitch-accent-resources
Trying to consolidate japanese phonetic, and in particular pitch accent resources into one list
Stars: ✭ 64 (-88.45%)
Mutual labels: japanese, japanese-language
ArabicProcessingCog
A Python package that does stemming, tokenization, sentence breaking, segmentation, normalization, and POS tagging for the Arabic language.
Stars: ✭ 19 (-96.57%)
Mutual labels: tokenizer, segmentation
Hibi
[No Active Development] An Android app for learning Japanese by keeping a journal.
Stars: ✭ 37 (-93.32%)
Mutual labels: japanese, japanese-language
rakutenma-python
Rakuten MA (Python version)
Stars: ✭ 15 (-97.29%)
Mutual labels: japanese-language, pos-tagging
kanji-web-app
Angular.js kanji web application
Stars: ✭ 45 (-91.88%)
Mutual labels: japanese, japanese-language
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (-94.22%)
Mutual labels: tokenizer, morphological-analysis
KanjiRecognitionDictionary
Perfect for those who forget kanji pronunciation
Stars: ✭ 14 (-97.47%)
Mutual labels: japanese, japanese-language
KWDLC
Kyoto University Web Document Leads Corpus
Stars: ✭ 64 (-88.45%)
Mutual labels: japanese, morphological-analysis
unofficial-jisho-api
Encapsulates the official Jisho.org API and also provides kanji, example, and stroke diagram search.
Stars: ✭ 88 (-84.12%)
Mutual labels: japanese, japanese-language
Charlescd
CharlesCD is an open source tool that makes deployments more agile, continuous and safe, which allows development teams to perform hypothesis validations with a specific group of users, simultaneously.
Stars: ✭ 275 (-50.36%)
Mutual labels: hacktoberfest, segmentation
Ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).
Stars: ✭ 433 (-21.84%)
Mutual labels: tokenizer, nlp-library
Domino-English-Translation
🌏 Let's translate Domino, a Japanese MIDI editor!
Stars: ✭ 29 (-94.77%)
Mutual labels: japanese, japanese-language
Kagome v2
Kagome is an open source Japanese morphological analyzer written in pure Go. Dictionaries and statistical models such as MeCab-IPADIC and UniDic (unidic-mecab) can be embedded in the binary.
Improvements from v1
- Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
- Brushed up and added several APIs.
Dictionaries
dict | source | package |
---|---|---|
MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa |
UniDIC | unidic-mecab-2.1.2_src | github.com/ikawaha/kagome-dict/uni |
Experimental Features
dict | source | package |
---|---|---|
mecab-ipadic-NEologd | mecab-ipadic-neologd | github.com/ikawaha/kagome-ipa-neologd |
Korean MeCab | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko |
Segmentation mode for search
Like Kuromoji, Kagome supports segmentation modes for search.
- Normal: Regular segmentation
- Search: Uses a heuristic to do additional segmentation useful for search
- Extended: Similar to Search mode, but also splits unknown words into unigrams
Untokenized | Normal | Search | Extended |
---|---|---|---|
関西国際空港 | 関西国際空港 | 関西 国際 空港 | 関西 国際 空港 |
日本経済新聞 | 日本経済新聞 | 日本 経済 新聞 | 日本 経済 新聞 |
シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア ソフトウェア エンジニア | シニア ソフトウェア エンジニア |
デジカメを買った | デジカメ を 買っ た | デジカメ を 買っ た | デ ジ カ メ を 買っ た |
Programming example
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Build a tokenizer with the IPA dictionary; OmitBosEos drops the
	// BOS/EOS marker tokens from the results.
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati: split the sentence into surface strings only
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize: print each token's surface and its dictionary features
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
output:
---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
Reference
Commands
Install
Go
env GO111MODULE=on go get -u github.com/ikawaha/kagome/v2
Homebrew tap
brew install ikawaha/kagome/kagome
Usage
$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
[tokenize] - command line tokenize (*default)
server - run tokenize server
lattice - lattice viewer
version - show version
tokenize [-file input_file] [-dict dic_file] [-userdict userdic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)]
-dict string
dict
-file string
input file
-mode string
tokenize mode (normal|search|extended) (default "normal")
-simple
display abbreviated dictionary contents
-sysdict string
system dict type (ipa|uni) (default "ipa")
-udict string
user dict
Tokenize command
% kagome
すもももももももものうち
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
も 助詞,係助詞,*,*,*,*,も,モ,モ
もも 名詞,一般,*,*,*,*,もも,モモ,モモ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
うち 名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
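When this output is consumed by another program, each line can be split into the surface form and its feature columns. The following is a minimal sketch, assuming the surface and the features are separated by a tab (as in the `Printf` format of the programming example) and that the columns follow the IPADIC order, with the base form in the 7th column and the reading in the 8th. The `morpheme` type and `parseLine` function are hypothetical helpers, not part of Kagome.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// morpheme holds a few IPADIC feature columns for one output line.
type morpheme struct {
	Surface  string
	POS      string // 1st feature column, e.g. 名詞
	BaseForm string // 7th feature column
	Reading  string // 8th feature column
}

// parseLine splits "surface<TAB>f1,f2,..." into a morpheme.
// It returns ok=false for the EOS marker and for lines without a tab.
func parseLine(line string) (m morpheme, ok bool) {
	parts := strings.SplitN(line, "\t", 2)
	if len(parts) != 2 {
		return morpheme{}, false
	}
	f := strings.Split(parts[1], ",")
	m = morpheme{Surface: parts[0], POS: f[0]}
	// Unknown words may carry fewer columns, so guard the indexes.
	if len(f) > 7 {
		m.BaseForm, m.Reading = f[6], f[7]
	}
	return m, true
}

func main() {
	// Sample input copied from the output above (tab-separated).
	out := "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n" +
		"も\t助詞,係助詞,*,*,*,*,も,モ,モ\n" +
		"EOS\n"
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		if m, ok := parseLine(sc.Text()); ok {
			fmt.Printf("%s pos=%s reading=%s\n", m.Surface, m.POS, m.Reading)
		}
	}
}
```

Other dictionaries such as UniDic use a different column layout, so the indexes would need adjusting.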
Server command
API
Start a server and try to access the "/tokenize" endpoint.
% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .
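The same request can be sent from Go with the standard library. This is a sketch under assumptions: the host, port, path, and JSON fields are taken from the curl command above, the `Content-Type` header is an addition that curl does not send, and `tokenizeRequest` and `newTokenizeRequest` are hypothetical names. The response is printed as raw JSON because its schema is not described here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// tokenizeRequest mirrors the JSON body used in the curl example.
type tokenizeRequest struct {
	Sentence string `json:"sentence"`
	Mode     string `json:"mode"`
}

// newTokenizeRequest builds the PUT request without sending it and also
// returns the encoded body so callers can inspect it.
func newTokenizeRequest(baseURL, sentence, mode string) (*http.Request, []byte, error) {
	body, err := json.Marshal(tokenizeRequest{Sentence: sentence, Mode: mode})
	if err != nil {
		return nil, nil, err
	}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/tokenize", bytes.NewReader(body))
	if err != nil {
		return nil, nil, err
	}
	// Assumption: curl -d does not send this header; it is set here for clarity.
	req.Header.Set("Content-Type", "application/json")
	return req, body, nil
}

func main() {
	req, _, err := newTokenizeRequest("http://localhost:6060", "すもももももももものうち", "normal")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		// Most likely `kagome server` is not running.
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read response:", err)
		return
	}
	fmt.Println(resp.Status)
	fmt.Println(string(raw))
}
```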
Web App
Start a server and open http://localhost:6060.
(The demo application uses Graphviz to draw the lattice, so Graphviz must be installed.)
% kagome server &
Lattice command
A debugging tool that outputs the lattice built during tokenization in Graphviz DOT format.
% kagome lattice 私は鰻 | dot -Tpng -o lattice.png
Docker
Licence
MIT