ikawaha / Kagome

Licence: MIT
Self-contained Japanese Morphological Analyzer written in pure Go


Kagome v2

Kagome is an open source Japanese morphological analyzer written in pure Go. Dictionaries and statistical models such as MeCab-IPADIC and UniDic (unidic-mecab) can be embedded in the binary.

Improvements over v1:

  • Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
  • Several existing APIs were refined and new ones were added.

Dictionaries

dict         | source                      | package
MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa
UniDIC       | unidic-mecab-2.1.2_src      | github.com/ikawaha/kagome-dict/uni

Experimental Features

dict                 | source                      | package
mecab-ipadic-NEologd | mecab-ipadic-neologd        | github.com/ikawaha/kagome-ipa-neologd
Korean MeCab         | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko

Segmentation mode for search

Like Kuromoji, Kagome provides segmentation modes intended for search.

  • Normal: Regular segmentation
  • Search: Use a heuristic to do additional segmentation useful for search
  • Extended: Similar to search mode, but also splits unknown words into uni-grams

Untokenized | Normal | Search | Extended
関西国際空港 | 関西国際空港 | 関西 国際 空港 | 関西 国際 空港
日本経済新聞 | 日本経済新聞 | 日本 経済 新聞 | 日本 経済 新聞
シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア ソフトウェア エンジニア | シニア ソフトウェア エンジニア
デジカメを買った | デジカメ を 買っ た | デジカメ を 買っ た | デ ジ カ メ を 買っ た

Programming example

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

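func main() {
	// Build a tokenizer backed by the IPA dictionary. OmitBosEos drops the
	// BOS/EOS marker tokens from the results.
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// Wakati returns only the surface forms, i.e. plain word segmentation.
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// Tokenize returns tokens that also carry their dictionary features.
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}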

output:

---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

Reference

実践:形態素解析 kagome v2 (Practical Morphological Analysis with kagome v2; in Japanese)

Commands

Install

Go

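env GO111MODULE=on go get -u github.com/ikawaha/kagome/v2

On Go 1.18 and later, go get no longer installs binaries; use go install instead:

go install github.com/ikawaha/kagome/v2@latest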

Homebrew tap

brew install ikawaha/kagome/kagome

Usage

$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   version - show version

tokenize [-file input_file] [-dict dic_file] [-udict userdic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)]
  -dict string
    	dict
  -file string
    	input file
  -mode string
    	tokenize mode (normal|search|extended) (default "normal")
  -simple
    	display abbreviated dictionary contents
  -sysdict string
    	system dict type (ipa|uni) (default "ipa")
  -udict string
    	user dict

Tokenize command

% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
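
The output above is easy to consume from another program. The following is a minimal sketch, assuming each line has the shape surface, TAB, comma-separated features, and that sentences end with an EOS line, as shown in the sample. The names `morpheme`, `parseOutput`, and `feature` are hypothetical helpers, not part of Kagome. With the IPA dictionary the feature columns follow the IPADIC order: part of speech, three subcategories, conjugation type, conjugation form, base form, reading, pronunciation.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// morpheme holds one parsed output line.
type morpheme struct {
	Surface  string
	Features []string
}

// parseOutput reads "surface<TAB>features" lines and skips EOS markers
// and any line that does not match the expected shape.
func parseOutput(text string) []morpheme {
	var ms []morpheme
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if line == "EOS" {
			continue
		}
		surface, feats, ok := strings.Cut(line, "\t")
		if !ok {
			continue
		}
		ms = append(ms, morpheme{Surface: surface, Features: strings.Split(feats, ",")})
	}
	return ms
}

// feature returns the i-th feature, or "*" when the line has fewer columns
// (unknown words, for example, may carry fewer features).
func feature(m morpheme, i int) string {
	if i < len(m.Features) {
		return m.Features[i]
	}
	return "*"
}

func main() {
	out := "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n" +
		"も\t助詞,係助詞,*,*,*,*,も,モ,モ\n" +
		"EOS\n"
	for _, m := range parseOutput(out) {
		// Index 0 is the part of speech and index 7 the reading (IPADIC order).
		fmt.Printf("%s pos=%s reading=%s\n", m.Surface, feature(m, 0), feature(m, 7))
	}
}
```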

Server command

API

Start a server and send a request to the "/tokenize" endpoint.

% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .

Web App

Start a server and open http://localhost:6060 in a browser. The demo application uses Graphviz to draw the lattice, so Graphviz must be installed.

% kagome server &

Lattice command

A debugging tool that outputs the lattice built during tokenization in Graphviz dot format.

% kagome lattice 私は鰻 | dot -Tpng -o lattice.png

Docker

Licence

MIT
