
ikawaha / Kagome

Licence: MIT

Self-contained Japanese Morphological Analyzer written in pure Go

Labels

hacktoberfest segmentation japanese korean tokenizer nlp-library pos-tagging japanese-language morphological-analysis

Projects that are alternatives of or similar to Kagome

Sudachipy

Python version of Sudachi, a Japanese tokenizer.

Stars: ✭ 207 (-62.64%)

Mutual labels: morphological-analysis, segmentation, nlp-library, pos-tagging

Jumanpp

Juman++ (a Morphological Analyzer Toolkit)

Stars: ✭ 254 (-54.15%)

Mutual labels: japanese, tokenizer, morphological-analysis, pos-tagging

Sudachi

A Japanese Tokenizer for Business

Stars: ✭ 496 (-10.47%)

Mutual labels: morphological-analysis, segmentation, nlp-library, pos-tagging

Sudachidict

A lexicon for Sudachi

Stars: ✭ 127 (-77.08%)

Mutual labels: morphological-analysis, segmentation, pos-tagging

Nagisa

A Japanese tokenizer based on recurrent neural networks

Stars: ✭ 260 (-53.07%)

Mutual labels: japanese, nlp-library, pos-tagging

retinal-exudates-detection

exudates detection using hybrid approach (Image Morphology & Machine Learning)

Stars: ✭ 53 (-90.43%)

Mutual labels: segmentation, morphological-analysis

udar

UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.

Stars: ✭ 15 (-97.29%)

Mutual labels: pos-tagging, morphological-analysis

japanese-pitch-accent-resources

Trying to consolidate japanese phonetic, and in particular pitch accent resources into one list

Stars: ✭ 64 (-88.45%)

Mutual labels: japanese, japanese-language

ArabicProcessingCog

A Python package that do stemming, tokenization, sentence breaking, segmentation, normalization, POS tagging for Arabic language.

Stars: ✭ 19 (-96.57%)

Mutual labels: tokenizer, segmentation

Hibi

[No Active Development] An Android app for learning Japanese by keeping a journal.

Stars: ✭ 37 (-93.32%)

Mutual labels: japanese, japanese-language

rakutenma-python

Rakuten MA (Python version)

Stars: ✭ 15 (-97.29%)

Mutual labels: japanese-language, pos-tagging

TALPCo

TUFS Asian Language Parallel Corpus

Stars: ✭ 32 (-94.22%)

Mutual labels: japanese, korean

kanji-web-app

Angular.js kanji web application

Stars: ✭ 45 (-91.88%)

Mutual labels: japanese, japanese-language

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Stars: ✭ 32 (-94.22%)

Mutual labels: tokenizer, morphological-analysis

KanjiRecognitionDictionary

Perfect for those who forgets kanji pronunciation

Stars: ✭ 14 (-97.47%)

Mutual labels: japanese, japanese-language

KWDLC

Kyoto University Web Document Leads Corpus

Stars: ✭ 64 (-88.45%)

Mutual labels: japanese, morphological-analysis

unofficial-jisho-api

Encapsulates the official Jisho.org API and also provides kanji, example, and stroke diagram search.

Stars: ✭ 88 (-84.12%)

Mutual labels: japanese, japanese-language

Charlescd

CharlesCD is an open source tool that makes deployments more agile, continuous and safe, which allows development teams to perform hypothesis validations with a specific group of users, simultaneously.

Stars: ✭ 275 (-50.36%)

Mutual labels: hacktoberfest, segmentation

Ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).

Stars: ✭ 433 (-21.84%)

Mutual labels: tokenizer, nlp-library

Domino-English-Translation

🌏 Let's translate Domino, a Japanese MIDI editor!

Stars: ✭ 29 (-94.77%)

Mutual labels: japanese, japanese-language

Kagome v2

Kagome is an open source Japanese morphological analyzer written in pure Go. Dictionaries and statistical models such as MeCab-IPADIC and UniDic (unidic-mecab) can be embedded in the binary.

Improvements over v1:

Dictionaries are maintained in a separate repository, and only the dictionaries you need are embedded in the binary.
Several APIs were refined and new ones were added.

Dictionaries

dict	source	package
MeCab IPADIC	mecab-ipadic-2.7.0-20070801	github.com/ikawaha/kagome-dict/ipa
UniDIC	unidic-mecab-2.1.2_src	github.com/ikawaha/kagome-dict/uni

Experimental Features

dict	source	package
mecab-ipadic-NEologd	mecab-ipadic-neologd	github.com/ikawaha/kagome-ipa-neologd
Korean MeCab	mecab-ko-dic-2.1.1-20180720	github.com/ikawaha/kagome-dict-ko

Segmentation mode for search

Like Kuromoji, Kagome offers segmentation modes designed for search:

Normal: Regular segmentation
Search: Uses a heuristic to split compound words further, which is useful for search
Extended: Like Search mode, but additionally splits unknown words into unigrams (single characters)

Untokenized	Normal	Search	Extended
関西国際空港	関西国際空港	関西　国際　空港	関西　国際　空港
日本経済新聞	日本経済新聞	日本　経済　新聞	日本　経済　新聞
シニアソフトウェアエンジニア	シニアソフトウェアエンジニア	シニア　ソフトウェア　エンジニア	シニア　ソフトウェア　エンジニア
デジカメを買った	デジカメ　を　買っ　た	デジカメ　を　買っ　た	デ　ジ　カ　メ　を　買っ　た
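
The last row of the table shows the effect of Extended mode on a word that is treated as unknown: デジカメ is emitted one character at a time. The following is a minimal standard-library sketch of that unigram step only. It is not Kagome's implementation, and the helper name `unigrams` is hypothetical. The point it illustrates is that the split has to iterate over runes, not bytes, because each katakana character occupies several bytes in UTF-8.

```go
package main

import (
	"fmt"
	"strings"
)

// unigrams splits s into one-character strings. Ranging over a string in Go
// yields runes, so multi-byte characters such as katakana stay intact.
// Hypothetical helper for illustration; not part of the kagome API.
func unigrams(s string) []string {
	out := make([]string, 0, len(s))
	for _, r := range s {
		out = append(out, string(r))
	}
	return out
}

func main() {
	// Join with spaces to mirror the table layout above.
	fmt.Println(strings.Join(unigrams("デジカメ"), " ")) // prints "デ ジ カ メ"
}
```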

Programming example

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)

	// tokenize
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

output:

---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
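
Each output line above is a surface form, a tab, and a comma-separated feature list. If you capture this text (for example from the command line tool) and want structured values, a small parser is enough. The sketch below uses only the standard library. The `Morpheme` type and `parseLine` function are hypothetical names, and the column meanings (first four columns are the part-of-speech hierarchy, seventh is the base form, eighth is the reading) are assumed from the IPADIC output shown above; other dictionaries use different layouts.

```go
package main

import (
	"fmt"
	"strings"
)

// Morpheme holds one parsed output line. The field names are illustrative
// labels for IPADIC feature columns, not types from the kagome package.
type Morpheme struct {
	Surface  string
	POS      []string // part-of-speech hierarchy with "*" placeholders removed
	BaseForm string
	Reading  string // left empty when the line has no reading column
}

// parseLine splits "surface<TAB>f1,f2,..." into a Morpheme.
func parseLine(line string) (Morpheme, error) {
	surface, rest, ok := strings.Cut(line, "\t")
	if !ok {
		return Morpheme{}, fmt.Errorf("no tab separator in %q", line)
	}
	f := strings.Split(rest, ",")
	if len(f) < 7 {
		return Morpheme{}, fmt.Errorf("want at least 7 features, got %d", len(f))
	}
	m := Morpheme{Surface: surface, BaseForm: f[6]}
	if len(f) > 7 {
		m.Reading = f[7]
	}
	for _, p := range f[:4] {
		if p != "*" {
			m.POS = append(m.POS, p)
		}
	}
	return m, nil
}

func main() {
	m, err := parseLine("うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ")
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Surface, m.POS, m.BaseForm, m.Reading)
}
```

When working inside a Go program, the `token.Features()` slice from the example above already provides the same columns without any string parsing.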

Reference

Commands

Install

env GO111MODULE=on go get -u github.com/ikawaha/kagome/v2

Homebrew tap

brew install ikawaha/kagome/kagome

Usage

$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   version - show version

tokenize [-file input_file] [-dict dic_file] [-userdict userdic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)]
  -dict string
    	dict
  -file string
    	input file
  -mode string
    	tokenize mode (normal|search|extended) (default "normal")
  -simple
    	display abbreviated dictionary contents
  -sysdict string
    	system dict type (ipa|uni) (default "ipa")
  -udict string
    	user dict

Tokenize command

% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

Server command

API

Start a server and try to access the "/tokenize" endpoint.

% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .

Web App

Start a server and open http://localhost:6060 in a browser. (The demo application uses Graphviz to draw the lattice, so Graphviz must be installed.)

% kagome server &

Lattice command

A debugging tool that outputs the lattice built during tokenization in Graphviz dot format.

% kagome lattice 私は鰻 | dot -Tpng -o lattice.png

Docker

Licence

MIT

Stars: ✭ 554