yoshoku / suika

License: BSD-3-Clause
Suika 🍉 is a Japanese morphological analyzer written in pure Ruby

Programming Languages

ruby

Projects that are alternatives to or similar to suika

simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (+3.23%)
Mutual labels:  tokenizer, morphological-analysis
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (+719.35%)
Mutual labels:  tokenizer, morphological-analysis
Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (+1687.1%)
Mutual labels:  tokenizer, morphological-analysis
Works For Me
Collection of developer toolkits
Stars: ✭ 131 (+322.58%)
Mutual labels:  tokenizer
Lex
Replaced by foonathan/lexy
Stars: ✭ 137 (+341.94%)
Mutual labels:  tokenizer
Roy VnTokenizer
Vietnamese tokenizer (Maximum Matching and CRF)
Stars: ✭ 49 (+58.06%)
Mutual labels:  tokenizer
Quantitative-Big-Imaging-2018
(Latest semester at https://github.com/kmader/Quantitative-Big-Imaging-2019) The material for the Quantitative Big Imaging course at ETHZ for the Spring Semester 2018
Stars: ✭ 50 (+61.29%)
Mutual labels:  morphological-analysis
Chevrotain
Parser Building Toolkit for JavaScript
Stars: ✭ 1,795 (+5690.32%)
Mutual labels:  tokenizer
yap
Yet Another (natural language) Parser
Stars: ✭ 40 (+29.03%)
Mutual labels:  morphological-analysis
greeb
Greeb is a simple Unicode-aware regexp-based tokenizer.
Stars: ✭ 16 (-48.39%)
Mutual labels:  tokenizer
Bitextor
Bitextor generates translation memories from multilingual websites.
Stars: ✭ 168 (+441.94%)
Mutual labels:  tokenizer
Udpipe
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Stars: ✭ 160 (+416.13%)
Mutual labels:  tokenizer
Neural-Morphological-Disambiguation-for-Turkish-DEPRECATED
Neural morphological disambiguation for Turkish. Implemented in DyNet
Stars: ✭ 11 (-64.52%)
Mutual labels:  morphological-analysis
Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Stars: ✭ 132 (+325.81%)
Mutual labels:  tokenizer
Tokenizer
A tokenizer for Icelandic text
Stars: ✭ 27 (-12.9%)
Mutual labels:  tokenizer
Fugashi
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
Stars: ✭ 125 (+303.23%)
Mutual labels:  tokenizer
lexertk
C++ Lexer Toolkit Library (LexerTk) https://www.partow.net/programming/lexertk/index.html
Stars: ✭ 26 (-16.13%)
Mutual labels:  tokenizer
Js Tokens
Tiny JavaScript tokenizer.
Stars: ✭ 166 (+435.48%)
Mutual labels:  tokenizer
Query Translator
Query Translator is a search query translator with AST representation
Stars: ✭ 165 (+432.26%)
Mutual labels:  tokenizer
sinling
A collection of NLP tools for Sinhalese (සිංහල).
Stars: ✭ 38 (+22.58%)
Mutual labels:  tokenizer

Suika

Suika 🍉 is a Japanese morphological analyzer written in pure Ruby.

Installation

Add this line to your application's Gemfile:

gem 'suika'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install suika

Usage

require 'suika'

tagger = Suika::Tagger.new
tagger.parse('すもももももももものうち').each { |token| puts token }

# すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
# も      助詞,係助詞,*,*,*,*,も,モ,モ
# もも    名詞,一般,*,*,*,*,もも,モモ,モモ
# も      助詞,係助詞,*,*,*,*,も,モ,モ
# もも    名詞,一般,*,*,*,*,もも,モモ,モモ
# の      助詞,連体化,*,*,*,*,の,ノ,ノ
# うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
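
Each line printed above pairs a surface form with a comma-separated feature list in the mecab-ipadic layout (part of speech, three sub-categories, conjugation type, conjugation form, base form, reading, pronunciation). As a minimal sketch, assuming each token is a string in which whitespace separates the surface from the features, the fields could be pulled apart as follows. The helper name `split_token` and the field labels are illustrative and not part of Suika's API.

```ruby
# Hypothetical helper, not part of Suika: splits one token line into its parts.
# Assumes the "surface<whitespace>feature,feature,..." layout shown above.
FEATURE_KEYS = %i[pos pos_detail1 pos_detail2 pos_detail3
                  conjugation_type conjugation_form
                  base_form reading pronunciation].freeze

def split_token(token)
  # Split once on the first run of whitespace: surface on the left, features on the right.
  surface, feature_str = token.to_s.split(/\s+/, 2)
  features = feature_str.to_s.split(',')
  # zip pads with nil when a token carries fewer fields than FEATURE_KEYS.
  { surface: surface }.merge(FEATURE_KEYS.zip(features).to_h)
end

info = split_token("すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ")
puts "#{info[:surface]} #{info[:pos]} #{info[:reading]}"
```

Returning a hash keeps call sites readable (`info[:base_form]`) without depending on field positions.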

The Tagger class loads the binary dictionary at initialization, so it is recommended to create one instance and reuse it rather than constructing a new one per sentence.

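require 'suika'

# Build the tagger once; loading the binary dictionary happens here.
tagger = Suika::Tagger.new

# `sentences` stands for any enumerable of strings.
sentences.each do |sentence|
  result = tagger.parse(sentence)

  # ...
end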
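
When the tagger is needed from several places in an application, one way to follow the advice above is to build it lazily and hand out the same instance each time. The following is a minimal sketch; `SharedTagger` is a hypothetical wrapper, not something Suika provides, and it takes the constructor as a block so that it does not depend on any particular Suika API.

```ruby
# Hypothetical wrapper, not part of Suika: builds the tagger on first use and
# returns the same instance on every later call.
module SharedTagger
  @lock = Mutex.new
  @instance = nil

  # The block constructs the tagger, e.g. SharedTagger.fetch { Suika::Tagger.new }.
  # It runs only on the first call; later calls return the cached instance.
  def self.fetch
    @lock.synchronize { @instance ||= yield }
  end
end
```

Call sites would then use `SharedTagger.fetch { Suika::Tagger.new }.parse(sentence)`. The mutex only guards construction; whether concurrent calls to `parse` on one instance are safe is not stated here, so treat that as something to verify separately.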

Test

Suika parsed every sentence in the Livedoor news corpus without raising an error.

require 'suika'

tagger = Suika::Tagger.new

Dir.glob('ldcc-20140209/text/*/*.txt').each do |filename|
  File.foreach(filename) do |sentence|
    sentence.strip!
    puts tagger.parse(sentence) unless sentence.empty?
  end
end
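
The loop above prints each result and would stop at the first exception. To get a count instead, a variant could rescue per line and tally failures. This is a minimal sketch under assumptions: `parse_all` is a hypothetical helper, not Suika's own test code, and it accepts any object that responds to `parse`.

```ruby
# Hypothetical helper, not part of Suika: parses every non-empty line and
# records failures instead of aborting on the first one.
def parse_all(tagger, lines)
  stats = { parsed: 0, failed: [] }
  lines.each do |line|
    sentence = line.strip
    next if sentence.empty?

    begin
      tagger.parse(sentence)
      stats[:parsed] += 1
    rescue StandardError => e
      # Keep the offending sentence and the error class for later inspection.
      stats[:failed] << [sentence, e.class.name]
    end
  end
  stats
end
```

With Suika installed, `parse_all(Suika::Tagger.new, File.foreach(filename))` would return the tally for one file; an empty `stats[:failed]` corresponds to the "no error" result described above.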

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/yoshoku/suika. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the BSD-3-Clause License. In addition, the gem includes binary data generated from mecab-ipadic. The details of the license can be found in LICENSE.txt and NOTICE.txt.

Respect

  • Taku Kudo is the author of MeCab, the most famous morphological analyzer in Japan. MeCab is one of the great pieces of software in natural language processing. Suika was created with reference to the book on morphological analysis written by Dr. Kudo.
  • Tomoko Uchida is the author of Janome, a Japanese morphological analysis engine written in pure Python. Suika is heavily influenced by Janome's idea of bundling a built-in dictionary and language model. Janome, a morphological analyzer written in a scripting language, gave me the courage to develop Suika.