
atilika / Kuromoji

Licence: apache-2.0
Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search

Projects that are alternatives of or similar to Kuromoji

Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (-25.64%)
Mutual labels:  japanese, nlp-library
GrammarEngine
Grammatical Dictionary of the Russian Language (+ English, Japanese, etc.)
Stars: ✭ 68 (-90.87%)
Mutual labels:  nlp-library, part-of-speech-tagger
Lingo
package lingo provides the data structures and algorithms required for natural language processing
Stars: ✭ 113 (-84.83%)
Mutual labels:  nlp-library, part-of-speech-tagger
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (-65.91%)
Mutual labels:  japanese, part-of-speech-tagger
Toiro
A comparison tool of Japanese tokenizers
Stars: ✭ 95 (-87.25%)
Mutual labels:  japanese, nlp-library
Nagisa
A Japanese tokenizer based on recurrent neural networks
Stars: ✭ 260 (-65.1%)
Mutual labels:  japanese, nlp-library
Contextualized Topic Models
A python package to run contextualized topic modeling. CTMs combine BERT with topic models to get coherent topics. Also supports multilingual tasks. Cross-lingual Zero-shot model published at EACL 2021.
Stars: ✭ 318 (-57.32%)
Mutual labels:  nlp-library
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+2850.07%)
Mutual labels:  nlp-library
Yakuhanjp
Yakumono-Hankaku Only Web Fonts
Stars: ✭ 288 (-61.34%)
Mutual labels:  japanese
Site
Markdown source of the cpprefjp site
Stars: ✭ 275 (-63.09%)
Mutual labels:  japanese
Janome
Japanese morphological analysis engine written in pure Python
Stars: ✭ 630 (-15.44%)
Mutual labels:  nlp-library
Wanikani For Android
An android client application for the awesome kanji learning website wanikani.com
Stars: ✭ 506 (-32.08%)
Mutual labels:  japanese
Ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).
Stars: ✭ 433 (-41.88%)
Mutual labels:  nlp-library
Lingua
👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Stars: ✭ 341 (-54.23%)
Mutual labels:  nlp-library
Awesome Persian Nlp Ir
Curated List of Persian Natural Language Processing and Information Retrieval Tools and Resources
Stars: ✭ 460 (-38.26%)
Mutual labels:  part-of-speech-tagger
Giveme5w1h
Extraction of the journalistic five W and one H questions (5W1H) from news articles: who did what, when, where, why, and how?
Stars: ✭ 316 (-57.58%)
Mutual labels:  nlp-library
Quick Nlp
Pytorch NLP library based on FastAI
Stars: ✭ 279 (-62.55%)
Mutual labels:  nlp-library
Pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
Stars: ✭ 426 (-42.82%)
Mutual labels:  nlp-library
Sudachi
A Japanese Tokenizer for Business
Stars: ✭ 496 (-33.42%)
Mutual labels:  nlp-library
Kuroshiro
Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.
Stars: ✭ 386 (-48.19%)
Mutual labels:  japanese

Kuromoji

Kuromoji is an easy-to-use, self-contained Japanese morphological analyzer that provides:

  • Word segmentation. Segmenting text into words (or morphemes)
  • Part-of-speech tagging. Assign word-categories (nouns, verbs, particles, adjectives, etc.)
  • Lemmatization. Get dictionary forms for inflected verbs and adjectives
  • Readings. Extract readings for kanji

Several other features are supported. Please consult each dictionary's Token class for details.

Using Kuromoji

The example below shows how to use the Kuromoji morphological analyzer in its simplest form: segmenting text into tokens and printing the features of each token.

package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import java.util.List;

public class KuromojiExample {
    public static void main(String[] args) {
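        // Create a tokenizer for the IPADIC dictionary
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            // Print the surface form and all features, separated by a tab
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());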
        }
    }
}

Make sure you add the dependency below to your pom.xml before building your project.

<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>

When running the above program, you will get the following output:

お   接頭詞,名詞接続,*,*,*,*,お,オ,オ
寿司  名詞,一般,*,*,*,*,寿司,スシ,スシ
が   助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ  動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい  助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
。   記号,句点,*,*,*,*,。,。,。

See the documentation for the com.atilika.kuromoji.ipadic.Token class for more information on the per-token features available.
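
The feature string in the output above is a comma-separated list. As a rough sketch, assuming the nine-field IPADIC layout visible in the sample (four part-of-speech levels, conjugation type, conjugation form, base form, reading, pronunciation), the fields can be separated with plain Java. The class IpadicFeatures and its field names are hypothetical and are not part of Kuromoji. Strings with a different number of fields are rejected rather than guessed at, and feature values that themselves contain a comma are not handled.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; it is not part of Kuromoji.
// Assumes the nine-field IPADIC layout seen in the sample output.
final class IpadicFeatures {
    final String[] partOfSpeech;   // four levels, "*" where a level is unused
    final String conjugationType;  // e.g. 一段
    final String conjugationForm;  // e.g. 連用形
    final String baseForm;         // dictionary form, e.g. 食べる
    final String reading;          // katakana reading, e.g. タベ
    final String pronunciation;

    private IpadicFeatures(String[] fields) {
        partOfSpeech = Arrays.copyOfRange(fields, 0, 4);
        conjugationType = fields[4];
        conjugationForm = fields[5];
        baseForm = fields[6];
        reading = fields[7];
        pronunciation = fields[8];
    }

    // Splits on commas; a limit of -1 keeps trailing empty fields.
    static IpadicFeatures parse(String allFeatures) {
        String[] fields = allFeatures.split(",", -1);
        if (fields.length != 9) {
            throw new IllegalArgumentException(
                "expected 9 fields, got " + fields.length);
        }
        return new IpadicFeatures(fields);
    }

    public static void main(String[] args) {
        IpadicFeatures f = parse("動詞,自立,*,*,一段,連用形,食べる,タベ,タベ");
        // Part of speech, dictionary form and reading, separated by tabs
        System.out.println(f.partOfSpeech[0] + "\t" + f.baseForm + "\t" + f.reading);
    }
}
```

In application code the getters on the Token class are likely the better choice, since they do not depend on the string layout.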

Supported dictionaries

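Kuromoji currently supports the following dictionaries:

  • kuromoji-ipadic
  • kuromoji-ipadic-neologd
  • kuromoji-jumandic
  • kuromoji-naist-jdic
  • kuromoji-unidic
  • kuromoji-unidic-kanaaccent
  • kuromoji-unidic-neologd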

Question: So which of these dictionaries should I use?

Answer: That depends on your application. Yes, we know - it's a boring answer... :)

If you are not sure about which dictionary you should use, kuromoji-ipadic is a good starting point for many applications.

See the getters in the per-dictionary Token classes for some more information on available token features - or consult the technical dictionary documentation elsewhere. (We plan on adding better guidance on choosing a dictionary.)

Maven coordinates and user classes

Each dictionary has its own Maven coordinates, along with a Tokenizer and a Token class similar to those in the example above. These classes live in a dedicated package named after the dictionary type.

The sections below list fully qualified class names and the Maven coordinates for each dictionary supported.

kuromoji-ipadic

  • com.atilika.kuromoji.ipadic.Tokenizer
  • com.atilika.kuromoji.ipadic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-ipadic-neologd

  • com.atilika.kuromoji.ipadic.neologd.Tokenizer
  • com.atilika.kuromoji.ipadic.neologd.Token

This dictionary will be available from Maven Central in a future version.

kuromoji-jumandic

  • com.atilika.kuromoji.jumandic.Tokenizer
  • com.atilika.kuromoji.jumandic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-jumandic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-naist-jdic

  • com.atilika.kuromoji.naist.jdic.Tokenizer
  • com.atilika.kuromoji.naist.jdic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-naist-jdic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic

  • com.atilika.kuromoji.unidic.Tokenizer
  • com.atilika.kuromoji.unidic.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic-kanaaccent

  • com.atilika.kuromoji.unidic.kanaaccent.Tokenizer
  • com.atilika.kuromoji.unidic.kanaaccent.Token
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic-kanaaccent</artifactId>
  <version>0.9.0</version>
</dependency>

kuromoji-unidic-neologd

  • com.atilika.kuromoji.unidic.neologd.Tokenizer
  • com.atilika.kuromoji.unidic.neologd.Token

This dictionary will be available from Maven Central in a future version.
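
The class names listed for each dictionary follow a regular pattern: the artifact name without its kuromoji- prefix, with hyphens turned into dots, is appended to com.atilika.kuromoji. The sketch below derives the class names from an artifactId under that assumption. DictionaryClasses and its methods are hypothetical helpers, not part of Kuromoji, and the rule is only inferred from the dictionary artifacts listed here.

```java
// Hypothetical helper for illustration; it is not part of Kuromoji.
// The naming rule is inferred from the dictionary listings in this README.
final class DictionaryClasses {
    private static final String ARTIFACT_PREFIX = "kuromoji-";
    private static final String BASE_PACKAGE = "com.atilika.kuromoji.";

    // "kuromoji-naist-jdic" -> "com.atilika.kuromoji.naist.jdic"
    static String packageFor(String artifactId) {
        if (!artifactId.startsWith(ARTIFACT_PREFIX)
                || artifactId.length() == ARTIFACT_PREFIX.length()) {
            throw new IllegalArgumentException(
                "not a dictionary artifact: " + artifactId);
        }
        String dictionary = artifactId.substring(ARTIFACT_PREFIX.length());
        return BASE_PACKAGE + dictionary.replace('-', '.');
    }

    static String tokenizerClassFor(String artifactId) {
        return packageFor(artifactId) + ".Tokenizer";
    }

    static String tokenClassFor(String artifactId) {
        return packageFor(artifactId) + ".Token";
    }

    public static void main(String[] args) {
        System.out.println(tokenizerClassFor("kuromoji-unidic-kanaaccent"));
        System.out.println(tokenClassFor("kuromoji-unidic-kanaaccent"));
    }
}
```

Switching dictionaries then mostly amounts to changing the Maven artifactId and the two import lines.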

Building Kuromoji from source code

Released versions of Kuromoji are available from Maven Central.

If you want to build Kuromoji from source code, run the following command:

$ mvn clean package

This will download all source dictionary data and build Kuromoji with all dictionaries. The following jars will then be available:

kuromoji-core/target/kuromoji-core-1.0-SNAPSHOT.jar
kuromoji-ipadic/target/kuromoji-ipadic-1.0-SNAPSHOT.jar
kuromoji-ipadic-neologd/target/kuromoji-ipadic-neologd-1.0-SNAPSHOT.jar
kuromoji-jumandic/target/kuromoji-jumandic-1.0-SNAPSHOT.jar
kuromoji-naist-jdic/target/kuromoji-naist-jdic-1.0-SNAPSHOT.jar
kuromoji-unidic/target/kuromoji-unidic-1.0-SNAPSHOT.jar
kuromoji-unidic-kanaaccent/target/kuromoji-unidic-kanaaccent-1.0-SNAPSHOT.jar
kuromoji-unidic-neologd/target/kuromoji-unidic-neologd-1.0-SNAPSHOT.jar

The following additional build options are available:

  • -DskipCompileDictionary Do not recompile the dictionaries
  • -DskipDownloadDictionary Do not download source dictionaries
  • -DbenchmarkTokenizers Profile each tokenizer during the package phase using content from Japanese Wikipedia
  • -DskipDownloadWikipedia Prevent the compressed version of the Japanese Wikipedia (~765 MB) from being downloaded during profiling, e.g. if it has already been downloaded.

License

Kuromoji is licensed under the Apache License, Version 2.0. See LICENSE.md for details.

This software also includes a binary and/or source version of data from various 3rd party dictionaries. See NOTICE.md for these details.

Contributing

Please open up issues if you have a feature request. We also welcome contributions through pull requests.

You will retain copyright to your own contributions, but you need to license them using the Apache License, Version 2.0. All contributors will be mentioned in the CONTRIBUTORS.md file.

About us

We are a small team of experienced software engineers based in Tokyo who offer technologies and good advice in the fields of search, natural language processing and big data analytics.

Please feel free to contact us at [email protected] if you have any questions or need help.
