ku-nlp / Jumanpp

Licence: apache-2.0

Juman++ (a Morphological Analyzer Toolkit)

Labels

nlp japanese tokenizer pos-tagging part-of-speech-tagger morphological-analysis word-segmentation

Projects that are alternatives of or similar to Jumanpp

Kagome

Self-contained Japanese Morphological Analyzer written in pure Go

Stars: ✭ 554 (+118.11%)

Mutual labels: japanese, tokenizer, morphological-analysis, pos-tagging

Nagisa

A Japanese tokenizer based on recurrent neural networks

Stars: ✭ 260 (+2.36%)

Mutual labels: japanese, pos-tagging, word-segmentation

Qutuf

Qutuf (قُطُوْف): An Arabic Morphological analyzer and Part-Of-Speech tagger as an Expert System.

Stars: ✭ 84 (-66.93%)

Mutual labels: morphological-analysis, pos-tagging, part-of-speech-tagger

rakutenma-python

Rakuten MA (Python version)

Stars: ✭ 15 (-94.09%)

Mutual labels: word-segmentation, pos-tagging, part-of-speech-tagger

sinling

A collection of NLP tools for Sinhalese (සිංහල).

Stars: ✭ 38 (-85.04%)

Mutual labels: tokenizer, pos-tagging

Fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

Stars: ✭ 125 (-50.79%)

Mutual labels: japanese, tokenizer

grasp

Essential NLP & ML, short & fast pure Python code

Stars: ✭ 58 (-77.17%)

Mutual labels: tokenizer, part-of-speech-tagger

pytorch Joint-Word-Segmentation-and-POS-Tagging

Paper: A Simple and Effective Neural Model for Joint Word Segmentation and POS Tagging

Stars: ✭ 37 (-85.43%)

Mutual labels: word-segmentation, pos-tagging

suika

Suika 🍉 is a Japanese morphological analyzer written in pure Ruby

Stars: ✭ 31 (-87.8%)

Mutual labels: tokenizer, morphological-analysis

datalinguist

Stanford CoreNLP in idiomatic Clojure.

Stars: ✭ 93 (-63.39%)

Mutual labels: pos-tagging, part-of-speech-tagger

simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Stars: ✭ 32 (-87.4%)

Mutual labels: tokenizer, morphological-analysis

Toiro

A comparison tool of Japanese tokenizers

Stars: ✭ 95 (-62.6%)

Mutual labels: japanese, word-segmentation

Kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search

Stars: ✭ 745 (+193.31%)

Mutual labels: japanese, part-of-speech-tagger

SynThai

Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning

Stars: ✭ 41 (-83.86%)

Mutual labels: word-segmentation, pos-tagging

KWDLC

Kyoto University Web Document Leads Corpus

Stars: ✭ 64 (-74.8%)

Mutual labels: japanese, morphological-analysis

udar

UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.

Stars: ✭ 15 (-94.09%)

Mutual labels: pos-tagging, morphological-analysis

GrammarEngine

Grammatical Dictionary of the Russian Language (+ English, Japanese, etc.)

Stars: ✭ 68 (-73.23%)

Mutual labels: part-of-speech-tagger, morphological-analysis

Ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).

Stars: ✭ 433 (+70.47%)

Mutual labels: tokenizer, word-segmentation

Udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

Stars: ✭ 160 (-37.01%)

Mutual labels: tokenizer, pos-tagging

Pytorch-NLU

Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, as well as sequence annotation tasks such as Chinese named entity recognition, part of speech ta…

Stars: ✭ 151 (-40.55%)

Mutual labels: word-segmentation, pos-tagging

What is Juman++

Juman++ is a morphological analyzer that considers the semantic plausibility of word sequences by using a recurrent neural network language model (RNNLM). Version 2 is more accurate than the original Juman++ and analyzes text more than 250x faster.

Installation

System Requirements

OS: Linux, MacOS X or Windows.
Compiler: C++14 compatible
- For example gcc 5.1+, clang 3.4+, MSVC 2017
- We test on GCC and clang on Linux/MacOS, mingw64-gcc and MSVC2017 on Windows

CMake v3.1 or later

Read this document for CentOS and RHEL derivatives or non-CMake alternatives.

Building from a package

Download the package from Releases

Important: The download should be around 300 MB. If it is not, you have probably downloaded a source snapshot, which does not contain a model.

$ tar xf jumanpp-<version>.tar.xz # decompress the package
$ cd jumanpp-<version> # move into the directory
$ mkdir bld # make a subdirectory for build
$ cd bld
$ # Release builds matter for performance; <prefix> is where Juman++ will be installed
$ cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=<prefix>
$ make install -j<parallelism>

Building from git

Important: Only the package distribution contains a pretrained model and can be used for analysis. The current git version is not compatible with the models of 2.0-rc1 and 2.0-rc2.

$ mkdir cmake-build-dir # CMake does not support in-source builds
$ cd cmake-build-dir
$ cmake ..
$ make # -j

Usage

Quick start

% echo "魅力がたっぷりと詰まっている" | jumanpp
魅力 みりょく 魅力 名詞 6 普通名詞 1 * 0 * 0 "代表表記:魅力/みりょく カテゴリ:抽象物"
が が が 助詞 9 格助詞 1 * 0 * 0 NIL
たっぷり たっぷり たっぷり 副詞 8 * 0 * 0 * 0 "自動認識"
と と と 助詞 9 格助詞 1 * 0 * 0 NIL
詰まって つまって 詰まる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:詰まる/つまる ドメイン:料理・食事 自他動詞:他:詰める/つめる"
いる いる いる 接尾辞 14 動詞性接尾辞 7 母音動詞 1 基本形 2 "代表表記:いる/いる"
EOS
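
Each output line in the sample above can be read as twelve space-separated fields, with EOS closing the sentence. The following is a minimal Python sketch of a parser written under that assumption. The field names (surface, reading, baseform and so on) are labels chosen here for illustration and inferred from the sample, not taken from a format specification, and the sketch handles only the line shape shown above.

```python
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    # Field names are illustrative labels inferred from the sample output.
    surface: str
    reading: str
    baseform: str
    pos: str
    pos_id: int
    subpos: str
    subpos_id: int
    conjtype: str
    conjtype_id: int
    conjform: str
    conjform_id: int
    features: str  # quoted trailing field; empty string when the output says NIL


def parse_juman_output(text: str) -> List[List[Morpheme]]:
    """Split analyzer output into sentences, each a list of Morpheme tuples."""
    sentences: List[List[Morpheme]] = []
    current: List[Morpheme] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "EOS":  # sentence terminator
            sentences.append(current)
            current = []
            continue
        # The last field may contain spaces, so split at most 11 times.
        parts = line.split(" ", 11)
        if len(parts) != 12:
            raise ValueError(f"unexpected line shape: {line!r}")
        features = "" if parts[11] == "NIL" else parts[11].strip('"')
        current.append(Morpheme(
            parts[0], parts[1], parts[2],
            parts[3], int(parts[4]),
            parts[5], int(parts[6]),
            parts[7], int(parts[8]),
            parts[9], int(parts[10]),
            features,
        ))
    return sentences
```

Lines that do not match this shape raise a ValueError rather than being silently skipped, so any output variant not covered by the sample is easy to notice.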

Main options

usage: jumanpp [options] 
  -s, --specifics              lattice format output (unsigned int [=5])
  --beam <int>                 set local beam width used in analysis (unsigned int [=5])
  -v, --version                print version
  -h, --help                   print this message
  --model <file>               specify a model location

Use --help to see more options.
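
As an illustration of how these options might be combined from a script, here is a minimal Python sketch that builds a command line from the --beam and --model options listed above and pipes text to the analyzer on standard input, as in the quick start. The helper names build_command and analyze are hypothetical, and the sketch assumes that jumanpp is on the PATH and that option values are passed as separate arguments.

```python
import subprocess
from typing import List, Optional


def build_command(beam: Optional[int] = None,
                  model: Optional[str] = None,
                  executable: str = "jumanpp") -> List[str]:
    """Build an argument list using only the options shown in the usage text."""
    cmd = [executable]
    if beam is not None:
        cmd += ["--beam", str(beam)]  # assumption: value passed as a separate argument
    if model is not None:
        cmd += ["--model", model]
    return cmd


def analyze(text: str, **options) -> str:
    """Pipe UTF-8 text to the analyzer on stdin and return its stdout."""
    if not text.endswith("\n"):
        text += "\n"
    completed = subprocess.run(
        build_command(**options),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout
```

For example, analyze("魅力がたっぷりと詰まっている", beam=5) would run jumanpp --beam 5 on that sentence, provided the binary and a model are installed.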

Input

Juman++ accepts only UTF-8 encoded text as input. Lines beginning with # are interpreted as comments.
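
Because of these two rules, it can help to check a file before analysis. The following Python sketch (the function name check_input is hypothetical) verifies that the data decodes as UTF-8 and reports which lines start with # and would therefore be read as comments. It only detects such lines; how to pass a literal leading # through the analyzer is not covered here.

```python
from typing import List


def check_input(data: bytes) -> List[int]:
    """Return 1-based numbers of lines that start with '#'.

    Raises UnicodeDecodeError if the data is not valid UTF-8.
    """
    text = data.decode("utf-8")
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith("#")
    ]
```

A typical use is check_input(open(path, "rb").read()) before handing the file to the analyzer.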

Other

DEMO

You can play around with our web demo, which displays a subset of the whole lattice. The demo still uses v1, but it will be updated to v2 soon.

Extracting diffs caused by beam configurations

You can find sentences for which two different beam configurations produce different analyses. The src/jumandic/jpp_jumandic_pathdiff binary (source), whose path is relative to the compilation root, does this. The only Jumandic-specific part of it is the use of code-generated linear model inference.

Use the binary as jpp_jumandic_pathdiff <model> <input> > <output>.

The output is in the partial annotation format: the full-beam results appear as the actual tags, and the trimmed-beam results are written as comments.

Example:

# scores: -0.602687 -1.20004
# 子がい        pos:名詞        subpos:普通名詞 <------- trimmed beam result
# S-ID:w201007-0080605751-6 COUNT:2
熊本選抜にはマリノス、アントラーズのユースに行く
        子      pos:名詞        subpos:普通名詞 <------- full beam result
        が      pos:助詞        subpos:格助詞
        い      baseform:いる   conjtype:母音動詞       pos:動詞        conjform:基本連用形
ます
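
To work with such output programmatically, one option is a small reader like the Python sketch below. It assumes only what the example shows: lines starting with # are comments, indented lines hold a surface form followed by key:value tags, and all other lines are unannotated text. The function name parse_partial_annotation is hypothetical, and the sketch is not a full implementation of the partial annotation format.

```python
from typing import Dict, List, Optional, Tuple

Item = Tuple[str, Optional[Dict[str, str]]]


def parse_partial_annotation(block: str) -> Tuple[List[str], List[Item]]:
    """Return (comments, items); items are (text, tags), tags is None for raw text."""
    comments: List[str] = []
    items: List[Item] = []
    for line in block.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Comment line, e.g. scores or the trimmed-beam result.
            comments.append(line[1:].strip())
        elif line[:1].isspace():
            # Indented line: a surface form followed by key:value tags.
            fields = line.split()
            tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
            items.append((fields[0], tags))
        else:
            # Unannotated span of the sentence.
            items.append((line, None))
    return comments, items
```

Joining the first element of every item reconstructs the sentence text, while the comments list carries the scores and the trimmed-beam analysis for comparison.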

Partial Annotation Tool

We also have a partial annotation tool. Please see https://github.com/eiennohito/nlp-tools-demo for details.

Performance Notes

To get the best performance, you need to build with extended instruction sets. If you are planning to use Juman++ only locally, specify -DCMAKE_CXX_FLAGS="-march=native".

Juman++ works best on Intel Haswell and newer processors because it benefits from the FMA and BMI instruction set extensions.

Using Juman++ to create your own Morphological Analyzer

Juman++ is a general tool. It does not depend on Jumandic or the Japanese language, although it does include some Japanese-specific functionality. See this tutorial project, which shows how to implement something similar to T9 text input for the case when there are no word boundaries in the input text.

Publications and Slides

About the model itself: Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model. Hajime Morita, Daisuke Kawahara, Sadao Kurohashi. EMNLP 2015 link, bibtex.
V2 Improvements: Juman++ v2: A Practical and Modern Morphological Analyzer. Arseny Tolmachev and Kurohashi Sadao. The Proceedings of the Twenty-fourth Annual Meeting of the Association for Natural Language Processing. March 2018, Okayama, Japan. (pdf, slides)
Morphological Analysis Workshop in ANLP2018 Slides: 形態素解析システムJuman++. 河原大輔, Arseny Tolmachev. (in Japanese) slides.
Juman++: A Morphological Analysis Toolkit for Scriptio Continua. Arseny Tolmachev, Daisuke Kawahara and Sadao Kurohashi. EMNLP 2018, Brussels. pdf, poster, bibtex.
Design and Structure of The Juman++ Morphological Analyzer Toolkit. Arseny Tolmachev, Daisuke Kawahara, Sadao Kurohashi. Journal of Natural Language Processing, (paper, bibtex).

If you use Juman++ V1 in an academic setting, please cite the first work (EMNLP 2015). If you use Juman++ V2, please cite both the first and the fourth (EMNLP 2018) papers.

Authors

Arseny Tolmachev <arseny at kotonoha.ws>
Hajime Morita <hmorita at nlp.ist.i.kyoto-u.ac.jp>
Daisuke Kawahara <dk at i.kyoto-u.ac.jp>
Sadao Kurohashi <kuro at i.kyoto-u.ac.jp>

Acknowledgement

The list of all libraries used by JUMAN++ is here.

Notice

This is a branch for the Juman++ rewrite. The original version lives in the legacy branch.

Stars: ✭ 254
