
taishi-i / Nagisa

Licence: MIT
A Japanese tokenizer based on recurrent neural networks

Programming Languages

Python

Projects that are alternatives to or similar to Nagisa

Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (-41.92%)
Mutual labels:  word-segmentation, pos-tagging, sequence-labeling
Toiro
A comparison tool of Japanese tokenizers
Stars: ✭ 95 (-63.46%)
Mutual labels:  japanese, nlp-library, word-segmentation
Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (+113.08%)
Mutual labels:  japanese, nlp-library, pos-tagging
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (-2.31%)
Mutual labels:  japanese, pos-tagging, word-segmentation
Sudachipy
Python version of Sudachi, a Japanese tokenizer.
Stars: ✭ 207 (-20.38%)
Mutual labels:  nlp-library, pos-tagging
Pythainlp
Thai Natural Language Processing in Python.
Stars: ✭ 582 (+123.85%)
Mutual labels:  nlp-library, word-segmentation
rakutenma-python
Rakuten MA (Python version)
Stars: ✭ 15 (-94.23%)
Mutual labels:  word-segmentation, pos-tagging
Kuromoji
Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Stars: ✭ 745 (+186.54%)
Mutual labels:  japanese, nlp-library
Monpa
MONPA (罔拍) is a multi-task model that provides Traditional Chinese word segmentation, POS tagging, and named entity recognition.
Stars: ✭ 203 (-21.92%)
Mutual labels:  pos-tagging, word-segmentation
Multi Task Nlp
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Stars: ✭ 221 (-15%)
Mutual labels:  nlp-library, sequence-labeling
A Pytorch Tutorial To Sequence Labeling
Empower Sequence Labeling with Task-Aware Neural Language Model | a PyTorch Tutorial to Sequence Labeling
Stars: ✭ 257 (-1.15%)
Mutual labels:  sequence-labeling, pos-tagging
Sudachi
A Japanese Tokenizer for Business
Stars: ✭ 496 (+90.77%)
Mutual labels:  nlp-library, pos-tagging
Ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia, Twitter - 330mil English tweets).
Stars: ✭ 433 (+66.54%)
Mutual labels:  nlp-library, word-segmentation
fairseq-tagging
a Fairseq fork for sequence tagging/labeling tasks
Stars: ✭ 26 (-90%)
Mutual labels:  pos-tagging, sequence-labeling
Pytorch ner bilstm cnn crf
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF implemented in PyTorch
Stars: ✭ 249 (-4.23%)
Mutual labels:  sequence-labeling, pos-tagging
sequence labeling tf
Sequence Labeling in Tensorflow
Stars: ✭ 18 (-93.08%)
Mutual labels:  pos-tagging, sequence-labeling
Vncorenlp
A Vietnamese natural language processing toolkit (NAACL 2018)
Stars: ✭ 354 (+36.15%)
Mutual labels:  pos-tagging, word-segmentation
SynThai
Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning
Stars: ✭ 41 (-84.23%)
Mutual labels:  word-segmentation, pos-tagging
pytorch Joint-Word-Segmentation-and-POS-Tagging
Paper: A Simple and Effective Neural Model for Joint Word Segmentation and POS Tagging
Stars: ✭ 37 (-85.77%)
Mutual labels:  word-segmentation, pos-tagging
Nuts
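A testing ground for solutions to common natural language processing tasks (mainly text classification, sequence labeling, automatic question answering, etc.)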
Stars: ✭ 21 (-91.92%)
Mutual labels:  nlp-library, sequence-labeling



Nagisa is a Python module for Japanese word segmentation and POS tagging. It is designed to be a simple and easy-to-use tool.

This tool has the following features.

  • Based on recurrent neural networks.
  • The word segmentation model uses character- and word-level features [池田+].
  • The POS-tagging model uses tag dictionary information [Inoue+].

For more details, refer to the following links.

  • The slides from PyCon JP 2019 are available here.
  • The article in Japanese is available here.
  • The documentation is available here.

Installation

Python 2.7.x or 3.5+ is required. This tool uses DyNet (the Dynamic Neural Network Toolkit) to compute neural networks. You can install nagisa with the following command.

pip install nagisa

Windows users should run it with Python 3.6 or 3.7 (64-bit).

Basic usage

A sample of word segmentation and POS tagging for Japanese text.

import nagisa

text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

# Get a list of words
print(words.words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

# Get a list of POS-tags
print(words.postags)
#=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
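
Because `words.words` and `words.postags` are parallel lists, they can be combined with ordinary Python tools. The sketch below copies the two lists printed above as literal values, so it runs without nagisa installed, then pairs each word with its tag and counts how often each tag occurs. The helper name `pair_tokens` is hypothetical and not part of nagisa.

```python
from collections import Counter

# Literal copies of the lists printed above; with nagisa installed these
# would come from `words.words` and `words.postags`.
tokens = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']


def pair_tokens(tokens, postags):
    """Return a list of (word, tag) tuples built from two parallel lists."""
    if len(tokens) != len(postags):
        raise ValueError("tokens and postags must have the same length")
    return list(zip(tokens, postags))


pairs = pair_tokens(tokens, postags)

# Count how many times each POS tag appears in the sentence.
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[0])
print(tag_counts['名詞'])
```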

Post-processing functions

Filter and extract words by specific POS tags.

# Filter out the words with the specified POS tags.
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

# Extract only nouns.
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
#=> Python/名詞 ツール/名詞

# This is a list of available POS-tags in nagisa.
print(nagisa.tagger.postags)
#=> ['補助記号', '名詞', ... , 'URL']
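
To make the difference between the two calls concrete, the following sketch reproduces their observable behavior on plain (word, tag) pairs. It is an illustration only, not nagisa's implementation: the function names `filter_by_postags` and `extract_by_postags` are hypothetical, and the input pairs are copied from the tagging output shown earlier.

```python
def filter_by_postags(pairs, filter_postags):
    """Drop every (word, tag) pair whose tag is listed in filter_postags."""
    excluded = set(filter_postags)
    return [(w, t) for w, t in pairs if t not in excluded]


def extract_by_postags(pairs, extract_postags):
    """Keep only the (word, tag) pairs whose tag is listed in extract_postags."""
    wanted = set(extract_postags)
    return [(w, t) for w, t in pairs if t in wanted]


def render(pairs):
    """Format pairs in the word/tag style used by the printed output."""
    return ' '.join('{}/{}'.format(w, t) for w, t in pairs)


# Pairs copied from the tagging result for 'Pythonで簡単に使えるツールです'.
pairs = [('Python', '名詞'), ('で', '助詞'), ('簡単', '形状詞'), ('に', '助動詞'),
         ('使える', '動詞'), ('ツール', '名詞'), ('です', '助動詞')]

kept = filter_by_postags(pairs, ['助詞', '助動詞'])
nouns = extract_by_postags(pairs, ['名詞'])

print(render(kept))   # Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞
print(render(nouns))  # Python/名詞 ツール/名詞
```

Filtering removes the listed tags and keeps everything else, while extracting keeps only the listed tags, so the two are complements when given the same tag list.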

Add a user dictionary in an easy way.

# Default behavior
text = "3月に見た「3月のライオン」"
print(nagisa.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号

# If a word ("3月のライオン") is included in the single_word_list, it is recognized as a single word.
new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
print(new_tagger.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号
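
When the list of words grows, it may be more convenient to keep it in a text file. The sketch below builds a `single_word_list` from lines of text, one word per line. The file layout (blank lines and `#` comment lines are skipped) and the helper name `load_single_word_list` are assumptions made for this example, not a format defined by nagisa.

```python
import io


def load_single_word_list(lines):
    """Build a de-duplicated word list from an iterable of text lines.

    Blank lines and lines starting with '#' are skipped (an assumed
    convention for this sketch, not a nagisa file format).
    """
    words = []
    seen = set()
    for line in lines:
        word = line.strip()
        if not word or word.startswith('#') or word in seen:
            continue
        seen.add(word)
        words.append(word)
    return words


# In-memory stand-in for a real file opened with encoding='utf-8'.
sample_file = io.StringIO(u"# titles\n3月のライオン\n\n3月のライオン\n")
single_word_list = load_single_word_list(sample_file)
print(single_word_list)

# With nagisa installed, the list can be passed to the Tagger as shown above:
# new_tagger = nagisa.Tagger(single_word_list=single_word_list)
```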

Train a model

Nagisa (v0.2.0+) provides a simple train method for a joint word segmentation and sequence labeling (e.g., POS-tagging, NER) model.

The train/dev/test files are in TSV format. Each line holds one word and its tag, separated by a tab (word \t tag), and an EOS line marks the end of each sentence. Refer to the sample datasets and the tutorial (Train a model for Universal Dependencies).

$ cat sample.train
唯一	NOUN
の	ADP
趣味	NOUN
は	ADP
料理	NOUN
EOS
とても	ADV
おいしかっ	ADJ
た	AUX
です	AUX
。	PUNCT
EOS
ドル	NOUN
は	ADP
主要	ADJ
通貨	NOUN
EOS

# After training finishes, three model files (*.vocabs, *.params, *.hp) are saved.
nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")

# Build the tagger by loading the trained model files.
sample_tagger = nagisa.Tagger(vocabs='sample.vocabs', params='sample.params', hp='sample.hp')

text = "福岡・博多の観光情報"
words = sample_tagger.tagging(text)
print(words)
#=> 福岡/PROPN ・/SYM 博多/PROPN の/ADP 観光/NOUN 情報/NOUN
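
Before calling `nagisa.fit`, it can help to sanity-check a training file against the format described above (one word and tag per line separated by a tab, with an EOS line after each sentence). The following is a minimal sketch of such a check in plain Python. The function name `read_tagged_sentences` is hypothetical, and the rule that the last sentence must also end with EOS is an assumption of this sketch rather than a documented nagisa requirement.

```python
import io


def read_tagged_sentences(lines):
    """Parse 'word<TAB>tag' lines into sentences, split at 'EOS' lines."""
    sentences, current = [], []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip('\r\n')
        if line == 'EOS':
            # Sentence boundary: store the collected tokens and start over.
            if current:
                sentences.append(current)
            current = []
            continue
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split('\t')
        if len(parts) != 2 or not parts[0] or not parts[1]:
            raise ValueError(
                'line {}: expected word<TAB>tag, got {!r}'.format(lineno, line))
        current.append((parts[0], parts[1]))
    if current:
        # Assumption of this sketch: every sentence is closed by EOS.
        raise ValueError('last sentence is not terminated by EOS')
    return sentences


# In-memory stand-in for a file such as sample.train.
sample = io.StringIO(
    u"唯一\tNOUN\nの\tADP\n趣味\tNOUN\nは\tADP\n料理\tNOUN\nEOS\n"
    u"ドル\tNOUN\nは\tADP\nEOS\n")
sentences = read_tagged_sentences(sample)
print(len(sentences))
print(sentences[1])
```

A malformed line, such as one that uses spaces instead of a tab, raises a `ValueError` that names the offending line number.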