taishi-i / Nagisa

License: MIT

A Japanese tokenizer based on recurrent neural networks

Programming Languages

python

139335 projects - #7 most used programming language

Labels

japanese nlp-library sequence-labeling pos-tagging word-segmentation

Projects that are alternatives of or similar to Nagisa

Pytorch-NLU

Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

Stars: ✭ 151 (-41.92%)

Mutual labels: word-segmentation, pos-tagging, sequence-labeling

Toiro

A comparison tool of Japanese tokenizers

Stars: ✭ 95 (-63.46%)

Mutual labels: japanese, nlp-library, word-segmentation

Kagome

Self-contained Japanese Morphological Analyzer written in pure Go

Stars: ✭ 554 (+113.08%)

Mutual labels: japanese, nlp-library, pos-tagging

Jumanpp

Juman++ (a Morphological Analyzer Toolkit)

Stars: ✭ 254 (-2.31%)

Mutual labels: japanese, pos-tagging, word-segmentation

Sudachipy

Python version of Sudachi, a Japanese tokenizer.

Stars: ✭ 207 (-20.38%)

Mutual labels: nlp-library, pos-tagging

Pythainlp

Thai Natural Language Processing in Python.

Stars: ✭ 582 (+123.85%)

Mutual labels: nlp-library, word-segmentation

rakutenma-python

Rakuten MA (Python version)

Stars: ✭ 15 (-94.23%)

Mutual labels: word-segmentation, pos-tagging

Kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search

Stars: ✭ 745 (+186.54%)

Mutual labels: japanese, nlp-library

Monpa

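MONPA (罔拍) is a multi-task model that provides Traditional Chinese word segmentation, part-of-speech tagging, and named entity recognition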

Stars: ✭ 203 (-21.92%)

Mutual labels: pos-tagging, word-segmentation

Multi Task Nlp

multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.

Stars: ✭ 221 (-15%)

Mutual labels: nlp-library, sequence-labeling

A Pytorch Tutorial To Sequence Labeling

Empower Sequence Labeling with Task-Aware Neural Language Model | a PyTorch Tutorial to Sequence Labeling

Stars: ✭ 257 (-1.15%)

Mutual labels: sequence-labeling, pos-tagging

Sudachi

A Japanese Tokenizer for Business

Stars: ✭ 496 (+90.77%)

Mutual labels: nlp-library, pos-tagging

Ekphrasis

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).

Stars: ✭ 433 (+66.54%)

Mutual labels: nlp-library, word-segmentation

fairseq-tagging

a Fairseq fork for sequence tagging/labeling tasks

Stars: ✭ 26 (-90%)

Mutual labels: pos-tagging, sequence-labeling

Pytorch ner bilstm cnn crf

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, implemented in PyTorch

Stars: ✭ 249 (-4.23%)

Mutual labels: sequence-labeling, pos-tagging

sequence labeling tf

Sequence Labeling in Tensorflow

Stars: ✭ 18 (-93.08%)

Mutual labels: pos-tagging, sequence-labeling

Vncorenlp

A Vietnamese natural language processing toolkit (NAACL 2018)

Stars: ✭ 354 (+36.15%)

Mutual labels: pos-tagging, word-segmentation

SynThai

Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning

Stars: ✭ 41 (-84.23%)

Mutual labels: word-segmentation, pos-tagging

pytorch Joint-Word-Segmentation-and-POS-Tagging

Paper: A Simple and Effective Neural Model for Joint Word Segmentation and POS Tagging

Stars: ✭ 37 (-85.77%)

Mutual labels: word-segmentation, pos-tagging

Nuts

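An experimental testbed of solutions for common natural language processing tasks (mainly text classification, sequence labeling, and automatic question answering)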

Stars: ✭ 21 (-91.92%)

Mutual labels: nlp-library, sequence-labeling

Nagisa is a Python module for Japanese word segmentation and POS-tagging. It is designed to be a simple and easy-to-use tool.

This tool has the following features.

Based on recurrent neural networks.
The word segmentation model uses character- and word-level features [池田+].
The POS-tagging model uses tag dictionary information [Inoue+].

For more details, refer to the following links.

The slides from PyCon JP 2019 are available here.
The article in Japanese is available here.
The documentation is available here.

Installation

Python 2.7.x or 3.5+ is required. This tool uses DyNet (the Dynamic Neural Network Toolkit) to calculate neural networks. You can install nagisa with the following command.

pip install nagisa

Windows users should run it with Python 3.6 or 3.7 (64-bit).

Basic usage

Sample of word segmentation and POS-tagging for Japanese.

import nagisa

text = 'Pythonで簡単に使えるツールです'
words = nagisa.tagging(text)
print(words)
#=> Python/名詞 で/助詞 簡単/形状詞 に/助動詞 使える/動詞 ツール/名詞 です/助動詞

# Get a list of words
print(words.words)
#=> ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']

# Get a list of POS-tags
print(words.postags)
#=> ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']
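
The two lists are aligned by position, so they can be combined with ordinary Python tools. The following is a minimal sketch that copies the output above as literal data, so it runs without nagisa installed; the names `tokens`, `pairs`, and `tag_counts` are illustrative and not part of the nagisa API.

```python
from collections import Counter

# Literal copies of words.words and words.postags from the example above.
tokens = ['Python', 'で', '簡単', 'に', '使える', 'ツール', 'です']
postags = ['名詞', '助詞', '形状詞', '助動詞', '動詞', '名詞', '助動詞']

# Pair each word with its tag; both lists have the same length.
pairs = list(zip(tokens, postags))

# Count how often each POS tag occurs in the sentence.
tag_counts = Counter(postags)

print(pairs[0])
print(tag_counts.most_common(3))
```

With a real tagger result, the same code would presumably work on `words.words` and `words.postags` directly.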

Post-processing functions

Filter and extract words by specific POS tags.

# Filter out the words with the specified POS tags.
words = nagisa.filter(text, filter_postags=['助詞', '助動詞'])
print(words)
#=> Python/名詞 簡単/形状詞 使える/動詞 ツール/名詞

# Extract only nouns.
words = nagisa.extract(text, extract_postags=['名詞'])
print(words)
#=> Python/名詞 ツール/名詞

# This is a list of available POS-tags in nagisa.
print(nagisa.tagger.postags)
#=> ['補助記号', '名詞', ... , 'URL']
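
Judging from the outputs above, `filter` drops words whose tag is in the given list and `extract` keeps only those words. The sketch below reproduces that behaviour in plain Python on literal (word, tag) pairs. It is not nagisa's implementation, and `filter_pairs` and `extract_pairs` are hypothetical helper names.

```python
# (word, tag) pairs taken from the basic usage example, copied as literals.
pairs = [
    ('Python', '名詞'), ('で', '助詞'), ('簡単', '形状詞'), ('に', '助動詞'),
    ('使える', '動詞'), ('ツール', '名詞'), ('です', '助動詞'),
]

def filter_pairs(pairs, filter_postags):
    """Drop every pair whose tag is listed in filter_postags."""
    excluded = set(filter_postags)
    return [(word, tag) for word, tag in pairs if tag not in excluded]

def extract_pairs(pairs, extract_postags):
    """Keep only the pairs whose tag is listed in extract_postags."""
    wanted = set(extract_postags)
    return [(word, tag) for word, tag in pairs if tag in wanted]

filtered = filter_pairs(pairs, ['助詞', '助動詞'])
nouns = extract_pairs(pairs, ['名詞'])
print(' '.join('{}/{}'.format(w, t) for w, t in filtered))
print(' '.join('{}/{}'.format(w, t) for w, t in nouns))
```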

Add a user dictionary in an easy way.

# default
text = "3月に見た「3月のライオン」"
print(nagisa.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3/名詞 月/名詞 の/助詞 ライオン/名詞 」/補助記号

# If a word ("3月のライオン") is included in the single_word_list, it is recognized as a single word.
new_tagger = nagisa.Tagger(single_word_list=['3月のライオン'])
print(new_tagger.tagging(text))
#=> 3/名詞 月/名詞 に/助詞 見/動詞 た/助動詞 「/補助記号 3月のライオン/名詞 」/補助記号

Train a model

Nagisa (v0.2.0+) provides a simple train method for a joint word segmentation and sequence labeling (e.g., POS-tagging, NER) model.

The train/dev/test files are in TSV format. Each line holds one word and its tag, separated by a tab (word \t tag). Mark sentence boundaries with a line containing only EOS. Refer to the sample datasets and the tutorial (Train a model for Universal Dependencies).

$ cat sample.train
唯一	NOUN
の	ADP
趣味	NOUN
は	ADP
料理	NOUN
EOS
とても	ADV
おいしかっ	ADJ
た	AUX
です	AUX
。	PUNCT
EOS
ドル	NOUN
は	ADP
主要	ADJ
通貨	NOUN
EOS
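
Files in this format can be produced and checked with a few lines of plain Python. The sketch below assumes, as in the sample, that every sentence (including the last) is followed by an EOS line. `write_corpus` and `read_corpus` are hypothetical helpers, not nagisa functions, and the code targets Python 3.

```python
import io

def write_corpus(sentences, fp):
    """Write sentences as 'word<TAB>tag' lines, with EOS after each sentence."""
    for sentence in sentences:
        for word, tag in sentence:
            fp.write("{}\t{}\n".format(word, tag))
        fp.write("EOS\n")

def read_corpus(fp):
    """Parse the same format back into a list of [(word, tag), ...] sentences."""
    sentences, current = [], []
    for line in fp:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        if line == "EOS":
            sentences.append(current)
            current = []
        else:
            word, tag = line.split("\t")
            current.append((word, tag))
    return sentences

sentences = [
    [("唯一", "NOUN"), ("の", "ADP"), ("趣味", "NOUN"), ("は", "ADP"), ("料理", "NOUN")],
    [("ドル", "NOUN"), ("は", "ADP"), ("主要", "ADJ"), ("通貨", "NOUN")],
]

# Round-trip through an in-memory buffer; use open(path, "w", encoding="utf-8") for real files.
buffer = io.StringIO()
write_corpus(sentences, buffer)
buffer.seek(0)
restored = read_corpus(buffer)
print(len(restored))
```

A round trip like this is a cheap way to catch malformed lines (a missing tab or a missing EOS) before starting a training run.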

# After training finishes, three model files (*.vocabs, *.params, *.hp) are saved.
nagisa.fit(train_file="sample.train", dev_file="sample.dev", test_file="sample.test", model_name="sample")

# Build the tagger by loading the trained model files.
sample_tagger = nagisa.Tagger(vocabs='sample.vocabs', params='sample.params', hp='sample.hp')

text = "福岡・博多の観光情報"
words = sample_tagger.tagging(text)
print(words)
#=> 福岡/PROPN ・/SYM 博多/PROPN の/ADP 観光/NOUN 情報/NOUN

Stars: ✭ 260
