ikegami-yukino / rakutenma-python

Licence: Apache-2.0 License
Rakuten MA (Python version)


Rakuten MA Python

Rakuten MA Python is a Python port of Rakuten MA, a morphological analyzer (word segmenter + PoS tagger) for Chinese and Japanese.

For details about Rakuten MA, see https://github.com/rakuten-nlp/rakutenma

See also http://qiita.com/yukinoi/items/925bc238185aa2fad8a7 (in Japanese)

Contributions are welcome!

Installation

pip install rakutenma

Example

from rakutenma import RakutenMA

# Initialize a RakutenMA instance with an empty model
# (the default Japanese feature set is already configured)
rma = RakutenMA()

# Let's analyze a sample sentence (from http://tatoeba.org/jpn/sentences/show/103809)
# With a disastrous result, since the model is empty!
print(rma.tokenize("彼は新しい仕事できっと成功するだろう。"))

# Feed the model with ten sample sentences from tatoeba.org
# "tatoeba.json" is available at https://github.com/rakuten-nlp/rakutenma
import json
# Open the file explicitly as UTF-8 and close it when done
with open("tatoeba.json", encoding="utf-8") as f:
    tatoeba = json.load(f)
for sentence in tatoeba:
    rma.train_one(sentence)

# Now what does the result look like?
print(rma.tokenize("彼は新しい仕事できっと成功するだろう。"))

# Initialize a RakutenMA instance, then load a pre-trained model
rma = RakutenMA(phi=1024, c=0.007812)  # Specify SCW hyperparameters (for demonstration purposes)
rma.load("model_ja.json")

# Set the feature hash function (15bit)
rma.hash_func = rma.create_hash_func(15)

# Tokenize one sample sentence
print(rma.tokenize("うらにわにはにわにわとりがいる"))

# Re-train the model feeding the right answer (pairs of [token, PoS tag])
res = rma.train_one(
       [["うらにわ","N-nc"],
        ["に","P-k"],
        ["は","P-rj"],
        ["にわ","N-n"],
        ["にわとり","N-nc"],
        ["が","P-k"],
        ["いる","V-c"]])
# The result of train_one contains:
#   sys: the system output (using the current model)
#   ans: answer fed by the user
#   update: whether the model was updated
print(res)

# Now what does the result look like?
print(rma.tokenize("うらにわにはにわにわとりがいる"))
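
To quantify how far a system output is from the right answer, one option is to compare character spans. The sketch below is not part of rakutenma; the helper names `token_spans` and `compare` are hypothetical, and it assumes both inputs are lists of [token, PoS tag] pairs in the same format that `train_one` accepts above. The `sys_tokens` value is an invented output for illustration, not a real result from the model.

```python
def token_spans(tokens):
    """Convert [token, tag] pairs into a set of (start, end, tag) character spans."""
    spans = set()
    pos = 0
    for surface, tag in tokens:
        spans.add((pos, pos + len(surface), tag))
        pos += len(surface)
    return spans


def compare(sys_tokens, ans_tokens):
    """Return (precision, recall) of the system spans against the answer spans."""
    sys_spans = token_spans(sys_tokens)
    ans_spans = token_spans(ans_tokens)
    correct = len(sys_spans & ans_spans)
    precision = correct / len(sys_spans) if sys_spans else 0.0
    recall = correct / len(ans_spans) if ans_spans else 0.0
    return precision, recall


# The right answer, copied from the training example above
ans_tokens = [["うらにわ", "N-nc"], ["に", "P-k"], ["は", "P-rj"], ["にわ", "N-n"],
              ["にわとり", "N-nc"], ["が", "P-k"], ["いる", "V-c"]]

# Hypothetical system output that wrongly merges "にわ" and "にわとり"
sys_tokens = [["うらにわ", "N-nc"], ["に", "P-k"], ["は", "P-rj"],
              ["にわにわとり", "N-nc"], ["が", "P-k"], ["いる", "V-c"]]

precision, recall = compare(sys_tokens, ans_tokens)
print("precision=%.3f recall=%.3f" % (precision, recall))
```

A span counts as correct only when its start, end, and PoS tag all match, so a single wrong merge costs one system span and two answer spans.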

NOTE

Added API

Compared with the original Rakuten MA, the following methods have been added:

  • RakutenMA::load(model_path) - Load a model from a JSON file
  • RakutenMA::save(model_path) - Save the model to the given path
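
A typical use of these two methods is to train once, save the result, and load it in a later session. The helper below is a minimal sketch, not part of the library: `train_and_save`, its `epochs` parameter, and the file names in the usage note are hypothetical. It only calls `train_one()` and `save()`, and it assumes the corpus file is a JSON list of sentences in the same [token, PoS tag] format as tatoeba.json.

```python
import json


def train_and_save(rma, corpus_path, model_path, epochs=1):
    """Train `rma` on a JSON corpus of [token, tag] sentences, then save the model.

    `rma` is expected to be a RakutenMA instance; only train_one() and save()
    are called on it. Returns the number of sentences in the corpus.
    """
    with open(corpus_path, encoding="utf-8") as f:
        sentences = json.load(f)
    for _ in range(epochs):
        for sentence in sentences:
            rma.train_one(sentence)
    rma.save(model_path)
    return len(sentences)
```

With rakutenma installed, this could be called as `train_and_save(RakutenMA(), "tatoeba.json", "my_model.json")`, and a fresh instance could then restore the model with `rma.load("my_model.json")`.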

Misc

By default, the following values are set:

  • rma.featset = CTYPE_JA_PATTERNS # RakutenMA.default_featset_ja
  • rma.hash_func = rma.create_hash_func(15)
  • rma.tag_scheme = "SBIEO" # if using Chinese, set "IOB2"
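
For readers unfamiliar with the two tag scheme names, the sketch below shows how such schemes are conventionally defined: SBIEO marks each character as a Single-character token, or as the Beginning, Inside, or End of a longer token, while IOB2 only distinguishes the Beginning from the rest. This is an illustration written for this README, not code taken from rakutenma; the function name `tokens_to_char_tags` and the "prefix-PoS" label format are assumptions, and the library's internal representation may differ.

```python
def tokens_to_char_tags(tokens, scheme="SBIEO"):
    """Expand [token, PoS] pairs into (character, label) pairs for a tag scheme."""
    result = []
    for surface, pos in tokens:
        last = len(surface) - 1
        for i, ch in enumerate(surface):
            if scheme == "SBIEO":
                if len(surface) == 1:
                    prefix = "S"   # single-character token
                elif i == 0:
                    prefix = "B"   # first character of a longer token
                elif i == last:
                    prefix = "E"   # last character of a longer token
                else:
                    prefix = "I"   # any character in between
            elif scheme == "IOB2":
                prefix = "B" if i == 0 else "I"
            else:
                raise ValueError("unknown tag scheme: %r" % scheme)
            result.append((ch, prefix + "-" + pos))
    return result


tokens = [["にわとり", "N-nc"], ["が", "P-k"]]
for scheme in ("SBIEO", "IOB2"):
    print(scheme, tokens_to_char_tags(tokens, scheme))
```

The "O" label in both schemes marks characters outside any token; it does not appear here because every character in the sample belongs to a token.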

LICENSE

Apache License version 2.0

Copyright

Rakuten MA Python (c) 2015- Yukino Ikegami. All Rights Reserved.

Rakuten MA (original) (c) 2014 Rakuten NLP Project. All Rights Reserved.
