========
Pykakasi
========

Overview
========

.. image:: https://readthedocs.org/projects/pykakasi/badge/?version=latest
   :target: https://pykakasi.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: https://badge.fury.io/py/pykakasi.png
   :target: http://badge.fury.io/py/Pykakasi
   :alt: PyPI version

.. image:: https://github.com/miurahr/pykakasi/workflows/Run%20Tox%20tests/badge.svg
   :target: https://github.com/miurahr/pykakasi/actions?query=workflow%3A%22Run+Tox+tests%22
   :alt: Run Tox tests

.. image:: https://dev.azure.com/miurahr/github/_apis/build/status/miurahr.pykakasi?branchName=master
   :target: https://dev.azure.com/miurahr/github/_build?definitionId=13&branchName=master
   :alt: Azure-Pipelines

.. image:: https://coveralls.io/repos/miurahr/pykakasi/badge.svg?branch=master
   :target: https://coveralls.io/r/miurahr/pykakasi?branch=master
   :alt: Coverage status

pykakasi is a Python Natural Language Processing (NLP) library to transliterate hiragana, katakana and kanji (Japanese text) into rōmaji (Latin/Roman alphabet). It can handle characters in NFC form.

It is based on the kakasi_ library, which is written in C.

  • Install (from PyPI_): ``pip install pykakasi``
  • `Documentation available on readthedocs`_

.. _PyPI: https://pypi.org/project/pykakasi/
.. _kakasi: http://kakasi.namazu.org/
.. _Documentation available on readthedocs: https://pykakasi.readthedocs.io/en/latest/index.html

Supported Python versions
-------------------------

  • pykakasi 1.2 supports Python 2.7, 3.5, 3.6 and 3.7

  • pykakasi 2.0 supports Python 3.6, 3.7, 3.8 and pypy3.6-7.1.1

Usage
=====

The following example uses the new API, available in pykakasi v2.0.0 and later, to transliterate Japanese text to katakana, hiragana and rōmaji:

.. code-block:: python

    import pykakasi

    kks = pykakasi.kakasi()
    text = "かな漢字"
    # convert() returns one dictionary per segment of the input text
    result = kks.convert(text)
    for item in result:
        print("{}: kana '{}', hiragana: '{}', romaji: '{}'".format(
            item['orig'], item['kana'], item['hira'], item['hepburn']))

    # Output:
    # かな: kana 'カナ', hiragana: 'かな', romaji: 'kana'
    # 漢字: kana 'カンジ', hiragana: 'かんじ', romaji: 'kanji'
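
Each item returned by ``convert()`` covers one segment of the input, so a reading for the whole sentence has to be assembled from the pieces. The following is a minimal sketch of one way to do that. ``join_field`` is a hypothetical helper, not part of pykakasi, and ``sample`` is copied by hand from the output shown above so that the snippet runs without the library installed.

.. code-block:: python

    def join_field(items, key, sep=""):
        """Join one field (e.g. 'hepburn') across all segments."""
        return sep.join(item[key] for item in items)


    # Hand-written copy of the documented result for "かな漢字";
    # real results may carry additional keys, which are ignored here.
    sample = [
        {"orig": "かな", "kana": "カナ", "hira": "かな", "hepburn": "kana"},
        {"orig": "漢字", "kana": "カンジ", "hira": "かんじ", "hepburn": "kanji"},
    ]

    print(join_field(sample, "hepburn", sep=" "))  # kana kanji
    print(join_field(sample, "hira"))              # かなかんじ
    print(join_field(sample, "kana"))              # カナカンジ

With the library installed, the same helper can be applied directly to ``kks.convert(text)`` in place of ``sample``.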

The next example produces output similar to furigana mode:

.. code-block:: python

    import pykakasi

    kks = pykakasi.kakasi()
    text = "かな漢字交じり文"
    result = kks.convert(text)
    # Print each segment followed by its capitalized romaji reading in brackets
    for item in result:
        print("{}[{}] ".format(item['orig'], item['hepburn'].capitalize()), end='')
    print()

    # Output:
    # かな[Kana] 漢字[Kanji] 交じり[Majiri] 文[Bun]
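
The example above annotates every segment, including ones that are already written in kana. A possible variation, sketched below, adds a hiragana reading only to segments that differ from their own kana form. ``annotate_readings`` is a hypothetical helper, and ``sample`` is hand-written illustrative data in the shape of a ``convert()`` result, chosen to agree with the romaji shown above. It is not captured library output.

.. code-block:: python

    def annotate_readings(items):
        """Append a hiragana reading in brackets to non-kana segments."""
        parts = []
        for item in items:
            orig = item["orig"]
            # A segment written purely in kana equals its own
            # hiragana or katakana reading, so it needs no annotation.
            if orig in (item["hira"], item["kana"]):
                parts.append(orig)
            else:
                parts.append("{}[{}]".format(orig, item["hira"]))
        return " ".join(parts)


    # Illustrative data only (assumed readings for "かな漢字交じり文").
    sample = [
        {"orig": "かな", "kana": "カナ", "hira": "かな"},
        {"orig": "漢字", "kana": "カンジ", "hira": "かんじ"},
        {"orig": "交じり", "kana": "マジリ", "hira": "まじり"},
        {"orig": "文", "kana": "ブン", "hira": "ぶん"},
    ]

    print(annotate_readings(sample))
    # かな 漢字[かんじ] 交じり[まじり] 文[ぶん]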

Old API
-------

The old API from v1.2 is also available.

Transliterate Japanese text to rōmaji:

.. code-block:: pycon

    >>> import pykakasi
    >>>
    >>> text = u"かな漢字交じり文"
    >>> kakasi = pykakasi.kakasi()
    >>> kakasi.setMode("H","a") # Hiragana to ascii, default: no conversion
    >>> kakasi.setMode("K","a") # Katakana to ascii, default: no conversion
    >>> kakasi.setMode("J","a") # Japanese to ascii, default: no conversion
    >>> kakasi.setMode("r","Hepburn") # default: use Hepburn Roman table
    >>> kakasi.setMode("s", True) # add space, default: no separator
    >>> kakasi.setMode("C", True) # capitalize, default: no capitalize
    >>> conv = kakasi.getConverter()
    >>> result = conv.do(text)
    >>> print(result)
    kana Kanji Majiri Bun

Tokenize Japanese text (split by word boundaries), equivalent to kakasi's wakati gaki option:

.. code-block:: pycon

    >>> wakati = pykakasi.wakati()
    >>> conv = wakati.getConverter()
    >>> result = conv.do(text)
    >>> print(result)
    かな 漢字 交じり 文

Add furigana_ (pronunciation aid) in rōmaji to text:

.. code-block:: pycon

    >>> kakasi = pykakasi.kakasi()
    >>> kakasi.setMode("J","aF") # Japanese to furigana
    >>> kakasi.setMode("H","aF") # Hiragana to furigana
    >>> conv = kakasi.getConverter()
    >>> result = conv.do(text)
    >>> print(result)
    かな[kana] 漢字[Kanji] 交じり[Majiri] 文[Bun]

Input mode values: "J" (Japanese: kanji, hiragana and katakana), "H" (hiragana), "K" (katakana).

Output mode values: "H" (hiragana), "K" (katakana), "a" (alphabet / rōmaji), "aF" (furigana in rōmaji).

There are other setMode switches which control output:

  • "r": Romanisation table: Hepburn_ (default), Kunrei_ or Passport
  • "s": Separator: False adds no spaces between words (default), True adds spaces between words
  • "C": Capitalize: False adds no capital letters (default), True makes each word start with a capital letter
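
Because ``setMode`` takes plain strings and booleans, a mistyped option is easy to miss. The sketch below checks a set of options against the values listed above before applying them. ``apply_modes`` and ``RecordingKakasi`` are hypothetical names introduced for illustration. The allowed values simply mirror the lists in this section and may not cover every mode the library accepts.

.. code-block:: python

    INPUT_MODES = {"J", "H", "K"}
    OUTPUT_MODES = {"H", "K", "a", "aF"}
    ROMAN_TABLES = {"Hepburn", "Kunrei", "Passport"}
    FLAGS = {"s", "C"}


    def apply_modes(kakasi, modes):
        """Validate every option, then pass each one to kakasi.setMode()."""
        for key, value in modes.items():
            if key in INPUT_MODES:
                ok = value in OUTPUT_MODES
            elif key == "r":
                ok = value in ROMAN_TABLES
            elif key in FLAGS:
                ok = isinstance(value, bool)
            else:
                ok = False
            if not ok:
                raise ValueError("unsupported mode {!r}: {!r}".format(key, value))
        # Apply only after all entries have passed the check.
        for key, value in modes.items():
            kakasi.setMode(key, value)
        return kakasi


    class RecordingKakasi:
        """Stand-in for pykakasi.kakasi() so the sketch runs without the library."""

        def __init__(self):
            self.calls = []

        def setMode(self, key, value):
            self.calls.append((key, value))


    stub = apply_modes(RecordingKakasi(), {"J": "a", "r": "Kunrei", "s": True})
    print(stub.calls)  # [('J', 'a'), ('r', 'Kunrei'), ('s', True)]

In real use, a ``pykakasi.kakasi()`` instance would be passed in place of the stand-in, followed by ``getConverter()`` as in the examples above.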

.. _furigana: https://en.wikipedia.org/wiki/Furigana
.. _Hepburn: https://en.wikipedia.org/wiki/Hepburn_romanization
.. _Kunrei: https://en.wikipedia.org/wiki/Kunrei-shiki_romanization

Copyright and License
=====================

Copyright 2010-2020 Hiroshi Miura

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
