ikegami-yukino / Neologdn

Licence: apache-2.0
Japanese text normalizer for mecab-neologd

Projects that are alternatives of or similar to Neologdn

japanese-pitch-accent-resources
Trying to consolidate japanese phonetic, and in particular pitch accent resources into one list
Stars: ✭ 64 (-65.41%)
Mutual labels:  japanese-language
Janome
Japanese morphological analysis engine written in pure Python
Stars: ✭ 630 (+240.54%)
Mutual labels:  japanese-language
Japanesetokenizers
aim to use JapaneseTokenizer as easy as possible
Stars: ✭ 120 (-35.14%)
Mutual labels:  japanese-language
rakutenma-python
Rakuten MA (Python version)
Stars: ✭ 15 (-91.89%)
Mutual labels:  japanese-language
Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (+199.46%)
Mutual labels:  japanese-language
The Tab Of Words
A minimal Chrome / Firefox extension to help you learn Japanese words in each new tab.
Stars: ✭ 94 (-49.19%)
Mutual labels:  japanese-language
KanjiRecognitionDictionary
Perfect for those who forgets kanji pronunciation
Stars: ✭ 14 (-92.43%)
Mutual labels:  japanese-language
Negapoji
Japanese negative positive classification.日本語文書のネガポジを判定。
Stars: ✭ 148 (-20%)
Mutual labels:  japanese-language
Awesome Japanese
Awesome Japanese learning resource
Stars: ✭ 563 (+204.32%)
Mutual labels:  japanese-language
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+830.81%)
Mutual labels:  japanese-language
unofficial-jisho-api
Encapsulates the official Jisho.org API and also provides kanji, example, and stroke diagram search.
Stars: ✭ 88 (-52.43%)
Mutual labels:  japanese-language
Yomichan
Japanese pop-up dictionary extension for Chrome and Firefox.
Stars: ✭ 464 (+150.81%)
Mutual labels:  japanese-language
Languagepod101 Scraper
Python scraper for Language Pods such as Japanesepod101.com 👹 🗾 🍣 Compatible with Japanese, Chinese, French, German, Italian, Korean, Portuguese, Russian, Spanish and many more! ✨
Stars: ✭ 104 (-43.78%)
Mutual labels:  japanese-language
ebe-dataset
Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
Stars: ✭ 16 (-91.35%)
Mutual labels:  japanese-language
Ichiran
Linguistic tools for texts in Japanese language
Stars: ✭ 120 (-35.14%)
Mutual labels:  japanese-language
madomagiOOP
👨‍💻♐ OOP learning with anime magical girl. (魔法少女で学ぶオブジェクト指向)🧙
Stars: ✭ 17 (-90.81%)
Mutual labels:  japanese-language
Oseti
Dictionary based Sentiment Analysis for Japanese
Stars: ✭ 49 (-73.51%)
Mutual labels:  japanese-language
Jaconv
Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku and Zenkaku
Stars: ✭ 157 (-15.14%)
Mutual labels:  japanese-language
Kanji Koohii
A web application to help Japanese language learners remember the kanji.
Stars: ✭ 137 (-25.95%)
Mutual labels:  japanese-language
Topokanji
Topologically ordered lists of kanji for effective learning
Stars: ✭ 108 (-41.62%)
Mutual labels:  japanese-language

neologdn

|travis| |pyversion| |version| |landscape| |license|

neologdn is a Japanese text normalizer for `mecab-neologd <https://github.com/neologd/mecab-ipadic-neologd>`_.

The normalization follows neologd's rules: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja

Contributions are welcome!

NOTE: Installing this module requires a C++11 compiler.

Installation

::

    $ pip install neologdn

Usage

.. code:: python

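    import neologdn

    # Half-width katakana is converted to full-width.
    neologdn.normalize("ﾊﾝｶｸｶﾅ")
    # => 'ハンカクカナ'
    # Full-width ASCII symbols are converted to half-width.
    neologdn.normalize("全角記号！？＠＃")
    # => '全角記号!?@#'
    # Some full-width punctuation (「・」) is left unchanged.
    neologdn.normalize("全角記号例外「・」")
    # => '全角記号例外「・」'
    # Runs of the long vowel mark (ー) are shortened to one.
    neologdn.normalize("長音短縮ウェーーーーイ")
    # => '長音短縮ウェーイ'
    # Tilde-like characters are removed.
    neologdn.normalize("チルダ削除ウェ~∼∾〜〰～イ")
    # => 'チルダ削除ウェイ'
    # Hyphen-like characters are unified into a single "-".
    neologdn.normalize("いろんなハイフン˗֊‐‑‒–⁃⁻₋−")
    # => 'いろんなハイフン-'
    # Full-width alphanumerics become half-width; spaces next to Japanese text are removed.
    neologdn.normalize("　　　ＰＲＭＬ　　副　読　本　　　")
    # => 'PRML副読本'
    # Spaces between Latin words are kept; leading and trailing spaces are stripped.
    neologdn.normalize(" Natural Language Processing ")
    # => 'Natural Language Processing'
    # `repeat` caps how many times a repeated character or substring may occur in a row.
    neologdn.normalize("かわいいいいいいいいい", repeat=6)
    # => 'かわいいいいいい'
    neologdn.normalize("無駄無駄無駄無駄ァ", repeat=1)
    # => '無駄ァ'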
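
To make a few of the rules shown above concrete, here is a rough standard-library-only sketch. It is not neologdn's implementation: ``sketch_normalize`` is a hypothetical name, Unicode NFKC is used as a stand-in for the width conversions (it is broader than neologd's rules), and space removal and the ``repeat`` option are left out.

.. code:: python

    import re
    import unicodedata

    # Tilde-like characters removed in this sketch: ~ ∼ ∾ 〜 〰 ～
    _TILDES = str.maketrans("", "", "\u007e\u223c\u223e\u301c\u3030\uff5e")
    # Hyphen-like characters; a run of them is folded into one ASCII "-".
    _HYPHENS = re.compile(
        "[\u02d7\u058a\u2010\u2011\u2012\u2013\u2043\u207b\u208b\u2212]+"
    )
    # Runs of the prolonged sound mark (ー).
    _LONG_VOWELS = re.compile("\u30fc+")


    def sketch_normalize(text):
        """Approximate a handful of the rules; an illustration only."""
        # Assumption: NFKC stands in for the half-width/full-width conversions.
        text = unicodedata.normalize("NFKC", text)
        text = text.translate(_TILDES)
        text = _HYPHENS.sub("-", text)
        text = _LONG_VOWELS.sub("\u30fc", text)
        return text.strip()

For real input, prefer ``neologdn.normalize``; the sketch only mirrors the width, tilde, hyphen, and long-vowel examples.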

Benchmark

.. code:: python

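    # Run in IPython/Jupyter: %timeit is an IPython magic, not plain Python.
    # `normalize(func)` is the benchmark driver; its definition is not shown
    # here (see the notebook linked below).

    # Sample code from
    # https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#python-written-by-hideaki-t--overlast
    import normalize_neologd

    %timeit normalize(normalize_neologd.normalize_neologd)
    # => 1 loop, best of 3: 18.3 s per loop


    import neologdn
    %timeit normalize(neologdn.normalize)
    # => 1 loop, best of 3: 9.05 s per loop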

neologdn is about twice as fast as the sample code (9.05 s versus 18.3 s).

Details are described in the following notebook: https://github.com/ikegami-yukino/neologdn/blob/master/benchmark/benchmark.ipynb
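
Outside IPython, a comparable measurement could be sketched with the standard ``timeit`` module. The helper names ``run_over_corpus`` and ``best_time`` are hypothetical and are not taken from the notebook; any callable that maps a string to a string can be passed in.

.. code:: python

    import timeit


    def run_over_corpus(normalizer, lines):
        """Apply ``normalizer`` to every line and return the results."""
        return [normalizer(line) for line in lines]


    def best_time(normalizer, lines, repeat=3):
        """Return the best wall-clock time in seconds over ``repeat`` runs."""
        timer = timeit.Timer(lambda: run_over_corpus(normalizer, lines))
        # number=1: each measurement is one full pass over ``lines``.
        return min(timer.repeat(repeat=repeat, number=1))


    # Usage with a stand-in normalizer; swap in neologdn.normalize if installed.
    sample_lines = ["  example text  "] * 1000
    elapsed = best_time(str.strip, sample_lines)

Taking the minimum over several runs mirrors the "best of 3" figure that ``%timeit`` reports.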

License

Apache Software License.

.. |travis| image:: https://travis-ci.org/ikegami-yukino/neologdn.svg?branch=master
    :target: https://travis-ci.org/ikegami-yukino/neologdn
    :alt: travis-ci.org

.. |version| image:: https://img.shields.io/pypi/v/neologdn.svg
    :target: http://pypi.python.org/pypi/neologdn/
    :alt: latest version

.. |pyversion| image:: https://img.shields.io/pypi/pyversions/neologdn.svg

.. |landscape| image:: https://landscape.io/github/ikegami-yukino/neologdn/master/landscape.svg?style=flat
    :target: https://landscape.io/github/ikegami-yukino/neologdn/master
    :alt: Code Health

.. |license| image:: https://img.shields.io/pypi/l/neologdn.svg
    :target: http://pypi.python.org/pypi/neologdn/
    :alt: license
