polm / Fugashi

Licence: other
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

Projects that are alternatives of or similar to Fugashi

Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (+343.2%)
Mutual labels:  japanese, tokenizer
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (+103.2%)
Mutual labels:  japanese, tokenizer
Epub Manga Creator
A web GUI for creating Japanese EPUB manga
Stars: ✭ 90 (-28%)
Mutual labels:  japanese
Ichiran
Linguistic tools for texts in the Japanese language
Stars: ✭ 120 (-4%)
Mutual labels:  japanese
Kadot
Kadot, the unsupervised natural language processing library.
Stars: ✭ 108 (-13.6%)
Mutual labels:  tokenizer
The Tab Of Words
A minimal Chrome / Firefox extension to help you learn Japanese words in each new tab.
Stars: ✭ 94 (-24.8%)
Mutual labels:  japanese
Posuto
🏣📮〠 Japanese postal code data.
Stars: ✭ 109 (-12.8%)
Mutual labels:  japanese
Somajo
A tokenizer and sentence splitter for German and English web and social media texts.
Stars: ✭ 85 (-32%)
Mutual labels:  tokenizer
Chevrotain
Parser Building Toolkit for JavaScript
Stars: ✭ 1,795 (+1336%)
Mutual labels:  tokenizer
Languagepod101 Scraper
Python scraper for Language Pods such as Japanesepod101.com 👹 🗾 🍣 Compatible with Japanese, Chinese, French, German, Italian, Korean, Portuguese, Russian, Spanish and many more! ✨
Stars: ✭ 104 (-16.8%)
Mutual labels:  japanese
Japanesetokenizers
Aims to make JapaneseTokenizer as easy to use as possible
Stars: ✭ 120 (-4%)
Mutual labels:  tokenizer
Megamark
😻 Markdown with easy tokenization, a fast highlighter, and a lean HTML sanitizer
Stars: ✭ 100 (-20%)
Mutual labels:  tokenizer
Toiro
A comparison tool of Japanese tokenizers
Stars: ✭ 95 (-24%)
Mutual labels:  japanese
Textlint Rule Preset Jtf Style
The JTF Japanese Standard Style Guide (JTF日本語標準スタイルガイド) for textlint.
Stars: ✭ 112 (-10.4%)
Mutual labels:  japanese
Jconv
Pure-JavaScript converter for Japanese character encodings.
Stars: ✭ 91 (-27.2%)
Mutual labels:  japanese
Gse
Go efficient multilingual NLP and text segmentation; supports English, Chinese, Japanese, and other languages
Stars: ✭ 1,695 (+1256%)
Mutual labels:  japanese
Cheatsheet Of Ui With Fuzzy Behaviors
A cheat sheet of user interfaces with ambiguous behavior or specifications
Stars: ✭ 89 (-28.8%)
Mutual labels:  japanese
Source Han Code Jp
Source Han Code JP | 源ノ角ゴシック Code
Stars: ✭ 1,362 (+989.6%)
Mutual labels:  japanese
Topokanji
Topologically ordered lists of kanji for effective learning
Stars: ✭ 108 (-13.6%)
Mutual labels:  japanese
Cutlet
Japanese to romaji converter in Python
Stars: ✭ 124 (-0.8%)
Mutual labels:  japanese

fugashi

fugashi by Irasutoya

fugashi is a Cython wrapper for MeCab, a Japanese tokenizer and morphological analysis tool. Wheels are provided for Linux, macOS, and 64-bit Windows, and UniDic is easy to install.

issueを英語で書く必要はありません。(You don't need to write issues in English.)

Check out the interactive demo, see the blog post for background on why fugashi exists and some of the design decisions, or see this guide for a basic introduction to Japanese tokenization.

If you are on an unsupported platform (like PowerPC), you'll need to install MeCab first. Installing MeCab from source is recommended.
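As a sketch of that from-source route (assuming a Unix-like system with standard build tools; paths and options may need adjusting for your platform), the usual MeCab build looks like this:

```shell
# Build and install MeCab from source (taku910/mecab on GitHub).
git clone --depth 1 https://github.com/taku910/mecab.git
cd mecab/mecab
./configure --enable-utf8-only
make
sudo make install
sudo ldconfig  # refresh the linker cache on Linux

# Then build fugashi from source so it links against the system MeCab.
pip install --no-binary :all: fugashi
```

The fugashi source build locates MeCab via `mecab-config`, so make sure that script is on your `PATH` after installing.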

Usage

from fugashi import Tagger

tagger = Tagger('-Owakati')
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'
for word in tagger(text):
    print(word, word.feature.lemma, word.pos, sep='\t')
    # "feature" is the Unidic feature data as a named tuple

Installing a Dictionary

fugashi requires a dictionary. UniDic is recommended, and two easy-to-install versions are provided.

  • unidic-lite, a 2013 version of UniDic that's relatively small
  • unidic, the latest UniDic 2.3.0, which is 1GB on disk and requires a separate download step

If you just want to make sure things work you can start with unidic-lite, but for more serious processing unidic is recommended. For production use you'll generally want to generate your own dictionary too; for details see the MeCab documentation.

To get either of these dictionaries, install the corresponding pip extra as shown below:

pip install fugashi[unidic-lite]

# The full version of UniDic requires a separate download step
pip install fugashi[unidic]
python -m unidic download

For more information on the different MeCab dictionaries available, see this article.

Dictionary Use

fugashi is written with the assumption that you'll use UniDic to process Japanese, but it supports arbitrary dictionaries.

If you're using a dictionary besides Unidic you can use the GenericTagger like this:

from fugashi import GenericTagger
tagger = GenericTagger()

# parse can be used as normal
text = 'something'
tagger.parse(text)
# features from the dictionary can be accessed by field numbers
for word in tagger(text):
    print(word.surface, word.feature[0])

You can also create a dictionary wrapper to get feature information as a named tuple.

from fugashi import GenericTagger, create_feature_wrapper
CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
tagger = GenericTagger(wrapper=CustomFeatures)
text = 'something'
for word in tagger.parseToNodeList(text):
    print(word.surface, word.feature.alpha)

Citation

If you use fugashi in research, please cite this paper. You can read it at the ACL Anthology or on arXiv.

@inproceedings{mccann-2020-fugashi,
    title = "fugashi, a Tool for Tokenizing {J}apanese in Python",
    author = "McCann, Paul",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.7",
    pages = "44--51",
    abstract = "Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.",
}

Alternatives

If you have a problem with fugashi feel free to open an issue. However, there are some cases where it might be better to use a different library.

  • If you don't want to deal with installing MeCab at all, try SudachiPy.
  • If you need to work with Korean, try KoNLPy.

License and Copyright Notice

fugashi is released under the terms of the MIT license. Please copy it far and wide.

fugashi is a wrapper for MeCab, and fugashi wheels include MeCab binaries. MeCab is copyrighted free software by Taku Kudo <[email protected]> and Nippon Telegraph and Telephone Corporation, and is redistributed under the BSD License.
