
taishi-i / Toiro

Licence: apache-2.0
A comparison tool of Japanese tokenizers

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to, or similar to, Toiro

Pythainlp
Thai Natural Language Processing in Python.
Stars: ✭ 582 (+512.63%)
Mutual labels:  natural-language-processing, nlp-library, word-segmentation
Nagisa
A Japanese tokenizer based on recurrent neural networks
Stars: ✭ 260 (+173.68%)
Mutual labels:  japanese, nlp-library, word-segmentation
Vncorenlp
A Vietnamese natural language processing toolkit (NAACL 2018)
Stars: ✭ 354 (+272.63%)
Mutual labels:  natural-language-processing, word-segmentation
Pynlpl
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build a simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
Stars: ✭ 426 (+348.42%)
Mutual labels:  natural-language-processing, nlp-library
Sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
Stars: ✭ 5,540 (+5731.58%)
Mutual labels:  natural-language-processing, word-segmentation
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (+167.37%)
Mutual labels:  japanese, word-segmentation
Chatbot ner
chatbot_ner: Named Entity Recognition for chatbots.
Stars: ✭ 273 (+187.37%)
Mutual labels:  natural-language-processing, nlp-library
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+23034.74%)
Mutual labels:  natural-language-processing, nlp-library
Nlp profiler
A simple NLP library that allows profiling of datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (+90.53%)
Mutual labels:  natural-language-processing, nlp-library
Youtokentome
Unsupervised text tokenizer focused on computational efficiency
Stars: ✭ 728 (+666.32%)
Mutual labels:  natural-language-processing, word-segmentation
Kuromoji
Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Stars: ✭ 745 (+684.21%)
Mutual labels:  japanese, nlp-library
Pykakasi
NLP: Convert Japanese kana-kanji sentences into kana-roman with a simple algorithm.
Stars: ✭ 238 (+150.53%)
Mutual labels:  japanese, natural-language-processing
Nagisa Tutorial Pycon2019
Code for the PyCon JP 2019 talk "Python による日本語自然言語処理 〜系列ラベリングによる実世界テキスト分析〜" (Japanese natural language processing with Python: real-world text analysis with sequence labeling)
Stars: ✭ 46 (-51.58%)
Mutual labels:  japanese, natural-language-processing
Lingua
👄 The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike
Stars: ✭ 341 (+258.95%)
Mutual labels:  natural-language-processing, nlp-library
Konoha
🌿 An easy-to-use Japanese text processing tool, which makes it possible to switch tokenizers with minimal code changes.
Stars: ✭ 130 (+36.84%)
Mutual labels:  japanese, natural-language-processing
Ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two big corpora (English Wikipedia and 330 million English tweets).
Stars: ✭ 433 (+355.79%)
Mutual labels:  nlp-library, word-segmentation
Awesome Pytorch List
A comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc.
Stars: ✭ 12,475 (+13031.58%)
Mutual labels:  natural-language-processing, nlp-library
Fastnlp
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
Stars: ✭ 2,441 (+2469.47%)
Mutual labels:  natural-language-processing, nlp-library
Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (+483.16%)
Mutual labels:  japanese, nlp-library
Underthesea
Underthesea - Vietnamese NLP Toolkit
Stars: ✭ 823 (+766.32%)
Mutual labels:  natural-language-processing, nlp-library

toiro


Toiro is a comparison tool of Japanese tokenizers.

  • Compare the processing speed of tokenizers
  • Compare the words segmented in tokenizers
  • Compare the performance of tokenizers by benchmarking application tasks (e.g., text classification)

It also provides useful functions for natural language processing in Japanese.

  • Data downloader for Japanese text corpora
  • Preprocessor of these corpora
  • Text classifier for Japanese text (e.g., SVM, BERT)
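The classifier is not exercised in the examples below, so here is a minimal sketch of a corpus-to-classifier pipeline. Note the hedges: it trains scikit-learn's LinearSVC on character n-grams rather than using toiro's built-in classifier, and it assumes the column layout used in the Getting started example (label in column 0, text in column 1).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

from toiro import datadownloader

# Download the livedoor news corpus and load it as pandas.DataFrames
datadownloader.download_corpus("livedoor_news_corpus")
train_df, dev_df, test_df = datadownloader.load_corpus("livedoor_news_corpus")

# Assumption: column 0 holds the label, column 1 the raw text.
# Character n-grams sidestep word tokenization, since Japanese
# text has no whitespace between words.
model = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
                      LinearSVC())
model.fit(train_df[1], train_df[0])
print(model.score(test_df[1], test_df[0]))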

Installation

Python 3.6+ is required. You can install toiro with the following command. Janome is included in the default installation.

pip install toiro

Adding a tokenizer to toiro

If you want to add a tokenizer to toiro, please install it individually. This is an example of adding SudachiPy and nagisa to toiro.

pip install sudachipy sudachidict_core
pip install nagisa
How to install other tokenizers

mecab-python3

pip install mecab-python3==0.996.5

GiNZA

pip install spacy ginza

spaCy

pip install spacy[ja]

KyTea

You need to install KyTea first; see the KyTea documentation for instructions.

pip install kytea

Juman++ v2

You need to install Juman++ v2 first; see the Juman++ documentation for instructions.

pip install pyknp

SentencePiece

pip install sentencepiece

fugashi-ipadic

pip install fugashi ipadic

fugashi-unidic

pip install fugashi unidic-lite

tinysegmenter

pip install tinysegmenter3

If you want to install all the tokenizers at once, please use the following command.

pip install toiro[all_tokenizers]

Getting started

You can check the available tokenizers in your Python environment.

from toiro import tokenizers

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)

Toiro supports 12 different Japanese tokenizers. This is example output after installing SudachiPy and nagisa alongside the default Janome.

{'nagisa': {'is_available': True, 'version': '0.2.7'},
 'janome': {'is_available': True, 'version': '0.3.10'},
 'mecab-python3': {'is_available': False, 'version': False},
 'sudachipy': {'is_available': True, 'version': '0.4.9'},
 'spacy': {'is_available': False, 'version': False},
 'ginza': {'is_available': False, 'version': False},
 'kytea': {'is_available': False, 'version': False},
 'jumanpp': {'is_available': False, 'version': False},
 'sentencepiece': {'is_available': False, 'version': False},
 'fugashi-ipadic': {'is_available': False, 'version': False},
 'fugashi-unidic': {'is_available': False, 'version': False},
 'tinysegmenter': {'is_available': False, 'version': False}}
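
Since this is a plain dictionary, you can filter it down to the tokenizers that are actually installed, for example:

from toiro import tokenizers

available_tokenizers = tokenizers.available_tokenizers()

# Keep only the tokenizers with is_available set to True
installed = [name for name, info in available_tokenizers.items()
             if info["is_available"]]
print(installed)
#=> ['nagisa', 'janome', 'sudachipy']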

Download the livedoor news corpus and compare the processing speed of tokenizers.

from toiro import tokenizers
from toiro import datadownloader

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']

# Download the livedoor news corpus and load it as pandas.DataFrame
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
texts = train_df[1]

# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
{'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
  'arch': 'X86_64',
  'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
  'count': 8},
 'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
 'janome': {'elapsed_time': 9.114670515060425},
 'nagisa': {'elapsed_time': 15.873093605041504},
 'sudachipy': {'elapsed_time': 9.05256724357605}}

# Compare the words segmented in tokenizers
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=>        janome: 都庁|所在地|は|新宿|区|。
#=>        nagisa: 都庁|所在|地|は|新宿|区|。
#=>     sudachipy: 都庁|所在地|は|新宿区|。
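
Because the report is a plain Python dictionary, you can post-process it however you like; as a small sketch based on the numbers shown above, here is a sentences-per-second figure for each tokenizer:

# Derive throughput for each tokenizer from the report above
n = report["data"]["number_of_sentences"]
for name in ["janome", "nagisa", "sudachipy"]:
    elapsed = report[name]["elapsed_time"]
    print(f"{name}: {n / elapsed:.1f} sentences/sec")
#=> janome: 647.3 sentences/sec
#=> nagisa: 371.7 sentences/sec
#=> sudachipy: 651.7 sentences/sec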

Run toiro in Docker

You can use all the tokenizers by running a Docker container pulled from Docker Hub.

docker run --rm -it taishii/toiro /bin/bash
How to run the Python interpreter in the Docker container

Run the Python interpreter.

root@<container-id>:/workspace# python3

Compare the words segmented in tokenizers

>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
 mecab-python3: 都庁|所在地|は|新宿|区|。
        janome: 都庁|所在地|は|新宿|区|。
        nagisa: 都庁|所在|地|は|新宿|区|。
     sudachipy: 都庁|所在地|は|新宿区|。
         spacy: 都庁|所在|地|は|新宿|区|。
         ginza: 都庁|所在地|は|新宿区|。
         kytea: 都庁|所在|地|は|新宿|区|。
       jumanpp: 都庁|所在|地|は|新宿|区|。
 sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
 tinysegmenter: 都庁所|在地|は|新宿|区|。

Get more information about toiro

The slides at PyCon JP 2020

Tutorials in Japanese

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
