lucaskjaero / PyCasia

License: Apache-2.0
A Python library to work with the CASIA Chinese handwriting database.

PyCasia

Open source library to work with the CASIA Chinese Handwriting dataset.

Installation

PyCasia is on the Python Package Index, so you can install it with pip install pycasia. It requires Python 3.5 or newer.

Using the library

The pycasia.CASIA object is the interface for all the data. You can use it directly to explore the dataset, or use it as a base class for more complicated uses.

Datasets

Datasets are directories full of data files from a given distribution. They come in isolated character (GNT) or handwritten text (DGR) files. Four are automatically downloaded by the library, but you can add more.

These datasets are downloaded from the publicly available data hosted on the project webpage. You should expect a long download during your first run.

Included datasets

HWDB1.1trn_gnt_P1 and HWDB1.1trn_gnt_P2 are two parts of the publicly available training set. They were split for easier downloading. HWDB1.1tst_gnt is the testing portion of that set. competition-gnt is the dataset from some Chinese handwriting competitions.

Adding datasets

To add other datasets, add a new dictionary to the datasets variable of the CASIA object. You will need to include the download URL and the dataset type, either GNT or DGR. If your data isn't publicly available, make sure there is a folder named after the dataset in the base dataset directory; the download code will then not be called for it.

Example:

CASIA.datasets["competition-gnt"] = {
    "url": "http://www.nlpr.ia.ac.cn/databases/Download/competition/competition-gnt.zip",
    "type": "GNT"
}
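The "folder exists, so skip the download" rule described above can be sketched as follows. This is an illustration only, not PyCasia's actual code: the function name datasets_needing_download, the my-private-set entry, and its empty URL are all hypothetical.

```python
import os
import tempfile

# Registry shaped like the README's example entry.
datasets = {
    "competition-gnt": {
        "url": "http://www.nlpr.ia.ac.cn/databases/Download/competition/competition-gnt.zip",
        "type": "GNT",
    },
}


def datasets_needing_download(registry, base_dir):
    """Return the names in `registry` that have no folder under `base_dir`."""
    return sorted(
        name for name in registry
        if not os.path.isdir(os.path.join(base_dir, name))
    )


# Demo in a throwaway directory: one dataset is already present on disk.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "my-private-set"))
    registry = dict(datasets)
    # Hypothetical private dataset; the empty URL is a placeholder (assumption).
    registry["my-private-set"] = {"url": "", "type": "GNT"}
    pending = datasets_needing_download(registry, base)

print(pending)
```

Only competition-gnt is reported as pending, because the folder for my-private-set already exists.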

Getting the data

You can download all datasets using the get_all_datasets() method, or just specific datasets using the get_dataset(dataset) method.

Dataset Location

On OS X and Linux, datasets are stored in ~/CASIA_data. On Windows, they are saved in the CASIA_data folder in your home directory. If you want to save the data in a different location, specify a path when you create the CASIA object, e.g. dataset = CASIA(path="/CASIA_data")

Using the data

You can load all of the character image (GNT) data using the load_character_images() method, or a particular dataset using the load_dataset(dataset) method. If you want to read the data on a file-by-file basis, use the static CASIA.load_gnt_file() method.

These are generators yielding data as (image, label) pairs. The images are Pillow images (PIL.Image.Image objects).
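To give a feel for what reading a GNT file involves, here is a minimal standalone sketch that does not use PyCasia. It assumes the commonly described GNT record layout: a 4-byte little-endian record length, a 2-byte GB-encoded label, 2-byte width and height, then width × height grayscale bytes. The function name iter_gnt_records is hypothetical, and the library's own reader may differ in its details.

```python
import io
import struct


def iter_gnt_records(stream):
    """Yield (label, width, height, pixels) tuples from a binary GNT stream.

    Assumes each record is: uint32 length, 2-byte GB label,
    uint16 width, uint16 height, then width*height grayscale bytes.
    """
    while True:
        header = stream.read(10)
        if len(header) < 10:
            break  # end of stream, or a truncated trailing header
        # Record length field; unpacked for completeness, unused below.
        (sample_size,) = struct.unpack("<I", header[:4])
        # Decode the label bytes; strip NUL padding from single-byte labels.
        label = header[4:6].decode("gb2312", errors="replace").strip("\x00")
        width, height = struct.unpack("<HH", header[6:10])
        pixels = stream.read(width * height)
        if len(pixels) < width * height:
            break  # truncated bitmap
        yield label, width, height, pixels


# Demo on a synthetic two-record stream (no real CASIA data involved).
record = (
    struct.pack("<I", 10 + 6)
    + "啊".encode("gb2312")
    + struct.pack("<HH", 2, 3)
    + bytes(range(6))
)
samples = list(iter_gnt_records(io.BytesIO(record * 2)))
print(len(samples), samples[0][:3])
```

If Pillow is installed, the pixel bytes can be turned into an image with Image.frombytes("L", (width, height), pixels), assuming the bitmap is stored row by row.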

Getting raw data

You may want to explore the data by yourself. You can get the data as JPEGs by calling the get_raw() function. You can then inspect the data at your leisure.

Building your own interfaces on top of PyCasia

You can build your own class to implement more complicated usage of the dataset. Just inherit from CASIA.
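As one example of the kind of logic such a class might add, the sketch below caps the number of samples kept per character. It is written as a standalone generator over any iterable of (image, label) pairs, so it makes no assumptions about the internals of CASIA; the name limit_per_label is hypothetical, and the strings stand in for real images.

```python
from collections import Counter


def limit_per_label(pairs, limit):
    """Yield (image, label) pairs, keeping at most `limit` per label."""
    seen = Counter()
    for image, label in pairs:
        if seen[label] < limit:
            seen[label] += 1
            yield image, label


# Stand-in data: strings instead of Pillow images.
fake_pairs = [("img0", "一"), ("img1", "一"), ("img2", "二"), ("img3", "一")]
kept = list(limit_per_label(fake_pairs, 2))
print(kept)
```

Because it is a generator, it can be chained onto the output of a loader without holding the whole dataset in memory.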

Current status:

Early release. Features may change. Can open individual character images (GNT files) but not sentences. So far, no plans to develop readers to use DGR files or online datasets. Pull requests welcome.

Current Issues

Download issues

The datasets are hosted in mainland China and are often difficult to download from other countries, because the connection gets reset. get_dataset attempts the download five times, but sometimes that isn't enough. You can try again, or download the data manually. wget has been effective for manual downloads.
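If you script a manual download yourself, a retry loop along these lines may help. This is a sketch, not PyCasia's implementation: the function name, the growing delay between attempts, and the injectable fetch and sleep parameters are all assumptions made for illustration and testing.

```python
import time
import urllib.request


def download_with_retries(url, dest, attempts=5, delay=1.0,
                          fetch=urllib.request.urlretrieve, sleep=time.sleep):
    """Try to fetch `url` into `dest`; return the attempt number that succeeded."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            fetch(url, dest)
            return attempt
        except OSError as error:
            # URLError and ConnectionResetError are both OSError subclasses.
            last_error = error
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer each time
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error


# Demo with a stub that fails twice, so no network access is needed.
calls = []


def flaky_fetch(url, dest):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionResetError("connection reset by peer")


succeeded_on = download_with_retries(
    "http://example.invalid/data.zip", "data.zip",
    fetch=flaky_fetch, sleep=lambda seconds: None,
)
print(succeeded_on)
```

For a real download, call it with just the URL and destination path; the defaults use urllib.request.urlretrieve and time.sleep.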

Limited dataset

While useful for many applications, the publicly available data is only a fraction of the total set. If you need more, you should fill out the application form from the project's maintainers to get the full set.

Copyright issues

The datasets are licensed for research use only; commercial use is not permitted. If you want to publish your data, you should fill out the application form from the project's maintainers. You should not host the data in any form, including in your repository.
