
taptap / pinyin-plus

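License: Apache-2.0
A project for converting Simplified and Traditional Chinese characters to pinyin that resolves polyphonic characters. A pinyin tokenization tool for Elasticsearch and Solr.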

Programming Languages

java

Projects that are alternatives of or similar to pinyin-plus

Elasticsearch Analysis Pinyin
This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin.
Stars: ✭ 2,215 (+2137.37%)
Mutual labels:  pinyin, pinyin-analysis
pinyin
An R package for converting Chinese characters into pinyin
Stars: ✭ 45 (-54.55%)
Mutual labels:  pinyin
Bigcidian
Pronunciation lexicon covering both English and Chinese languages for Automatic Speech Recognition.
Stars: ✭ 99 (+0%)
Mutual labels:  pinyin
Pinyinlite
Lightweight and Lightning-Fast ⚡️ Pinyin Library for JavaScript
Stars: ✭ 218 (+120.2%)
Mutual labels:  pinyin
React Native Search List
A searchable ListView which supports Chinese PinYin and alphabetical index.
Stars: ✭ 152 (+53.54%)
Mutual labels:  pinyin
Cnchar
A small, easy-to-use, full-featured JavaScript library for Chinese characters: Simplified, Traditional, pinyin, and strokes
Stars: ✭ 251 (+153.54%)
Mutual labels:  pinyin
Hallelujahim
hallelujahIM(哈利路亚 英文输入法) is an intelligent English input method with auto-suggestions and spell check features, Mac only.
Stars: ✭ 1,334 (+1247.47%)
Mutual labels:  pinyin
chinese-rhymer
A lightweight Chinese rhyming tool with a simple command-line interface that finds single, double, triple, and quadruple rhymes in seconds. Search for rhymes for Chinese words with 1, 2, 3, or 4 characters. Released on PyPI; current version 0.2.6.
Stars: ✭ 72 (-27.27%)
Mutual labels:  pinyin
langx-java
Java tools, helper, common utilities. A replacement of guava, apache-commons, hutool
Stars: ✭ 50 (-49.49%)
Mutual labels:  pinyin
Somiao Pinyin
Somiao Pinyin (搜喵拼音输入法): Train your own Chinese input method with a Seq2seq model
Stars: ✭ 209 (+111.11%)
Mutual labels:  pinyin
Chinese To Pinyin
A library that converts Chinese into pinyin
Stars: ✭ 199 (+101.01%)
Mutual labels:  pinyin
G2pc
g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
Stars: ✭ 155 (+56.57%)
Mutual labels:  pinyin
hanzi-pinyin-font
Chinese font displaying Hanzi (汉字) characters together with their transliteration/pronunciation (Pīnyīn).
Stars: ✭ 79 (-20.2%)
Mutual labels:  pinyin
Gpy
A Go tool for converting Chinese characters to pinyin
Stars: ✭ 136 (+37.37%)
Mutual labels:  pinyin
mahjong
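Open-source Chinese word segmentation toolkit: Chinese segmentation Web API, Chinese segmentation for Lucene, and mixed Chinese-English segmentation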
Stars: ✭ 40 (-59.6%)
Mutual labels:  pinyin
Cn sort
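Chinese sorting: quickly sort Simplified Chinese phrases by pinyin or stroke order (at the scale of millions of entries; entries may mix Chinese and English and may contain polyphonic characters). If you find it helpful, a star is appreciated.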
Stars: ✭ 102 (+3.03%)
Mutual labels:  pinyin
Lpinyin
Chinese-character-to-pinyin conversion for Dart: Flutter, web, and other platforms
Stars: ✭ 239 (+141.41%)
Mutual labels:  pinyin
pinyin4js
An open-source JavaScript library for converting Chinese to pinyin. Stars welcome :P
Stars: ✭ 153 (+54.55%)
Mutual labels:  pinyin
pinyin data
🐼 Easy to use and portable pronunciation data for Hanzi characters.
Stars: ✭ 13 (-86.87%)
Mutual labels:  pinyin
jyut-dict
A free, open-source, offline Cantonese Dictionary for Windows, Mac, and Linux. Qt, SQLite. C++ and Python.
Stars: ✭ 67 (-32.32%)
Mutual labels:  pinyin

pinyin-plus

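A library for converting Chinese characters to pinyin, with the following features:

  • Pinyin data is based on the CC-CEDICT and kaifangcidian open-source dictionaries
  • The word segmentation engine is initialized from the pinyin dictionary data, so segmentation is accurate and polyphonic characters are resolved
  • Supports Traditional Chinese characters
  • Supports custom dictionaries, in the same format as the CC-CEDICT dictionary
  • Simple API with two modes: normal mode and index mode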

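Use cases

Converting Chinese characters to pinyin is commonly used to build pinyin indexes for search engines. There are generally two ways to implement this. One is to use an analysis plugin with built-in pinyin support, which creates the pinyin index for you automatically. The other is to convert the Chinese text to a pinyin string yourself and tokenize it on spaces, which lets you customize the index. Either way, word segmentation and pinyin conversion are unavoidable. The distinguishing feature of pinyin-plus is that the segmentation dictionary and the pinyin dictionary are one and the same, so accuracy on polyphonic words is particularly high. The dictionary also keeps the format of the open-source dictionaries, so it can easily be updated on a schedule. An extension point for custom dictionaries is provided as well, and custom entries take the highest priority.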
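Why a shared dictionary helps with polyphonic characters can be shown with a minimal sketch. This is not the pinyin-plus implementation: the class `SharedDictSketch`, its tiny hard-coded dictionary, the toneless output, and the forward-maximum-matching strategy are all assumptions chosen for illustration. The same map drives both segmentation (longest dictionary word wins) and pronunciation, so 行 reads "hang" inside 银行 but "xing" inside 行走.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SharedDictSketch {
    // Hypothetical mini dictionary: word -> space-separated pinyin syllables.
    static final Map<String, String> DICT = new HashMap<>();
    static {
        DICT.put("银行", "yin hang");
        DICT.put("行走", "xing zou");
        DICT.put("行", "xing");
        DICT.put("去", "qu");
    }
    // Longest word in DICT, in chars (BMP characters only in this sketch).
    static final int MAX_WORD_LEN = 2;

    // Forward maximum matching: at each position take the longest dictionary word.
    static List<String> toPinyinWords(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            for (int len = Math.min(MAX_WORD_LEN, text.length() - i); len >= 1; len--) {
                String pinyin = DICT.get(text.substring(i, i + len));
                if (pinyin != null) {
                    out.add(pinyin);
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                // Unknown character: pass it through unchanged.
                out.add(text.substring(i, i + 1));
                matched = 1;
            }
            i += matched;
        }
        return out;
    }

    // Every syllable separated by a space.
    static String perSyllable(String text) {
        return String.join(" ", toPinyinWords(text));
    }

    // Syllables of one word joined together, words separated by a space.
    static String perWord(String text) {
        List<String> words = new ArrayList<>();
        for (String w : toPinyinWords(text)) {
            words.add(w.replace(" ", ""));
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(perSyllable("去银行")); // qu yin hang
        System.out.println(perWord("去银行"));     // qu yinhang
        System.out.println(perSyllable("行走"));   // xing zou
    }
}
```

The two output shapes loosely mirror the library's normal and index modes; which one suits an index depends on whether queries should match individual syllables or whole words.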

Performance

# Load-test data for pinyin-plus; test phrase: 率土之滨
kl@kldeMacBook-Pro-6 arthas % wrk -t16 -c100 -d15s --latency http://localhost:8080/%E7%8E%87%E5%9C%9F%E4%B9%8B%E6%BB%A8
Running 15s test @ http://localhost:8080/%E7%8E%87%E5%9C%9F%E4%B9%8B%E6%BB%A8
  16 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   733.97us  138.45us  16.40ms   96.12%
    Req/Sec     8.19k   293.50     8.90k    87.83%
  Latency Distribution
     50%  718.00us
     75%  739.00us
     90%  785.00us
     99%    1.02ms
  1970023 requests in 15.10s, 266.78MB read
Requests/sec: 130469.56
Transfer/sec:     17.67MB
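The path segment in the benchmark URL is simply the test phrase percent-encoded as UTF-8. The sketch below shows how such a URL could be produced and decoded with the JDK (Java 10 or later); the class name `BenchmarkUrlSketch` is hypothetical, and how the benchmarked service maps the path to a conversion call is not stated in the source.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class BenchmarkUrlSketch {
    // Percent-encode a phrase as UTF-8. URLEncoder does form encoding
    // (a space would become '+'), which is harmless for this phrase.
    static String encodePhrase(String phrase) {
        return URLEncoder.encode(phrase, StandardCharsets.UTF_8);
    }

    // Reverse the encoding to recover the original phrase.
    static String decodePhrase(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encodePhrase("率土之滨");
        // Prints http://localhost:8080/%E7%8E%87%E5%9C%9F%E4%B9%8B%E6%BB%A8
        System.out.println("http://localhost:8080/" + encoded);
        System.out.println(decodePhrase(encoded));
    }
}
```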

Adding the dependency

gradle

implementation "com.github.taptap:pinyin-plus:1.0"

maven

        <dependency>
            <groupId>com.github.taptap</groupId>
            <artifactId>pinyin-plus</artifactId>
            <version>1.0</version>
        </dependency>

Usage

    // Normal mode: after conversion to pinyin, individual characters are separated by spaces
    @Test
    void testToPinYin() {
        String pinyin = PinyinPlus.to("率土之滨");
        System.err.println(pinyin);
        Assertions.assertEquals("shuai tu zhi bin", pinyin);
    }

    // Index mode: after conversion to pinyin, words are separated by spaces
    @Test
    void testToPinYin2() {
        String pinyin = PinyinPlus.toIndex("写的射雕英雄传");
        System.err.println(pinyin);
        Assertions.assertEquals("xie de shediaoyingxiongzhuan", pinyin);
    }

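Custom dictionary

In the project's resources directory, add a text file named custom_cedict_ts.u8 and enter data in the following format. Lines beginning with # are comments, for example:

# Custom dictionary
血花 血花 [xue4 hua1] //

The format follows the same style as the open-source CC-CEDICT dictionary. When the same word appears in both dictionaries, the custom entry has the highest priority and overrides the system default.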
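To make the entry format and the override rule concrete, here is a minimal sketch of parsing CC-CEDICT-style lines (`Traditional Simplified [pin1 yin1] /gloss/`) and merging a custom list over a default one. It is not how pinyin-plus loads its dictionaries: the class `CedictLineSketch`, its methods, and the choice to key entries by the simplified form are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CedictLineSketch {
    static final class Entry {
        final String traditional;
        final String simplified;
        final String pinyin;

        Entry(String traditional, String simplified, String pinyin) {
            this.traditional = traditional;
            this.simplified = simplified;
            this.pinyin = pinyin;
        }
    }

    // Traditional, Simplified, [pinyin with tone numbers], /glosses/ (glosses may be empty).
    private static final Pattern LINE =
            Pattern.compile("(\\S+)\\s+(\\S+)\\s+\\[([^\\]]*)\\]\\s*/(.*)/");

    // Returns empty for blank lines, '#' comments, and lines that do not match the format.
    static Optional<Entry> parse(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return Optional.empty();
        }
        Matcher m = LINE.matcher(trimmed);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Entry(m.group(1), m.group(2), m.group(3)));
    }

    // Loads defaults first, then custom lines, so a custom entry replaces a default one.
    static Map<String, String> merge(Iterable<String> defaults, Iterable<String> custom) {
        Map<String, String> pinyinBySimplified = new LinkedHashMap<>();
        for (String line : defaults) {
            parse(line).ifPresent(e -> pinyinBySimplified.put(e.simplified, e.pinyin));
        }
        for (String line : custom) {
            parse(line).ifPresent(e -> pinyinBySimplified.put(e.simplified, e.pinyin));
        }
        return pinyinBySimplified;
    }

    public static void main(String[] args) {
        parse("血花 血花 [xue4 hua1] //")
                .ifPresent(e -> System.out.println(e.simplified + " -> " + e.pinyin));
    }
}
```

Loading the custom lines last is what gives them priority in this sketch: a later `put` for the same key replaces the earlier value.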

Acknowledgements
