Kyubyong / G2pc

Licence: apache-2.0
g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to G2pc

Fancy Nlp
NLP for human. A fast and easy-to-use natural language processing (NLP) toolkit, satisfying your imagination about NLP.
Stars: ✭ 233 (+50.32%)
Mutual labels:  crf, chinese-nlp
Lac
Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance
Stars: ✭ 2,792 (+1701.29%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Nlp4han
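Chinese natural language processing toolkit [sentence splitting / word segmentation / part-of-speech tagging / chunking / syntactic parsing / semantic analysis / NER / n-gram / HMM / pronoun resolution / sentiment analysis / spell checking]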
Stars: ✭ 206 (+32.9%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
mahjong
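Open-source Chinese word segmentation toolkit, Chinese word segmentation Web API, Lucene Chinese word segmentation, mixed Chinese-English word segmentation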
Stars: ✭ 40 (-74.19%)
Mutual labels:  crf, pinyin
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-89.03%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Chinese semantic role labeling
Chinese semantic role labeling based on Bi-LSTM and CRF
Stars: ✭ 60 (-61.29%)
Mutual labels:  crf, chinese-nlp
Jcseg
Jcseg is a lightweight NLP framework developed in Java. It provides CJK and English segmentation based on the MMSEG algorithm, along with keyword extraction, key sentence extraction, and summary extraction based on the TextRank algorithm. Jcseg has a built-in HTTP server and search modules for the latest Lucene, Solr, and Elasticsearch.
Stars: ✭ 754 (+386.45%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Chinesenlp
Datasets and SOTA results for every field of Chinese NLP
Stars: ✭ 1,206 (+678.06%)
Mutual labels:  chinese-nlp, chinese-word-segmentation
Daguan 2019 rank9
datagrand 2019 information extraction competition rank9
Stars: ✭ 121 (-21.94%)
Mutual labels:  crf
Ncrfpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
Stars: ✭ 1,767 (+1040%)
Mutual labels:  crf
Pydensecrf
Python wrapper to Philipp Krähenbühl's dense (fully connected) CRFs with gaussian edge potentials.
Stars: ✭ 1,633 (+953.55%)
Mutual labels:  crf
Ner
Named entity recognition (NER) survey: papers, models, code (BiLSTM-CRF/BERT-CRF), and competition resources, continuously updated
Stars: ✭ 118 (-23.87%)
Mutual labels:  crf
Gpy
Chinese character to pinyin conversion tool written in Go
Stars: ✭ 136 (-12.26%)
Mutual labels:  pinyin
Nlpcc Wordseg Weibo
NLPCC 2016 Weibo word segmentation evaluation project
Stars: ✭ 120 (-22.58%)
Mutual labels:  chinese-word-segmentation
Ner Slot filling
Entity extraction and intent recognition for Chinese natural language (Natural Language Understanding), with a choice of Bi-LSTM + CRF or IDCNN + CRF
Stars: ✭ 151 (-2.58%)
Mutual labels:  crf
Thulac Python
An Efficient Lexical Analyzer for Chinese
Stars: ✭ 1,619 (+944.52%)
Mutual labels:  chinese-nlp
Crfsharp
CRFSharp is Conditional Random Fields implemented by .NET(C#), a machine learning algorithm for learning from labeled sequences of examples.
Stars: ✭ 110 (-29.03%)
Mutual labels:  crf
React Native Search List
A searchable ListView which supports Chinese PinYin and alphabetical index.
Stars: ✭ 152 (-1.94%)
Mutual labels:  pinyin
Information Extraction Chinese
Chinese Named Entity Recognition with IDCNN/biLSTM+CRF, and Relation Extraction with biGRU+2ATT
Stars: ✭ 1,888 (+1118.06%)
Mutual labels:  chinese-nlp
Id Cnn Cws
Source codes and corpora of paper "Iterated Dilated Convolutions for Chinese Word Segmentation"
Stars: ✭ 129 (-16.77%)
Mutual labels:  crf

g2pC: A Context-aware Grapheme-to-Phoneme for Chinese

Several open-source libraries for Chinese grapheme-to-phoneme conversion exist, such as python-pinyin and xpinyin. However, none of them seems to disambiguate Chinese polyphonic words like "行" ("xíng" (go, walk) vs. "háng" (line)) or "了" ("le" (completed action marker) vs. "liǎo" (finish, achieve)). Instead, they pick the most frequent pronunciation. That is a simple and economical strategy, but machine learning techniques can do better. We use a conditional random field (CRF) to determine the pronunciation of polyphonic words. The features are the target word and its part-of-speech tag, both produced by pkuseg, together with the neighboring words.
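The featurization described above can be sketched roughly as follows. This is a minimal illustration, not g2pC's actual feature set: the function name `token_features` and the feature keys are hypothetical, and only the immediate left and right neighbors are used. Dicts of this shape are the kind of per-token input that sklearn_crfsuite accepts.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag), e.g. from a segmenter


def token_features(tokens: List[Token], i: int) -> Dict[str, str]:
    """Build a feature dict for the token at position i.

    The feature names are hypothetical, chosen for illustration only.
    """
    word, pos = tokens[i]
    feats = {"word": word, "pos": pos}
    # Left neighbor, or a beginning-of-sentence marker.
    if i > 0:
        feats["prev_word"], feats["prev_pos"] = tokens[i - 1]
    else:
        feats["BOS"] = "1"
    # Right neighbor, or an end-of-sentence marker.
    if i < len(tokens) - 1:
        feats["next_word"], feats["next_pos"] = tokens[i + 1]
    else:
        feats["EOS"] = "1"
    return feats


# Segmentation of "我写了几行代码。" copied from the algorithm example in this README.
tokens = [("我", "r"), ("写", "v"), ("了", "u"), ("几", "m"),
          ("行", "q"), ("代码", "n"), ("。", "w")]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[4])
# {'word': '行', 'pos': 'q', 'prev_word': '几', 'prev_pos': 'm', 'next_word': '代码', 'next_pos': 'n'}
```

For "行", the preceding measure-like context ("几", tagged 'm') is exactly the kind of signal a sequence model could use to prefer "hang2" over "xing2".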

Requirements

  • python >= 3.6
  • pkuseg
  • sklearn_crfsuite

Installation

pip install g2pc

Main Features

  • Disambiguates polyphonic Chinese characters/words and returns the most likely pinyin in context, using a CRF implemented with sklearn_crfsuite.
  • Associates the segmentation results from pkuseg with the open-source dictionary CC-CEDICT and displays the following information for each word:
    • word
    • part-of-speech
    • pinyin
    • descriptive pinyin: where Chinese tone change rules are applied
    • English meaning
    • traditional equivalent

Algorithm (illustrated with an example)

e.g., Input: 我写了几行代码。 (I wrote a few lines of code.)

  • STEP 1. Segment input string using pkuseg.

    • -> [('我', 'r'), ('写', 'v'), ('了', 'u'), ('几', 'm'), ('行', 'q'), ('代码', 'n'), ('。', 'w')]
  • STEP 2. Look up each word in CC-CEDICT. Each token is a tuple of word, POS, pronunciation candidates, meaning candidates, and traditional character candidates.

    • -> [('我', 'r', ['wo3'], ['/I/me/my/'], ['我']),
      ('写', 'v', ['xie3'], ['/to write/'], ['寫']),
      ('了', 'u', ['le5', 'liao3', 'liao4'], ['/(modal particle ..'], ['了', '了', '瞭']),
      ('几', 'm', ['ji3', 'ji1'], ['/how much/..'], ['幾', '几']),
      ('行', 'q', ['xing2', 'hang2'], ['/to walk/..'], ['行', '行']),
      ('代码', 'n', ['dai4 ma3'], ['/code/'], ['代碼']),
      ('。', 'w', ['。'], [''], ['。'])]
  • STEP 3. Disambiguate polyphonic words using our pre-trained CRF model. Each candidate list is reduced to a single value.

    • -> [('我', 'r', 'wo3', '/I/me/my/', '我'),
      ('写', 'v', 'xie3', '/to write/', '寫'),
      ('了', 'u', 'le5', '/(modal particle ..', '了'),
      ('几', 'm', 'ji3', '/how much/..', '幾'),
      ('行', 'q', 'hang2', "/row/..", '行'),
      ('代码', 'n', 'dai4 ma3', '/code/', '代碼'),
      ('。', 'w', '。', '', '。')]
  • STEP 4. Apply tone change rules. The resulting descriptive pinyin is added right after the pinyin.

    • -> [('我', 'r', 'wo3', 'wo2', '/I/me/my/', '我'),
      ('写', 'v', 'xie3', 'xie3', '/to write/', '寫'),
      ('了', 'u', 'le5', 'le5', '/(modal particle ..', '了'),
      ('几', 'm', 'ji3', 'ji3', '/how much/..', '幾'),
      ('行', 'q', 'hang2', 'hang2', '/row/..', '行'),
      ('代码', 'n', 'dai4 ma3', 'dai4 ma3', '/code/', '代碼'),
      ('。', 'w', '。', '。', '', '。')]
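STEP 4 does not spell out the tone change rules. One well-known rule that is consistent with the example (wo3 becomes wo2 before xie3) is third-tone sandhi, sketched below. This is a simplified illustration under that assumption, not g2pC's rule set: the function name `third_tone_sandhi` is hypothetical, and other rules (such as those affecting 一 and 不) are not covered.

```python
from typing import List


def third_tone_sandhi(words: List[str]) -> List[str]:
    """Apply a simplified third-tone sandhi over a sentence.

    Each element of `words` is the numbered pinyin of one word, with
    syllables separated by spaces (e.g. "dai4 ma3"). A third-tone syllable
    directly followed by another third-tone syllable is rewritten with
    tone 2. Lookahead uses the original tones, so "3 3 3" becomes "2 2 3".
    """
    # Flatten to syllables, remembering which word each one came from.
    flat = [(w_idx, syl)
            for w_idx, word in enumerate(words)
            for syl in word.split()]
    changed = []
    for k, (w_idx, syl) in enumerate(flat):
        nxt = flat[k + 1][1] if k + 1 < len(flat) else ""
        if syl.endswith("3") and nxt.endswith("3"):
            syl = syl[:-1] + "2"
        changed.append((w_idx, syl))
    # Regroup the syllables into their original words.
    out: List[List[str]] = [[] for _ in words]
    for w_idx, syl in changed:
        out[w_idx].append(syl)
    return [" ".join(syls) for syls in out]


# Pinyin column from STEP 3 of the example above.
pinyin = ["wo3", "xie3", "le5", "ji3", "hang2", "dai4 ma3", "。"]
descriptive = third_tone_sandhi(pinyin)
print(descriptive)
# ['wo2', 'xie3', 'le5', 'ji3', 'hang2', 'dai4 ma3', '。']
```

Working on a flat syllable sequence lets the rule cross word boundaries, which is what the wo3/xie3 pair in the example requires.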

Usage

>>> from g2pc import G2pC
>>> g2p = G2pC()
>>> g2p("一心一意")
# This returns a list of tuples, each of which consists of
# word, pos, pinyin, (tone-changed) descriptive pinyin, English meaning, and equivalent traditional character.
[('一心一意', 
'i', 
'yi1 xin1 yi1 yi4', 
'yi4 xin1 yi2 yi4', 
"/concentrating one's thoughts and efforts/single-minded/bent on/intently/", 
'一心一意')]
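For downstream code, the six positional fields can be easier to handle with names. The sketch below wraps the output shown above in a `NamedTuple`; the class name `Token` and its field names are hypothetical and not part of g2pC. The result is hard-coded from the example so the snippet runs without g2pC installed.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical named view over one g2pC output tuple."""
    word: str
    pos: str
    pinyin: str
    descriptive_pinyin: str
    meaning: str
    traditional: str


# Literal copy of the output for "一心一意" shown in the usage example.
raw = [("一心一意", "i", "yi1 xin1 yi1 yi4", "yi4 xin1 yi2 yi4",
        "/concentrating one's thoughts and efforts/single-minded/bent on/intently/",
        "一心一意")]

tokens: List[Token] = [Token(*t) for t in raw]
for tok in tokens:
    # CC-CEDICT style glosses are slash-delimited; drop the empty edges.
    glosses = [g for g in tok.meaning.split("/") if g]
    print(tok.word, tok.descriptive_pinyin, glosses[0])
```

With the library installed, `raw` would presumably be replaced by the return value of `g2p("一心一意")`.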

Respectful comparison with other libraries

>>> text1 = "我写了几行代码。" # pay attention to the 行, which should be read as 'hang2', not 'xing2'
>>> text2 = "来不了" # pay attention to the 了, which should be read as 'liao3', not 'le'
# python-pinyin (install from a shell first: pip install pypinyin)
>>> from pypinyin import pinyin
>>> pinyin(text1)
[['wǒ'], ['xiě'], ['le'], ['jǐ'], ['xíng'], ['dài'], ['mǎ'], ['。']]
>>> pinyin(text2)
[['lái'], ['bù'], ['le']]
# xpinyin (install from a shell first: pip install xpinyin)
>>> from xpinyin import Pinyin
>>> p = Pinyin()
>>> p.get_pinyin(text1, tone_marks="numbers")  
'wo3-xie3-le5-ji1-xing2-dai4-ma3-。'
>>> p.get_pinyin(text2, tone_marks="numbers")   
'lai2-bu4-le5'

  • Accuracy on internal test set (13,191 syllables)

Model              # Correct   # Incorrect   Acc. (%)
g2pC (0.9.9.3)     13,033      158           98.80
pypinyin (0.35.3)  12,975      216           98.36
xpinyin (0.5.6)    12,838      353           97.32
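
The accuracy column can be re-derived from the correct and incorrect counts. The short check below uses only the numbers reported above; the variable names are illustrative.

```python
# (correct, incorrect) counts copied from the accuracy table.
results = {
    "g2pC (0.9.9.3)": (13033, 158),
    "pypinyin (0.35.3)": (12975, 216),
    "xpinyin (0.5.6)": (12838, 353),
}

TOTAL = 13191  # syllables in the internal test set

accuracies = {}
for model, (correct, incorrect) in results.items():
    # Every row should account for the whole test set.
    assert correct + incorrect == TOTAL
    accuracies[model] = round(100 * correct / TOTAL, 2)
    print(f"{model}: {accuracies[model]:.2f}%")
# g2pC (0.9.9.3): 98.80%
# pypinyin (0.35.3): 98.36%
# xpinyin (0.5.6): 97.32%
```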

Changelog

0.9.9.3 July 10, 2019

  • Refined the tone change rules.

0.9.9.2 July 10, 2019

  • Refined the cedict.pkl.

0.9.9.1 July 9, 2019

  • Fixed a bug where Chinese characters in names could not be found. (See this)

0.9.6. July 7, 2019

  • Fixed a bug where words not found in the dictionary failed to convert.
  • Rearranged the cedict.pkl.
  • Refined the CRF model.
  • Added tone change rules. (See this)

0.9.4. July 4, 2019

  • Initial launch

References

If you use our software for research, please cite:

@misc{gp2C2019,
  author = {Park, Kyubyong},
  title = {g2pC},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2pC}}
}