
speechio / Chinese_text_normalization

License: MIT

Chinese text normalization for speech processing


Labels

chinese speech-recognition asr


Chinese Text Normalization for Speech Processing

Problem

Search for "Text Normalization" (TN) on Google and GitHub, and you can hardly find open-source projects that are "ready-to-use" for text normalization tasks. Instead, you find a bunch of NLP toolkits or frameworks that support TN functionality. There is quite some work between "support text normalization" and "do text normalization".

Reason

TN is language-dependent, more or less.

Some TN processing methods are shared across languages, but a good TN module always involves language-specific knowledge and treatment.

TN is task-specific.

Even for the same language, different applications require quite different TN.

TN is "dirty".

Constructing and maintaining a set of TN rewrite rules is painful, whichever toolkit or framework you choose. The subtle, intrinsic complexity lies in the TN task itself, not in the tools or frameworks.

A mature TN module is an asset.

Since constructing and maintaining TN is hard, a mature module is an asset for commercial companies, so you are unlikely to find a product-level TN in the open-source community (correct me if you find any).

TN is a less important topic for both academia and industry.

Goal

This project sets up a ready-to-use TN module for Chinese. Since my background is speech processing, this project should be able to handle the most common TN tasks in Chinese ASR text processing pipelines.

Normalizers

supported NSW (Non-Standard-Word) Normalization

| NSW type | raw | normalized |
|---|---|---|
| cardinal | 这块黄金重达324.75克 | 这块黄金重达三百二十四点七五克 |
| date | 她出生于86年8月18日，她弟弟出生于1995年3月1日 | 她出生于八六年八月十八日她弟弟出生于一九九五年三月一日 |
| digit | 电影中梁朝伟扮演的陈永仁的编号27149 | 电影中梁朝伟扮演的陈永仁的编号二七一四九 |
| fraction | 现场有7/12的观众投出了赞成票 | 现场有十二分之七的观众投出了赞成票 |
| money | 随便来几个价格12块5，34.5元，20.1万 | 随便来几个价格十二块五三十四点五元二十点一万 |
| percentage | 明天有62％的概率降雨 | 明天有百分之六十二的概率降雨 |
| telephone | 这是固话0421-33441122 这是手机+86 18544139121 | 这是固话零四二一三三四四一一二二这是手机八六一八五四四一三九一二一 |

Acknowledgement: the NSW normalization code is based on Zhiyang Zhou's work.
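
As a rough illustration of how rows of this table can be produced, here is a minimal sketch covering the cardinal, digit, fraction, and percentage types. It is not the project's actual code: the names `read_digits`, `read_integer`, `read_cardinal`, and `normalize_numbers` are hypothetical, the patterns are deliberately simplistic, and integers are assumed to be below 10000.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]  # ones, tens, hundreds, thousands


def read_digits(number: str) -> str:
    """Read a digit string one digit at a time (the 'digit' style)."""
    return "".join(DIGITS[int(ch)] for ch in number)


def read_integer(n: int) -> str:
    """Read an integer as a Chinese cardinal. Sketch only: 0 <= n < 10000."""
    if not 0 <= n < 10000:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return "零"
    digits = str(n)
    parts = []
    pending_zero = False
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            # Interior zeros are read as a single 零; trailing zeros are silent.
            pending_zero = True
            continue
        if pending_zero:
            parts.append("零")
            pending_zero = False
        parts.append(DIGITS[d] + UNITS[len(digits) - 1 - i])
    text = "".join(parts)
    # 10..19 are read 十, 十一, ... rather than 一十, 一十一, ...
    return text[1:] if text.startswith("一十") else text


def read_cardinal(number: str) -> str:
    """Read '324.75' as integer part + 点 + digit-by-digit fraction."""
    integer, _, fraction = number.partition(".")
    text = read_integer(int(integer))
    if fraction:
        text += "点" + read_digits(fraction)
    return text


FRACTION = re.compile(r"([0-9]+)/([0-9]+)")
PERCENT = re.compile(r"([0-9]+(?:\.[0-9]+)?)[%％]")
CARDINAL = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def normalize_numbers(text: str) -> str:
    """Apply the more specific patterns first, plain cardinals last."""
    text = FRACTION.sub(
        lambda m: read_integer(int(m.group(2))) + "分之" + read_integer(int(m.group(1))),
        text,
    )
    text = PERCENT.sub(lambda m: "百分之" + read_cardinal(m.group(1)), text)
    return CARDINAL.sub(lambda m: read_cardinal(m.group(0)), text)


print(normalize_numbers("这块黄金重达324.75克"))  # 这块黄金重达三百二十四点七五克
```

The order of the substitutions matters: if the cardinal pattern ran first, it would consume the digits that the fraction and percentage patterns need to see.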

punctuation removal

For Chinese, it removes the punctuation collected in the Zhon project, which consists of:

non-stop puncs

'＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'

stop puncs
```
'！？｡。'
```

For English, it removes Python's string.punctuation
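
A minimal sketch of this step, assuming the two Chinese lists above and `string.punctuation` are simply merged into one deletion table. The function name `remove_punctuation` is hypothetical and the project's own implementation may differ.

```python
import string

# Copied from the lists above (collected in the Zhon project).
CN_NON_STOP_PUNCS = "＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏"
CN_STOP_PUNCS = "！？｡。"

# str.maketrans with a third argument maps every listed character to None,
# so translate() deletes them in a single pass.
_DELETE_TABLE = str.maketrans(
    "", "", CN_NON_STOP_PUNCS + CN_STOP_PUNCS + string.punctuation
)


def remove_punctuation(text: str) -> str:
    """Drop Chinese and ASCII punctuation, keeping everything else."""
    return text.translate(_DELETE_TABLE)


print(remove_punctuation("Hello, world!"))  # Hello world
```

Because `string.punctuation` contains `.`, `/`, and `%`, a step like this would need to run after NSW normalization. Otherwise `34.5` would collapse to `345` before the number is read.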

multilingual English word upper/lower case conversion

Since ASR/TTS lexicons usually unify English entries to uppercase or lowercase, the TN module should adapt to the lexicon accordingly.
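
A minimal sketch of the idea, assuming the target case is chosen by a hypothetical `mode` argument. The function name `unify_case` is illustrative and not taken from the project.

```python
def unify_case(text: str, mode: str = "upper") -> str:
    """Convert English words to one case; Chinese characters have no case
    and pass through unchanged."""
    if mode == "upper":
        return text.upper()
    if mode == "lower":
        return text.lower()
    raise ValueError("mode must be 'upper' or 'lower'")


print(unify_case("我用iPhone打电话"))  # 我用IPHONE打电话
```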

Supported text format

Plain text, preferably one sentence per line (the most common case in ASR processing):
```
今天早饭吃了没
没吃回家吃去吧
...
```
Plain text is the default format.

Kaldi's transcription format:
```
KALDI_KEY_UTT001    今天早饭吃了没
KALDI_KEY_UTT002    没吃回家吃去吧
...
```
TN skips the key in the first column and normalizes only the transcription text that follows.

Pass the --has_key option to switch to Kaldi format.

Note: all input text should be UTF-8 encoded.
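
The key-skipping behaviour can be sketched as follows. This is an assumption-laden illustration, not the project's code: `normalize_line` and its `normalize` callback are hypothetical names, and joining key and text with a tab is a choice made for this sketch.

```python
from typing import Callable


def normalize_line(line: str, normalize: Callable[[str], str],
                   has_key: bool = False) -> str:
    """Normalize one line, leaving a leading Kaldi utterance key untouched."""
    line = line.rstrip("\n")
    if not has_key:
        return normalize(line)
    # Split on the first run of whitespace only, so spaces inside the
    # transcription are preserved.
    parts = line.split(maxsplit=1)
    if not parts:
        return ""
    key = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return key + "\t" + normalize(text)


def strip_full_stop(text: str) -> str:
    """Stand-in normalizer for the demo: drop the Chinese full stop."""
    return text.replace("。", "")


print(normalize_line("KALDI_KEY_UTT001    今天早饭吃了没。", strip_full_stop, has_key=True))
```

When reading input files for such a function, opening them with `encoding="utf-8"` matches the encoding requirement stated above.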

Run examples

TN (python)

Make sure you have Python 3; Python 2.x won't work correctly.

Run sh run.sh in the TN dir, then compare the raw text with the normalized text.

ITN (thrax)

Make sure you have Thrax installed and that its binaries are on your PATH.

Run sh run.sh in the ITN dir. Check the Makefile for grammar dependencies.

possible future work

Since TN is a typical "done is better than perfect" module in the context of ASR, and the current state is sufficient for my purposes, I probably won't update this repo frequently.

There are still some things that need improvement:

For TN, the NSW normalizers in the TN dir are based on regular expressions. I've found some unintended matches, so those patterns need to be refined for more precise TN coverage.
For ITN, the Thrax rewrite grammars should be extended to cover more scenarios.
Furthermore, commercial systems have started to introduce RNN-like models into TN, and a mix of rule-based and model-based systems is the state of the art. For further reading, look for Richard Sproat and Kyle Gorman's work at Google.
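
To make the "unintended matches" point concrete, here is a hypothetical example of the kind of refinement involved. Neither pattern is taken from the project: `NAIVE_FRACTION` and `STRICT_FRACTION` are illustrative names, and the slash-separated date is an assumed failure case.

```python
import re

# A naive fraction pattern also fires inside slash-separated dates.
NAIVE_FRACTION = re.compile(r"[0-9]+/[0-9]+")

# Refinement: refuse to match when the candidate touches another digit or
# slash on either side, which rules out strings like 2019/12/25.
STRICT_FRACTION = re.compile(r"(?<![0-9/])[0-9]+/[0-9]+(?![0-9/])")

date_text = "日期2019/12/25"
fraction_text = "现场有7/12的观众投出了赞成票"

print(NAIVE_FRACTION.search(date_text).group(0))       # 2019/12 (unintended)
print(STRICT_FRACTION.search(date_text))               # None
print(STRICT_FRACTION.search(fraction_text).group(0))  # 7/12
```

Lookarounds like these narrow a pattern without consuming extra characters, but every such rule trades coverage for precision, which is part of why rule maintenance is described above as painful.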

