wenet-e2e / WeTextProcessing

License: Apache-2.0

Text Normalization & Inverse Text Normalization

Labels

text-processing production-ready normalization

0. Brief Introduction

WeTextProcessing: Production First & Production Ready Text Processing Toolkit

0.1 Text Normalization

[Image: Cover]

0.2 Inverse Text Normalization

[Image: Cover]

1. How To Use

1.1 Quick Start:

# install
pip install WeTextProcessing

# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize("2.5平方电线")
# itn usage
>>> from itn.chinese.inverse_normalizer import InverseNormalizer
>>> invnormalizer = InverseNormalizer()
>>> invnormalizer.normalize("二点五平方电线")

1.2 Advanced Usage:

DIY your own rules and deploy WeTextProcessing with the C++ runtime!

If you want to modify or adapt the TN/ITN rules to fix bad cases, try:

git clone https://github.com/wenet-e2e/WeTextProcessing.git
cd WeTextProcessing
# `--overwrite_cache` rebuilds all rules from your modifications
#   to tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
#   After the rebuild, the new far files are in `$PWD/tn` and `$PWD/itn`.
python normalize.py --text "2.5平方电线" --overwrite_cache
python inverse_normalize.py --text "二点五平方电线" --overwrite_cache
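After a rebuild it can be useful to confirm that the compiled grammar files are where the deployment steps expect them. The sketch below checks for the four `.fst` file names that appear in this README's C++ runtime commands. The helper name `missing_fsts` is hypothetical, and the assumption that these files sit directly in the `tn` and `itn` directories of the clone is inferred from those commands rather than from a documented API.

```python
from pathlib import Path
from typing import List

# File names taken from the C++ runtime commands in this README.
EXPECTED = {
    "tn": ["zh_tn_tagger.fst", "zh_tn_verbalizer.fst"],
    "itn": ["zh_itn_tagger.fst", "zh_itn_verbalizer.fst"],
}


def missing_fsts(repo_root: str) -> List[str]:
    """Return the expected FST files (as "subdir/name") that are absent under repo_root."""
    root = Path(repo_root)
    missing = []
    for subdir, names in EXPECTED.items():
        for name in names:
            if not (root / subdir / name).is_file():
                missing.append(f"{subdir}/{name}")
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the root of the cloned repository.
    for path in missing_fsts("."):
        print("missing:", path)
```

An empty result suggests both rebuilds wrote their outputs; any listed path points at the rebuild command that should be rerun.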

Once you have successfully rebuilt your rules, you can deploy them either with your installed PyPI package:

# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
>>> normalizer.normalize("2.5平方电线")
# itn usage
>>> from itn.chinese.inverse_normalizer import InverseNormalizer
>>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
>>> invnormalizer.normalize("二点五平方电线")

Or with the C++ runtime:

cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
cmake --build build
# tn usage
cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn
./build/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线"
# itn usage
cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn
./build/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线"
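If the C++ binary is driven from a Python service or test script, building the argument list in one place avoids shell-quoting mistakes. The sketch below only assembles the command using the flags and file names shown above (`--tagger`, `--verbalizer`, `--text`); the function name `processor_command` is hypothetical, and the example paths are placeholders.

```python
import shlex
from pathlib import Path
from typing import List


def processor_command(binary: str, cache_dir: str, mode: str, text: str) -> List[str]:
    """Build the argument list for processor_main; mode is "tn" or "itn"."""
    if mode not in ("tn", "itn"):
        raise ValueError(f"mode must be 'tn' or 'itn', got {mode!r}")
    cache = Path(cache_dir)
    return [
        binary,
        "--tagger", str(cache / f"zh_{mode}_tagger.fst"),
        "--verbalizer", str(cache / f"zh_{mode}_verbalizer.fst"),
        "--text", text,
    ]


if __name__ == "__main__":
    cmd = processor_command(
        "./build/processor_main", "PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn", "tn", "2.5平方电线"
    )
    # Print a shell-quoted form for copy and paste.
    print(shlex.join(cmd))
    # To execute it, pass the list to subprocess.run(cmd, check=True).
```

Passing the list directly to `subprocess.run` skips the shell entirely, so input text containing spaces or quotes reaches the binary unchanged.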

2. TN Pipeline

Please refer to TN.README

3. ITN Pipeline

Please refer to ITN.README

Discussion & Communication

Chinese users can also scan the QR code on the left to follow the official WeNet account. We created a WeChat group for better discussion and quicker responses. Please scan the personal QR code on the right; that person will invite you to the chat group.

Or you can discuss directly on GitHub Issues.

Acknowledgements

Thanks to the authors of foundational libraries such as OpenFst and Pynini.
Thanks to the NeMo team and the NeMo open-source community.
Thanks to Zhenxiang Ma, Jiayu Du, and the SpeechColab organization.
We referred to Pynini for reading the FAR and printing the shortest path of a lattice in the C++ runtime.
We referred to the TN of NeMo for the data used to build the tagger graph.
We referred to the ITN of chinese_text_normalization for the data used to build the tagger graph.
