wenet-e2e / WeTextProcessing

Licence: Apache-2.0 license
Text Normalization & Inverse Text Normalization

Programming Languages: Python, C++, CMake, C

Projects that are alternatives of or similar to WeTextProcessing

TextDatasetCleaner
🔬 Cleaning junk out of datasets (normalization, preprocessing)
Stars: ✭ 27 (-87.32%)
Mutual labels:  text-processing, normalization
Regex Automata
A low level regular expression library that uses deterministic finite automata.
Stars: ✭ 203 (-4.69%)
Mutual labels:  text-processing
Prenlp
Preprocessing Library for Natural Language Processing
Stars: ✭ 130 (-38.97%)
Mutual labels:  text-processing
Textvec
Text vectorization tool to outperform TFIDF for classification tasks
Stars: ✭ 167 (-21.6%)
Mutual labels:  text-processing
Stanza Old
Stanford NLP group's shared Python tools.
Stars: ✭ 142 (-33.33%)
Mutual labels:  text-processing
Fastnlp
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
Stars: ✭ 2,441 (+1046.01%)
Mutual labels:  text-processing
Libasciidoc
A Golang library for processing Asciidoc files.
Stars: ✭ 129 (-39.44%)
Mutual labels:  text-processing
rake-rs
Multilingual implementation of RAKE algorithm for Rust
Stars: ✭ 30 (-85.92%)
Mutual labels:  text-processing
Rust Unic
UNIC: Unicode and Internationalization Crates for Rust
Stars: ✭ 189 (-11.27%)
Mutual labels:  text-processing
Nlpre
Python library for Natural Language Preprocessing (NLPre)
Stars: ✭ 158 (-25.82%)
Mutual labels:  text-processing
Jaconv
Pure-Python Japanese character interconverter for Hiragana, Katakana, Hankaku and Zenkaku
Stars: ✭ 157 (-26.29%)
Mutual labels:  text-processing
Browsecloud
A web app to create and browse text visualizations for automated customer listening.
Stars: ✭ 143 (-32.86%)
Mutual labels:  text-processing
Sd
Intuitive find & replace CLI (sed alternative)
Stars: ✭ 2,755 (+1193.43%)
Mutual labels:  text-processing
Tmtoolkit
Text Mining and Topic Modeling Toolkit for Python with parallel processing power
Stars: ✭ 135 (-36.62%)
Mutual labels:  text-processing
Stringi
THE String Processing Package for R (with ICU)
Stars: ✭ 204 (-4.23%)
Mutual labels:  text-processing
Konoha
🌿 An easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code.
Stars: ✭ 130 (-38.97%)
Mutual labels:  text-processing
Japanese.js
Util collection for Japanese text processing. Hiraganize, Katakanize, and Romanize.
Stars: ✭ 150 (-29.58%)
Mutual labels:  text-processing
Text Detector
Tool which allow you to detect and translate text.
Stars: ✭ 173 (-18.78%)
Mutual labels:  text-processing
text-analysis
Weaving analytical stories from text data
Stars: ✭ 12 (-94.37%)
Mutual labels:  text-processing
twitter-text-python
Twitter Text Libraries for Python
Stars: ✭ 22 (-89.67%)
Mutual labels:  text-processing

Text Normalization & Inverse Text Normalization

0. Brief Introduction

WeTextProcessing: Production First & Production Ready Text Processing Toolkit

0.1 Text Normalization

Cover

0.2 Inverse Text Normalization

Cover

1. How To Use

1.1 Quick Start:

# install
pip install WeTextProcessing
# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize("2.5平方电线")
# itn usage
>>> from itn.chinese.inverse_normalizer import InverseNormalizer
>>> invnormalizer = InverseNormalizer()
>>> invnormalizer.normalize("二点五平方电线")
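
The quick start calls `normalize` on one string at a time. For a file or list of utterances, a small wrapper can apply the same call per line. The sketch below is only an illustration: `normalize_lines` is a hypothetical helper, not part of WeTextProcessing, and it assumes only that the object passed in exposes `normalize(text)` as `Normalizer` and `InverseNormalizer` do above.

```python
from typing import Iterable, List


def normalize_lines(processor, lines: Iterable[str]) -> List[str]:
    """Apply processor.normalize to every non-blank line.

    `processor` is any object with a normalize(text) -> str method,
    for example a Normalizer or an InverseNormalizer instance.
    This helper is hypothetical and not shipped with the package.
    """
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines rather than passing them to the rules
        results.append(processor.normalize(text))
    return results
```

With the package installed, `normalize_lines(Normalizer(), ["2.5平方电线"])` would return a one-element list holding the normalized string. Building the `Normalizer` once and reusing it across lines avoids constructing it again for each input.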

1.2 Advanced Usage:

Write your own rules and deploy WeTextProcessing with the C++ runtime.

To modify or adapt the TN/ITN rules to fix bad cases, try:

git clone https://github.com/wenet-e2e/WeTextProcessing.git
cd WeTextProcessing
# `overwrite_cache` will rebuild all rules according to
#   your modifications on tn/chinese/rules/xx.py (itn/chinese/rules/xx.py).
#   After rebuild, you can find new far files at `$PWD/tn` and `$PWD/itn`.
python normalize.py --text "2.5平方电线" --overwrite_cache
python inverse_normalize.py --text "二点五平方电线" --overwrite_cache

Once you have successfully rebuilt your rules, you can deploy them either with your installed PyPI package:

# tn usage
>>> from tn.chinese.normalizer import Normalizer
>>> normalizer = Normalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn")
>>> normalizer.normalize("2.5平方电线")
# itn usage
>>> from itn.chinese.inverse_normalizer import InverseNormalizer
>>> invnormalizer = InverseNormalizer(cache_dir="PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn")
>>> invnormalizer.normalize("二点五平方电线")

Or with the C++ runtime:

cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release
cmake --build build
# tn usage
cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/tn
./build/processor_main --tagger $cache_dir/zh_tn_tagger.fst --verbalizer $cache_dir/zh_tn_verbalizer.fst --text "2.5平方电线"
# itn usage
cache_dir=PATH_TO_GIT_CLONED_WETEXTPROCESSING/itn
./build/processor_main --tagger $cache_dir/zh_itn_tagger.fst --verbalizer $cache_dir/zh_itn_verbalizer.fst --text "二点五平方电线"
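
If the C++ binary needs to be driven from a script, the two invocations above can be wrapped in Python. This is a sketch under stated assumptions: `build_processor_cmd` and `run_processor` are hypothetical helpers, the flags and `.fst` file names are copied from the commands above, and it is assumed (not confirmed by this README) that `processor_main` prints its result to standard output.

```python
import subprocess
from pathlib import Path
from typing import List


def build_processor_cmd(binary: str, cache_dir: str, task: str, text: str) -> List[str]:
    """Build the argument list for processor_main; task is "tn" or "itn"."""
    if task not in ("tn", "itn"):
        raise ValueError(f"task must be 'tn' or 'itn', got {task!r}")
    cache = Path(cache_dir)
    return [
        binary,
        "--tagger", str(cache / f"zh_{task}_tagger.fst"),
        "--verbalizer", str(cache / f"zh_{task}_verbalizer.fst"),
        "--text", text,
    ]


def run_processor(binary: str, cache_dir: str, task: str, text: str) -> str:
    """Run the binary and return its stdout.

    Assumption: the processed text is written to stdout.
    """
    cmd = build_processor_cmd(binary, cache_dir, task, text)
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return completed.stdout.strip()
```

Passing the arguments as a list rather than a shell string means the input text needs no extra quoting, which matters for text containing spaces or shell metacharacters.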

2. TN Pipeline

Please refer to TN.README

3. ITN Pipeline

Please refer to ITN.README

Discussion & Communication

Chinese users can also scan the QR code on the left to follow the official WeNet account. We have created a WeChat group for better discussion and quicker responses. Please scan the personal QR code on the right, and its owner will invite you to the chat group.

You can also start a discussion directly in GitHub Issues.

Acknowledgements

  1. Thanks to the authors of foundational libraries such as OpenFst and Pynini.
  2. Thanks to the NeMo team and the NeMo open-source community.
  3. Thanks to Zhenxiang Ma, Jiayu Du, and the SpeechColab organization.
  4. We referred to Pynini for reading the FAR and for printing the shortest path of a lattice in the C++ runtime.
  5. We referred to the TN of NeMo for the data used to build the tagger graph.
  6. We referred to the ITN of chinese_text_normalization for the data used to build the tagger graph.