ye-kyaw-thu / sylbreak

License: Apache-2.0
Syllable segmentation tool for the Myanmar language (Burmese) by Ye.

sylbreak

Myanmar language (Burmese) README

Syllable segmentation is an important preprocessing step for many natural language processing (NLP) tasks, such as romanization, transliteration and grapheme-to-phoneme (g2p) conversion.

"sylbreak" is a syllable segmentation tool for Myanmar language (Burmese) text encoded in Unicode (e.g. text typed with the Myanmar3 or Padauk fonts). It uses only one short regular expression (RE), as follows:

$line =~ s/((?<!$ssSymbol)[$myConsonant](?![$aThat$ssSymbol])|[$enChar$otherChar])/$sep$1/g;

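The key point is the first alternative: a break is placed before a consonant that is not preceded by a subscript symbol AND is not followed by an a-That character or a subscript symbol. The second alternative places a break before any character listed in $enChar or $otherChar.

The variables are declared as follows: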

my $myConsonant = "က-အ";
my $enChar = "a-zA-Z0-9";
my $otherChar = "ဣဤဥဦဧဩဪဿ၌၍၏၀-၉၊။!-\/:-\@\[-`{-~\\s";
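my $ssSymbol = "\x{1039}";   # U+1039 MYANMAR SIGN VIRAMA (subscript/stacking symbol)
my $aThat = "\x{103A}";      # U+103A MYANMAR SIGN ASAT (a-That)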
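
As a rough illustration, the same substitution can be sketched in Python. This is not the repository's sylbreak.py: the names `BREAK_PATTERN` and `syllable_break` and the default separator `|` are chosen here for the example, and the sketch assumes the subscript symbol is U+1039 and the a-That character is U+103A.

```python
import re

# Character sets mirroring the Perl variables (used inside [...] classes).
MY_CONSONANT = "က-အ"
EN_CHAR = "a-zA-Z0-9"
OTHER_CHAR = r"ဣဤဥဦဧဩဪဿ၌၍၏၀-၉၊။!-/:-@\[-`{-~\s"
SS_SYMBOL = "\u1039"  # assumption: subscript (stacking) symbol
A_THAT = "\u103a"     # assumption: a-That (asat) character

# Alternative 1: a consonant not preceded by the subscript symbol and
#                not followed by a-That or the subscript symbol.
# Alternative 2: any English letter, digit, or other listed character.
BREAK_PATTERN = re.compile(
    rf"((?<!{SS_SYMBOL})[{MY_CONSONANT}](?![{A_THAT}{SS_SYMBOL}])"
    rf"|[{EN_CHAR}{OTHER_CHAR}])"
)


def syllable_break(text: str, sep: str = "|") -> str:
    """Insert `sep` before every position where the pattern matches."""
    return BREAK_PATTERN.sub(lambda m: sep + m.group(1), text)


if __name__ == "__main__":
    word = "\u1019\u103c\u1014\u103a\u1019\u102c"  # မြန်မာ
    print(syllable_break(word))  # |မြန်|မာ
```

As with the Perl substitution, the separator is also written before the first syllable. A stacked consonant stays attached to the syllable before it, because the consonant above the subscript symbol fails the lookahead and the consonant below it fails the lookbehind.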

Fig. Visualization of sylbreak RE

The shell (sylbreak.sh), Perl (sylbreak.pl) and Python (sylbreak.py) scripts require no installation. I plan to port sylbreak to more programming languages, such as C++ and Ruby, in the near future.

Enjoy syllable breaking!

Ye@Lab

Acknowledgement

Thanks to Swan Htet Aung, who reported my typo in $otherChar ... ဥဥ ---> ဥဦ
The sylbreak RE example programs for Java and JavaScript were written by Chan Mrate Ko Ko.

Reference

  1. Dr. Thein Tun, Acoustic Phonetics and The Phonology of the Myanmar Language
  2. Romanization: https://en.wikipedia.org/wiki/Romanization
  3. Myanmar Unicode: http://unicode.org/charts/PDF/U1000.pdf
  4. Syllable segmentation algorithm of Myanmar text: http://gii2.nagaokaut.ac.jp/gii/media/share/20080901-ZMM%20Presentation.pdf
  5. Zin Maung Maung and Yoshiki Mikami, "A rule-based syllable segmentation of Myanmar Text", in Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, January 2008, Hyderabad, India, pp. 51-58.
  6. Tin Htay Hlaing, "Manually constructed context-free grammar for Myanmar syllable structure", in Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 32-37.
  7. Ye Kyaw Thu, Andrew Finch, Yoshinori Sagisaka and Eiichiro Sumita, "A Study of Myanmar Word Segmentation Schemes for Statistical Machine Translation", in Proceedings of the 11th International Conference on Computer Applications (ICCA 2013), February 26-27, 2013, Yangon, Myanmar, pp. 167-179.
  8. Ye Kyaw Thu, Andrew Finch, Win Pa Pa, and Eiichiro Sumita, "A Large-scale Study of Statistical Machine Translation Methods for Myanmar Language", in Proceedings of SNLP2016, February 10-12, 2016, Phranakhon Si Ayutthaya, Thailand.
  9. Regular Expression: https://en.wikipedia.org/wiki/Regular_expression
  10. Debuggex: https://www.debuggex.com/