All Projects → Kyubyong → Mtp

Kyubyong / Mtp

Multi-lingual Text Processing

Projects that are alternatives of or similar to Mtp

Gohn
Hatena Notation (はてな記法) Parser written in Go
Stars: ✭ 17 (-80.46%)
Mutual labels:  text-processing
Pipeit
PipeIt is a text transformation, conversion, cleansing and extraction tool.
Stars: ✭ 57 (-34.48%)
Mutual labels:  text-processing
Unix Text Commands
Unix Text Processing Command Reference
Stars: ✭ 78 (-10.34%)
Mutual labels:  text-processing
Text Mining
Text Mining in Python
Stars: ✭ 18 (-79.31%)
Mutual labels:  text-processing
Pyparsing
Python library for creating PEG parsers
Stars: ✭ 1,052 (+1109.2%)
Mutual labels:  text-processing
Javascript Text Expander
Expands texts as you type, naturally
Stars: ✭ 58 (-33.33%)
Mutual labels:  text-processing
Python Nameparser
A simple Python module for parsing human names into their individual components
Stars: ✭ 462 (+431.03%)
Mutual labels:  text-processing
Ios11 Visionframework
Vision Framework IOS WWDC 2017
Stars: ✭ 85 (-2.3%)
Mutual labels:  text-processing
Lingua Franca
Mycroft's multilingual text parsing and formatting library
Stars: ✭ 51 (-41.38%)
Mutual labels:  text-processing
Virastar
Cleaning-up Persian Texts!
Stars: ✭ 77 (-11.49%)
Mutual labels:  text-processing
Concise Ipython Notebooks For Deep Learning
Ipython Notebooks for solving problems like classification, segmentation, generation using latest Deep learning algorithms on different publicly available text and image data-sets.
Stars: ✭ 23 (-73.56%)
Mutual labels:  text-processing
Qp Trie Rs
An idiomatic and fast QP-trie implementation in pure Rust.
Stars: ✭ 47 (-45.98%)
Mutual labels:  text-processing
Applied Text Mining In Python
Repo for Applied Text Mining in Python (coursera) by University of Michigan
Stars: ✭ 59 (-32.18%)
Mutual labels:  text-processing
Chr
🔤 Lightweight R package for manipulating [string] characters
Stars: ✭ 18 (-79.31%)
Mutual labels:  text-processing
Kefirbb
A flexible Java text processor. BB, BBCode, BB-code, HTML, Textile, Markdown, parser, translator, converter.
Stars: ✭ 83 (-4.6%)
Mutual labels:  text-processing
Whatlanggo
Natural language detection library for Go
Stars: ✭ 479 (+450.57%)
Mutual labels:  text-processing
Go Search Replace
🚀 Search & replace URLs in WordPress SQL files.
Stars: ✭ 57 (-34.48%)
Mutual labels:  text-processing
Nostril
Nostril: Nonsense String Evaluator
Stars: ✭ 86 (-1.15%)
Mutual labels:  text-processing
Node Rake
A NodeJS implementation of the Rapid Automatic Keyword Extraction algorithm.
Stars: ✭ 85 (-2.3%)
Mutual labels:  text-processing
Ter
Text Expression Runner – Readable and easy to use text expressions
Stars: ✭ 67 (-22.99%)
Mutual labels:  text-processing

Multi-lingual Text Processing

This is for my tech talk at Naver on September 6, 2018.

Why Multi-lingual Text Processing?

Yes! Modeling is fancy. Data processing is tedious. You don't want to do that. I know. But from my experience it's often data processing that determines the performance of your experiement rather than modeling. If you can't avoid, it's better do it right.

Why Multi-lingual Text Processing?

You can obtain many techniques of image processing through many routes. More importantly, I'm not an expert in it. Let me focus on text, which is one of the two most typical modalities along with sound when handling language .

Why Multi-lingual Text Processing?

If you're interested in a single language, say, English, it's fine. But if you touch a language you're not familiar with for some reason, you may need some knowledge on it.

Basic Text Processing

(Main source: Lecture slides from the Stanford Coursera course)

Regular Expressions

  • Syntax for processing strings
  • LIBRARY regex (third-party): You can use unicode category expressions such as '\p{Han}' for all Chinese characters and '\p{Latin}' for the Latin script.
  • ONLINE https://regexr.com/
  • SOFTWARE PowerGrep

Tokenization

  • Token: a unit like character, subword (bpe), word, mwe, sentence, etc.
  • Character
    • Simple (😄)
    • Small vocabulary (< 100) (😄)
    • Robust to rare words (😄)
    • Long sequence (😭)
  • Subword
    • Best performance in machine translation (😄)
    • Robust to rare words (😄)
    • Not intuitive (😭)
    • Data-dependent (😭)
  • Word
    • Usually simple (😄)
    • Short sequence (😄)
    • Transfer learning (😄)
    • Large vocabulary (> 10000) (😭)
    • Weak in rare words (😭)
  • MWE (Multi-word expression)
    • Idioms e.g., ‘kick the bucket’
    • Compounds e.g., ‘San Francisco’
    • Phrasal verbs e.g. ‘get … across’
    • PROJECT Multiword Expression Project
  • Sentence
    • Usually identified by a sentenc ending symbol (.!?)
    • Period (.) is sometimes ambiguous.
    • Abbreviations like Inc. or Dr.
    • Numbers like .02% or 4.3

Normalization

Lemmatization

  • Lemma: the canonical or dictionary form of a set of words
    • E.g., produce, produced, production -> produce
  • WHY? Dictionary lookup
  • HOW? Linguistic knowledge
  • LIBRARY nltk wordnet lemmatizer

Stemming

  • Stem: the part of the word that never changes even when morphologically inflected
    • E.g., produce, produced, production -> produc-
  • WHY? Query-document match
  • HOW? Sequence of rules
  • LIBRARY nltk stemmers

Unicode Normalization

(Main source: unicode.org)

  • Canonical equivalence: a fundamental equivalency between characters which represent the same abstract character
    • E.g., combining sequence: Ç ↔ C+◌̧
    • E.g., ordering of combining marks: q+◌̇+◌̣ ↔ q+◌̣+◌̇
  • Compatibility equivalence: a weaker type of equivalence between characters which represent the same abstract character, but which may have distinct visual appearances or behaviors
    • E.g., circled variants: ① → 1
    • E.g., width variants: カ → カ
  • NFD: Canonical Decomposition
  • NFKD: Compatibility Decomposition
  • NFC: NFD + Canonical Composition
  • NFKC: NFKD + Canonical Composition
  • Examples
  • Typically NFC is desirable for string matching.
  • NFKC is useful if you don't want to distinguish compatibility-equivalent characters like full- and half-width characters.
  • Strip diacritics: to ASCII characters
import unicodedata
def strip_diacritics(str):
	return ''.join(char for char in unicodedata.normalize('NFD', str)
                   if unicodedata.category(char) != 'Mn')

Writing Systems

(Main source: omniglot)

Alphabets

  • Corresponds to one or more phonemes.
  • Latin alphabet (AaBbCc), Cyrillic alphabet (кириллица), Hangul (한글)
  • Hangul
  • There is a fixed order.
  • Consonants and vowels stand alone.
  • Desirable for computer processing.

Abjads (= Consonant alphabets)

  • Each letter stands for a consonant, leaving the reader to supply the vowel.
  • "Cn y ndrstnd ths?"
  • Arabic script (عربى), Hebrew script (עִברִית)
  • 'book' in Arabic (= 'kitaab')

Abugidas

  • Consonants (Primary) + Vowels (Secondary)
  • Devanagari (देवनागरी), Tamil (தமிழ்)
  • Devanagari compounds

Syllabaries

  • Corresponds to a syllable that is not further decomposed.
  • Hiragana (ひらがな), Katakana (カタカナ)
  • Phonemic transcription is often useful.
    • E.g., かわいい -> ka wa i i

Logographs

  • Each letter represents an abstract concept.
  • Chinese characters
  • Many letters
  • Challenging for processing
  • Phonemic transcription is often useful.
    • E.g., 我爱你 -> wǒ ài nǐ

IPA (International Phonetic Alphabet)

  • Universal alphabet
  • IPA Chart
  • Each distinctive sound is represented as a single letter. (/sh/ -> /ʃ/, /th/ -> /θ/, /ng/ -> /ŋ/)
  • Slashes (/ /) for phonemic transcription (e.g., 'pin' /pɪn/ vs. 'spin' /spɪn/)
  • Square brackets ([ ]) for phonetic transcription. (e.g., 'pin' [ɪn] vs. 'spin' [spɪn])

ARPABET

Languages

(Main sources: Relevant Wiki pages)

Arabic

  • CHAR SET [ \p{Arabic}.؟!،0-9]
  • Written from right to left
  • Cursive
  • No distinct upper and lower case letter forms
  • Comma (،), and question mark (؟) are different from those of English.
  • Many dialects with varying orthographies exist.
  • Clitics are attached to a stem any orthographic marks like an apostrophe. (See Fahad Alotaiby et al.)
    • مستواك "your level" -> ك "your" + مستوى "level"
  • TOOL Stanford Arabic Segmenter

Dutch

  • CHAR SET [ A-Za-z.!?'-0-9]
    • Digraph 'ij' is considered the same as 'y'. (See this)

English

  • CHAR SET [ A-Za-z.!?'-0-9]
  • Diacrtics are optional.
    • E.g., naïve = naive, façade = facade, résumé = resume
  • Period (.) is used at the end of a sentence or for abbreviations.
    • E.g., etc., i.e., e.g.
  • Most hyphens in compounds can be replaced with a space.
    • E.g., state-of-the-art = state of the art
  • Apostrophe (') can construct clitics.
    • E.g. I'm (=I am), we've (=we have)
  • The closing quotation mark (’) and apostrophe (') are often mixed up. (Read this)
  • Many words have more than one spelling. (E.g., gray / grey)
  • Graphemes and phonemes are not directly linked. In other words, it's not always possible to infer the pronunciation of a word from its spelling. Therefore in speech synthesis a preprocessor that converts graphemes to phonemes is often used. (Check English g2p)
  • Compared to such languages as Chinese, Japanese, or Thai, tokenization is not so important. You can simply divide text into sentences by [.!?] and words by a white space, respectively at the sacrifice of accuracy. (Check nltk tokenize)
  • To identify multi word expressions is not always easy.

French

  • CHAR SET [ A-Za-zçÉéÀàÈèÙùÂâÊêÎîÔôÛûœæ.!?'-0-9]
  • Diacritics on captial letters are often ignored.
  • Mostly two ligatures 'œ' and 'æ' are the same as 'oe' and 'ae', respectively.
  • Hyphen (-) is used before a pronoun in imperative sentences.
    • Donne-les-moi ! "Give them to me!""
  • Clitics with a apostrophe (')
    • E.g., je t'aime "I love you"

German

  • CHAR SET [ A-Za-zÄäÖöÜüẞß.!?'-0-9]
  • Nouns are written in capital letters.
  • No space for compound nouns (Check compound splitter)
    • E.g., Rinderwahnsinn "mad cow syndrome"
  • 'ß' and 'ss' are interchangeable.

Greek

  • CHAR SET [ \p{Greek}.!;'-0-9]
  • β (beta), θ (theta), and χ (chi) are used as phonetic symbols in the IPA.
  • The letter sigma 'Σ' has two different lowercase forms, 'σ' and 'ς'. 'ς' is used in word-final position and 'σ' elsewhere. (Read this)
  • Semicolon (;) is used as a question mark.

Hindi

  • CHAR SET [ \p{Devanagari}0-9|?!]
  • Vertical line (|) is used at the end of a sentence.
  • Indian numbering system is special.
    • E.g., 1,00,00,00,000

Japanese

  • CHAR SET [\p{Hiragana}\p{Katakana}\p{Han}A-Za-z0-90-9。、?!]
  • No space between words
  • Both full- and half-width arabic numbers are used.
  • Note that period, comma, question mark, and exclamation mark are different from English ones.
  • Often people depend on Romanization to input Japanese in the digital setting. Romanization to Japanese conversion is very important. (Check this)
  • A morph analyzer functions as a tokenizer and a grapheme to phoneme converter. (Check MeCab)
  • When は /ha/ is used as a topic marker it is pronounced as /wa/.

Korean

  • CHAR SET [ \p{Hangul}A-Za-z.!?0-9]
  • Consonants and vowels, called 'jamo' in Korean, combine to form a syllable, which has an independent code point.
    • E.g., ㅎ (314E)+ㅏ (314F) +ㄴ(3134) ->한 (D55C)
  • Jamo has two types: Hangul compatibility Jamo and Hangul Jamo.
    • Hangul Compatibility Jamo (U+3130-U+318F)
      • Composes a syllable
      • In computer keyboards
      • The consonants in the onset and the coda are identical.
    • Hangul Jamo (U+1100-U+11FF)
      • Used mostly when representing old Hangul
      • The consonants in the onset and the coda are NOT identical.
      • If you need to decompose Hangul syllables, Hangul Jamo is better than Hangul Compatibility Jamo. (Check this)
  • Orthography is notoriously difficult. For that reason you can't expect any unofficial writing will obey the rules.
  • Grammar checker is hard to make. (But surprisingly there is a decent one. Check this )
  • Like German, many compounds are created by merging two words without a space.
    • E.g., 점심시간 "lunch time" (= 점심 "lunch" + 시간 "time")
  • Hangul is phonetic, but the current orthography policy respects the origin of words rather than reflecting sound itself. As a result, sometimes the real pronunciation of some words is different from its grapheme.
    • E.g., 독립 dok rip (spelling) -> /dong nip/ (pronunciation) "independence"
  • TOOL Python-jamo: Hangul syllable decomposition and synthesis library
  • TOOL KoG2P

Mandarin

  • CHAR SET [\p{Han}。、,!?0-9]
  • There are two types of commas: , and 、. Ideographic comma (、) is used when enumerating items in a list.(e.g. 红色、白色、黄色 "red, white, and yellow").
  • No space between words
  • Pinyin, the standard Romanization system for Mandarin, is used.
  • 5 different tones are marked by diacritics in pinyin.
    • mā (high level)
    • má (rising)
    • mǎ (falling and rising)
    • mà (falling)
    • ma (neutral)
  • There are two types of characters: simplfied and traditional. The former is used in the mainland, wheras the latter is used in Taiwan and Korea.
  • Check this to see the list of characters that are differntly used in Chinese, Japanese, and Korean.
  • Typically people type pinyin to input Chinese characters in the digital setting. The pinyin to Chinese conversion is very important. (Check this)
  • TOOL pypinyin: a python project for getting pinyin for Chinese words or sentence
  • TOOL Jieba: Chinese text segmentation module
  • TOOL hanziconv: tool converts between simplified and traditional Chinese Characters

Persian

  • CHAR SET [ \p{Arabic}.؟!،0-9]
  • Check Arabic
  • When a Zero-Width Non-Joiner (ZWNJ) is used between two characters, it forces a final form on the preceding character. (See this)

Portuguese

  • CHAR SET [ \p{Latin}.?!'-0-9]
  • The hyphen (-) is used to make compound words
    • E.g., levaria + vos + os = levar-vos-ia "I would take to you"

Russian

  • CHAR SET [ \p{Cyrillic}.!?'-0-9]

Spanish

  • CHAR SET [ \p{Latin}.!¡?¿'-0-9]
  • ¿ is used at the beginning of a interrogative sentence, pairing with ?.
  • ¡ is used at the beginning of a exclamatory sentence, paring with !.

Thai

Vietnamese

  • CHAR SET [ \p{Latin}.!?'-0-9]
  • 6 different tones are marked by diacritics.
    • a (mid level)
    • à (low falling)
    • ả (mid falling)
    • ã (glottalized rising)
    • á (high rising)
    • ạ (glottalized falling)
  • Spaces are used to separate syllables, not words.
    • E.g., thuế thu nhập cá nhâ -> thuế "tax" + thu_nhập "income" + cá_nhân "individual"
  • INFO word segmentation tools
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].