ponrawee / ssg

Licence: Apache-2.0
CRF syllable segmenter for Thai

Programming Languages

python

Projects that are alternatives of or similar to ssg

toSkoy
An app that converts Thai into Skoy language (latest version). (Plain English: a one-way encryption algorithm for the Thai language, which only Thai people could understand)
Stars: ✭ 52 (+160%)
Mutual labels:  thai
odoo-th
Ready to use Odoo with OCA Thai localization modules
Stars: ✭ 29 (+45%)
Mutual labels:  thai
thaigov-corpus
A project that collects news from Thai government websites
Stars: ✭ 19 (-5%)
Mutual labels:  thai
torpleng
The longest Thai song chain in history
Stars: ✭ 39 (+95%)
Mutual labels:  thai
.dev
A collection of coding knowledge in Thai
Stars: ✭ 20 (+0%)
Mutual labels:  thai
Awesome-Thai-Library
A collection of Thai libraries related to "Thailand" and the "Thai language" - Delightful Thai packages and resources
Stars: ✭ 37 (+85%)
Mutual labels:  thai
thai-date
Displays dates in Thai using the same format attributes as the PHP date() and strftime() functions.
Stars: ✭ 14 (-30%)
Mutual labels:  thai
vue-thailand-address-autocomplete
🇹🇭 Autocomplete for addresses in Thailand
Stars: ✭ 49 (+145%)
Mutual labels:  thai
TALPCo
TUFS Asian Language Parallel Corpus
Stars: ✭ 32 (+60%)
Mutual labels:  thai
vue-thailand-address
🇹🇭 Thai address input for Vue.
Stars: ✭ 44 (+120%)
Mutual labels:  thai
thaiaddress
A Python parser for Thai addresses
Stars: ✭ 33 (+65%)
Mutual labels:  thai

CRF syllable segmenter for Thai

ssg is a syllable segmenter for Thai based on Conditional Random Fields (CRFs). It is part of the work of the Natural Language Processing Lab @Chula, carried out under the supervision of Dr. Attapol Thamrongrattanarit.

Installation

foo@bar~$: pip install ssg

Usage

To segment a string into syllables, import syllable_tokenize and call it on Thai text:

from ssg import syllable_tokenize
syllable_tokenize('ทดสอบ') # returns ['ทด', 'สอบ']

ssg also comes with its own CLI, which takes an input path and an output path:

foo@bar~$: ssg-cli PATH_TO_INPUT PATH_TO_OUTPUT
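
For callers who would rather stay inside Python, the same file-to-file workflow can be sketched as below. This is not the implementation of ssg-cli, whose input and output formats are not described here. The helper name segment_file, the one-text-per-line layout, and the "|" separator are all assumptions made for illustration. The tokenizer is passed in as an argument so the sketch runs without ssg installed.

```python
from pathlib import Path
from typing import Callable, List


def segment_file(in_path, out_path,
                 tokenize: Callable[[str], List[str]],
                 sep: str = "|") -> int:
    """Segment each line of in_path and write the result to out_path.

    Hypothetical helper: assumes one text per line and joins the
    syllables of each line with `sep`. Returns the number of lines read.
    """
    lines = Path(in_path).read_text(encoding="utf-8").splitlines()
    # Blank lines are passed through unchanged rather than tokenized.
    segmented = [sep.join(tokenize(line)) if line.strip() else ""
                 for line in lines]
    Path(out_path).write_text("".join(s + "\n" for s in segmented),
                              encoding="utf-8")
    return len(lines)
```

With ssg installed, the tokenizer from the Usage section can be passed in directly, for example segment_file('input.txt', 'output.txt', syllable_tokenize), where both file names are placeholders.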

Model

The model itself is stored in ssg/artifacts/crf3_mix.crfsuite2.

Data

The model was trained with python-crfsuite on a 5,600,000-character, human-annotated subcorpus of the Thai National Corpus.

Parameters

  • L1 penalty: 1.0
  • L2 penalty: 1e-3
  • Transitions that are possible but not observed in the training data are included (CRFsuite's feature.possible_transitions is set to True)
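
As a rough sketch, the parameters above could be passed to python-crfsuite as follows. It assumes the L1 and L2 penalties correspond to CRFsuite's c1 and c2 coefficients. The names CRF_PARAMS and make_trainer are hypothetical, and the project's actual training script may differ.

```python
# Assumed mapping onto CRFsuite parameter names:
#   c1 = L1 penalty, c2 = L2 penalty.
CRF_PARAMS = {
    "c1": 1.0,
    "c2": 1e-3,
    # Also generate transition features for label pairs
    # that never occur in the training data.
    "feature.possible_transitions": True,
}


def make_trainer(params=CRF_PARAMS):
    """Return a pycrfsuite.Trainer configured with `params`.

    Hypothetical helper; requires python-crfsuite to be installed.
    """
    import pycrfsuite  # imported lazily so the dict is usable without it

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.set_params(params)
    return trainer
```

Training would then consist of calling trainer.append(xseq, yseq) for each annotated sequence, followed by trainer.train(path) with an output path of your choice.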

Features

  • Sliding window features: all possible character (N-1)-grams within a radius of N characters on either side of a potential boundary
  • Individual character features: each of the characters surrounding a potential boundary, within the window of size N
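
One possible reading of these two feature types is sketched below. It is an illustration, not the project's feature extractor. The function name boundary_features, the feature-key format, the "_" padding symbol, and the interpretation of the window as N characters on each side of the boundary are all assumptions.

```python
def boundary_features(text, i, n=3):
    """Features for the potential boundary between text[i-1] and text[i].

    Hypothetical sketch: the window covers n characters on each side of
    the boundary, padded with "_" where it runs past the string.
    """
    pad = "_"  # assumed padding symbol, not taken from ssg
    padded = pad * n + text + pad * n
    centre = i + n  # position of text[i] inside the padded string
    window = padded[centre - n:centre + n]

    feats = {}
    # Individual character features, keyed by offset from the boundary
    # (-n..-1 are to the left, 0..n-1 are to the right).
    for k, ch in enumerate(window):
        feats[f"char[{k - n}]={ch}"] = 1.0

    # Sliding window features: every (n-1)-gram inside the window,
    # keyed by the offset of its first character.
    size = n - 1
    if size >= 1:
        for start in range(len(window) - size + 1):
            gram = window[start:start + size]
            feats[f"gram[{start - n}]={gram}"] = 1.0
    return feats
```

Calling boundary_features('ทดสอบ', 2) would describe the candidate boundary between 'ทด' and 'สอบ'. A dict of this shape, one per character position, is the kind of item python-crfsuite accepts in a feature sequence.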

Performance

--- to be updated ---