SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation)
Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020)
CRF as Stacked Model and DeepCut as Baseline model
Read more:
- Paper: Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble
- Blog: Domain Adaptation กับตัวตัดคำ มันดีย์จริงๆ
- New verision (ACL2021): OSKut (Out-of-domain StacKed cut for Word Segmentation)
Citation
@inproceedings{limkonchotiwat-etal-2020-domain,
title = "Domain Adaptation of {T}hai Word Segmentation Models using Stacked Ensemble",
author = "Limkonchotiwat, Peerat and
Phatthiyaphaibun, Wannaphong and
Sarwar, Raheem and
Chuangsuwanich, Ekapol and
Nutanong, Sarana",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.315",
}
Install
pip install sefr_cut
How To use
Requirements
- python >= 3.6
- python-crfsuite >= 0.9.7
- pyahocorasick == 1.4.0
Example
- Example files are on SEFR Example notebook
- Try it on Colab
Load Engine & Engine Mode
- ws1000, tnhc, and BEST !!
- ws1000: The model trained on Wisesight-1000 and test on Wisesight-160
- tnhc: The model trained on TNHC (80:20 train&test split with random seed 42)
- BEST: The model trained on BEST-2010 Corpus (NECTEC)
sefr_cut.load_model(engine='ws1000') # OR sefr_cut.load_model(engine='tnhc') # OR sefr_cut.load_model(engine='best')
- tl-deepcut-XXXX
- We also provide transfer learning of deepcut on 'Wisesight' as tl-deepcut-ws1000 and 'TNHC' as tl-deepcut-tnhc
sefr_cut.load_model(engine='tl-deepcut-ws1000') # OR sefr_cut.load_model(engine='tl-deepcut-tnhc')
- deepcut
- We also provide the original deepcut
sefr_cut.load_model(engine='deepcut')
Segment Example
You need to read the paper to understand why we have
- Tokenize with default k-value
sefr_cut.load_model(engine='ws1000') print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'])) print(sefr_cut.tokenize(['สวัสดีประเทศไทย'])) print(sefr_cut.tokenize('สวัสดีประเทศไทย')) [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']] [['สวัสดี', 'ประเทศ', 'ไทย']] [['สวัสดี', 'ประเทศ', 'ไทย']]
- Tokenize with a various k-value
sefr_cut.load_model(engine='ws1000') print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=5)) # refine only 5% of character number print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=100)) # refine 100% of character number [['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']] [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
Evaluation
- We also provide Character & Word Evaluation by call function
evaluation()
- For example
answer = 'สวัสดี|ประเทศไทย' pred = 'สวัสดี|ประเทศ|ไทย' char_score,word_score = sefr_cut.evaluation(answer,pred) print(f'Word Score: {word_score} Char Score: {char_score}') Word Score: 0.4 Char Score: 0.8 answer = ['สวัสดี|ประเทศไทย'] pred = ['สวัสดี|ประเทศ|ไทย'] char_score,word_score = sefr_cut.evaluation(answer,pred) print(f'Word Score: {word_score} Char Score: {char_score}') Word Score: 0.4 Char Score: 0.8 answer = [['สวัสดี|'],['ประเทศไทย']] pred = [['สวัสดี|'],['ประเทศ|ไทย']] char_score,word_score = sefr_cut.evaluation(answer,pred) print(f'Word Score: {word_score} Char Score: {char_score}') Word Score: 0.4 Char Score: 0.8
Performance
How to re-train the model?
- You can re-train the model. The example is in the folder Notebooks We provided everything for you!!
Re-train Model
- You can run the notebook file #2, the corpus inside 'Notebooks/corpus/' is Wisesight-1000, you can try with BEST, TNHC, and LST20 !
- Rename variable name:
CRF_model_name
- Link:HERE
Filter and Refine Example
- Set variable name
CRF_model_name
same as file#2 - If you want to know why we use filter-and-refine, you can try to uncomment 3 lines in
score_()
function
#answer = scoring_function(y_true,cp.deepcopy(y_pred),entropy_index_og) #f1_hypothesis.append(eval_function(y_true,answer)) #ax.plot(range(start,K_num,step),f1_hypothesis,c="r",marker='o',label='Best case')
- Link:HERE
Use your trained model?
- Just move your model inside 'Notebooks/model/' to 'seft_cut/model/' and call model in one line.
SEFR_CUT.load_model(engine='my_model')
Thank you many code from
- Deepcut (Baseline Model) : We used some of code from Deepcut to perform transfer learning
- @bact (CRF training code) : We used some from https://github.com/bact/nlp-thai