mrpeerat / SEFR_CUT

Licence: MIT license
Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP2020)

SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation)

Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020)
CRF as the stacked model and DeepCut as the baseline model

Read more:

Citation

@inproceedings{limkonchotiwat-etal-2020-domain,
    title = "Domain Adaptation of {T}hai Word Segmentation Models using Stacked Ensemble",
    author = "Limkonchotiwat, Peerat  and
      Phatthiyaphaibun, Wannaphong  and
      Sarwar, Raheem  and
      Chuangsuwanich, Ekapol  and
      Nutanong, Sarana",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.315",
}

Install

pip install sefr_cut

How to use

Requirements

  • python >= 3.6
  • python-crfsuite >= 0.9.7
  • pyahocorasick == 1.4.0
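
Before installing, it can help to see which of these requirements are already present. The following is a minimal sketch, not part of sefr_cut: the helper name installed_versions is hypothetical, and it relies on importlib.metadata, which is in the standard library only from Python 3.8 (newer than the 3.6 minimum listed above). It only reports versions and does not compare them against the constraints.

```python
import sys
from importlib import metadata  # standard library from Python 3.8 onward

# Version constraints copied from the requirements list above.
REQUIREMENTS = {
    "python-crfsuite": ">= 0.9.7",
    "pyahocorasick": "== 1.4.0",
}

def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

print("python", ".".join(map(str, sys.version_info[:3])), "(required: >= 3.6)")
for name, version in installed_versions(REQUIREMENTS).items():
    print(f"{name}: installed={version}, required {REQUIREMENTS[name]}")
```

A value of None means the package was not found in the current environment.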

Example

Load Engine & Engine Mode

  • ws1000, tnhc, and best
    • ws1000: the model trained on Wisesight-1000 and tested on Wisesight-160
    • tnhc: the model trained on TNHC (80:20 train/test split with random seed 42)
    • best: the model trained on the BEST-2010 corpus (NECTEC)
    import sefr_cut
    sefr_cut.load_model(engine='ws1000')
    # OR
    sefr_cut.load_model(engine='tnhc')
    # OR
    sefr_cut.load_model(engine='best')
  • tl-deepcut-XXXX
    • We also provide deepcut models fine-tuned by transfer learning: tl-deepcut-ws1000 (on Wisesight) and tl-deepcut-tnhc (on TNHC)
    sefr_cut.load_model(engine='tl-deepcut-ws1000')
    # OR
    sefr_cut.load_model(engine='tl-deepcut-tnhc')
  • deepcut
    • The original deepcut model is also available
    sefr_cut.load_model(engine='deepcut')

Segment Example

See the paper for the reasoning behind the k value.

  • Tokenize with default k-value
    sefr_cut.load_model(engine='ws1000')
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))
    print(sefr_cut.tokenize('สวัสดีประเทศไทย'))
    
    [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
    [['สวัสดี', 'ประเทศ', 'ไทย']]
    [['สวัสดี', 'ประเทศ', 'ไทย']]
  • Tokenize with different k-values
    sefr_cut.load_model(engine='ws1000')
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=5)) # refine only 5% of the characters
    print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=100)) # refine 100% of the characters
    
    [['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']]
    [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
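
To give a rough feel for what k controls, here is a minimal sketch of the "filter" step only: picking the k% of characters whose boundary prediction is least certain, so that only those are passed on for refinement. This is an illustration under assumptions, not the library's code: the function names, the use of binary entropy as the uncertainty measure, the rounding with ceil, and the probabilities are all made up for the example.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a word-boundary prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_uncertain(probs, k):
    """Return the sorted indices of the k% most uncertain characters.

    probs: hypothetical per-character boundary probabilities from a baseline model.
    k:     percentage of characters to hand over for refinement (0-100).
    """
    n_refine = math.ceil(len(probs) * k / 100)  # rounding rule is an assumption
    ranked = sorted(range(len(probs)),
                    key=lambda i: binary_entropy(probs[i]),
                    reverse=True)
    return sorted(ranked[:n_refine])

# Made-up probabilities for a 10-character string.
probs = [0.99, 0.02, 0.55, 0.01, 0.97, 0.40, 0.03, 0.90, 0.05, 0.50]
print(select_uncertain(probs, k=20))   # → [2, 9]
print(select_uncertain(probs, k=100))  # every index is selected
```

With a small k only the characters nearest to a 50/50 prediction are selected; with k=100 every character is a candidate for refinement, which matches the comments in the example above.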

Evaluation

  • Character-level and word-level evaluation is available through the evaluation() function
    • For example:
    answer = 'สวัสดี|ประเทศไทย'
    pred = 'สวัสดี|ประเทศ|ไทย'
    char_score,word_score = sefr_cut.evaluation(answer,pred)
    print(f'Word Score: {word_score} Char Score: {char_score}')
    
    Word Score: 0.4 Char Score: 0.8
    
    answer = ['สวัสดี|ประเทศไทย']
    pred = ['สวัสดี|ประเทศ|ไทย']
    char_score,word_score = sefr_cut.evaluation(answer,pred)
    print(f'Word Score: {word_score} Char Score: {char_score}')
    
    Word Score: 0.4 Char Score: 0.8
    
    
    answer = [['สวัสดี|'],['ประเทศไทย']]
    pred = [['สวัสดี|'],['ประเทศ|ไทย']]
    char_score,word_score = sefr_cut.evaluation(answer,pred)
    print(f'Word Score: {word_score} Char Score: {char_score}')
    
    Word Score: 0.4 Char Score: 0.8
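
For readers wondering where 0.4 and 0.8 come from, the sketch below computes one common pair of definitions: word-level F1 over exact word spans and character-level F1 over word-start positions. It reproduces the numbers of the first example, but it is an assumption that sefr_cut.evaluation() is defined this way; the function names here are hypothetical and are not part of the library.

```python
def word_spans(segmented, sep="|"):
    """Turn 'ab|cd' into the set of character spans {(0, 2), (2, 4)}."""
    spans, start = set(), 0
    for word in segmented.split(sep):
        if word:  # ignore empty pieces from a leading or trailing separator
            spans.add((start, start + len(word)))
            start += len(word)
    return spans

def f1(gold, pred):
    """F1 between two sets: 2 * |overlap| / (|gold| + |pred|)."""
    if not gold and not pred:
        return 1.0
    return 2 * len(gold & pred) / (len(gold) + len(pred))

def boundary_scores(answer, pred, sep="|"):
    """Return (char_f1, word_f1) for two separator-delimited strings."""
    gold_spans, pred_spans = word_spans(answer, sep), word_spans(pred, sep)
    # Character level: compare only the positions where a word starts.
    char_f1 = f1({s for s, _ in gold_spans}, {s for s, _ in pred_spans})
    # Word level: a word counts only if both of its ends match.
    word_f1 = f1(gold_spans, pred_spans)
    return char_f1, word_f1

char_score, word_score = boundary_scores('สวัสดี|ประเทศไทย', 'สวัสดี|ประเทศ|ไทย')
print(f'Word Score: {word_score} Char Score: {char_score}')
# → Word Score: 0.4 Char Score: 0.8
```

Here the prediction gets one of its three words right against two gold words (2·1/5 = 0.4), while two of its three word starts match the two gold starts (2·2/5 = 0.8).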

Performance

How to re-train the model?

  • You can re-train the model. Examples are in the Notebooks folder, which contains everything you need.

    Re-train Model

    • Run notebook file #2. The corpus inside 'Notebooks/corpus/' is Wisesight-1000; you can also try BEST, TNHC, and LST20.
    • Set the variable CRF_model_name to the name you want for your model
    • Link:HERE

    Filter and Refine Example

    • Set the variable CRF_model_name to the same value as in file #2
    • To see why we use filter-and-refine, try uncommenting these 3 lines in the score_() function:
    #answer = scoring_function(y_true,cp.deepcopy(y_pred),entropy_index_og)
    #f1_hypothesis.append(eval_function(y_true,answer))
    #ax.plot(range(start,K_num,step),f1_hypothesis,c="r",marker='o',label='Best case')
    

    Use your trained model?

    • Move your model from 'Notebooks/model/' to 'sefr_cut/model/', then load it in one line.
    sefr_cut.load_model(engine='my_model')

Thanks for the code from
