phongnt570 / UETsegmenter

Licence: other
A toolkit for Vietnamese word segmentation

UETsegmenter

UETsegmenter is a toolkit for Vietnamese word segmentation. It uses a hybrid approach that combines longest matching with logistic regression.
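
To give a feel for the longest-matching half of that approach, here is a minimal sketch of greedy longest matching over a toy dictionary. It is not UETsegmenter's code and it leaves out the logistic-regression step entirely. The class name, the three-entry dictionary, and the four-syllable limit are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: greedy longest matching over a toy dictionary.
// This is NOT UETsegmenter's implementation and omits the logistic-regression step.
class LongestMatchingSketch {

    // Maximum number of syllables tried for one word (an assumed limit).
    static final int MAX_WORD_SYLLABLES = 4;

    // Joins syllables[from..to) with underscores, e.g. ["thông", "tin"] -> "thông_tin".
    static String join(String[] syllables, int from, int to) {
        return String.join("_", Arrays.copyOfRange(syllables, from, to));
    }

    // At each position, takes the longest dictionary entry that starts there;
    // falls back to a single syllable when nothing matches.
    static List<String> segment(String[] syllables, Set<String> dictionary) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < syllables.length) {
            int end = i + 1;
            int longest = Math.min(MAX_WORD_SYLLABLES, syllables.length - i);
            for (int len = longest; len > 1; len--) {
                String candidate = join(syllables, i, i + len).toLowerCase(Locale.ROOT);
                if (dictionary.contains(candidate)) {
                    end = i + len;
                    break;
                }
            }
            words.add(join(syllables, i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary; the real toolkit ships its own dictionaries.
        Set<String> dictionary = Set.of("tốc_độ", "thông_tin", "ngày_càng");
        String[] syllables = "Tốc độ truyền thông tin ngày càng cao .".split(" ");
        System.out.println(String.join(" ", segment(syllables, dictionary)));
    }
}
```

If the toy dictionary also contained "truyền_thông", the same greedy pass would group "truyền thông" and leave "tin" on its own. Purely dictionary-based matching cannot resolve that kind of ambiguity, which is presumably where a statistical component such as logistic regression helps.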

UETsegmenter is written in Java and was developed in the Eclipse IDE.

Note

UETsegmenter has since been incorporated into UETnlp, a toolkit for Vietnamese text processing that supports both word segmentation and POS tagging. UETnlp is much easier to use than UETsegmenter.

Overview

  • src : folder of java source code

  • uetsegmenter.jar : an executable jar file (see How to use)

  • models : a pre-trained model for Vietnamese word segmentation

  • dictionary : necessary dictionaries for word segmentation

How to use

Run the toolkit with the following command (JDK 1.8 or newer is required):

java -jar uetsegmenter.jar -r <what_to_execute> {additional arguments}

	-r	:	the method you want to execute (required: seg|train|test)

Additional arguments for each method:

  • -r seg : Method for word segmentation. Needed arguments:
-m <models_path> -i <input_path> [-ie <input_extension>] -o <output_path> [-oe <output_extension>]

	-m	:	path to the folder of segmenter model (required)
	-i	:	path to the input text (file/folder) (required)
	-ie	:	input extension, only use when input_path is a folder (default: *)
	-o	:	path to the output text (file/folder) (required)
	-oe	:	output extension, only use when output_path is a folder (default: seg)
  • -r train : Method for training a new model. Needed arguments:
-i <training_data> [-e <file_extension>] -m <models_path>

	-i	:	path to the training data (file/folder) (required)
	-e	:	file extension, only use when training_data is a folder (default: *)
	-m	:	path to the folder you want to save model after training (required)

After training, the models_path folder will contain 2 files: model and features.

  • -r test : Method for testing a model. Needed arguments:
-m <models_path> -t <test_file>

	-m	:	path to the folder of segmenter model (required)
	-t	:	path to the test file (required)
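
If you want to drive the command-line tool from another Java program rather than from a shell, one option is to launch it with ProcessBuilder. The sketch below uses only the -r seg flags documented above. The class name and the file names input.txt and output.txt are placeholders, and it assumes java is on the PATH and uetsegmenter.jar is in the working directory.

```java
import java.io.IOException;
import java.util.List;

// Sketch: invoke the command-line segmenter from Java. The jar location and the
// input/output file names are placeholders chosen for illustration.
class RunSegmenterCli {

    // Builds the argument list for "-r seg" with its three required options.
    static List<String> segCommand(String jar, String models, String input, String output) {
        return List.of("java", "-jar", jar, "-r", "seg", "-m", models, "-i", input, "-o", output);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> command = segCommand("uetsegmenter.jar", "models", "input.txt", "output.txt");
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the tool's console output to this process
                .start();
        System.out.println("exit code: " + process.waitFor());
    }
}
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.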

APIs

Three APIs for Vietnamese word segmentation are provided:

  • Segment a raw text:
	String modelsPath = "models"; // path to the model folder. This folder must contain two files: model, features
	UETSegmenter segmenter = new UETSegmenter(modelsPath); // construct the segmenter
	String raw_text_1 = "Tốc độ truyền thông tin ngày càng cao.";
	String raw_text_2 = "Tôi yêu Việt Nam!";

	String seg_text_1 = segmenter.segment(raw_text_1); // Tốc_độ truyền thông_tin ngày_càng cao .
	String seg_text_2 = segmenter.segment(raw_text_2); // Tôi yêu Việt_Nam !

	// ... You only need to construct the segmenter one time, then you can segment any number of texts.
  • Segment a tokenized text:
	// ...
	// ... construct the segmenter

	String tokenized = "Tôi , bạn tôi yêu Việt Nam !";
	String segmented = segmenter.segmentTokenizedText(tokenized); // Tôi , bạn tôi yêu Việt_Nam !
  • Segment a raw text and return list of segmented sentences:
	// ...
	// ... construct the segmenter

	String text = "Tốc độ truyền thông tin ngày càng cao. Tôi, bạn tôi yêu Việt Nam!";
	List<String> segmented_sents = segmenter.segmentSentences(text);
	// [0] : Tốc_độ truyền thông_tin ngày_càng cao .
	// [1] : Tôi , bạn tôi yêu Việt_Nam !
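
The segmented strings returned by these calls can be split into individual words with a few lines of post-processing. The helper below is a sketch and not part of the toolkit. It assumes the output format shown in the examples above, where spaces separate words and underscores join the syllables of one word. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turn a segmented string into a list of words.
// Assumes spaces separate words and underscores join the syllables of one word,
// as in the example outputs above. Not part of UETsegmenter itself.
class SegmentedTextReader {

    static List<String> words(String segmented) {
        List<String> result = new ArrayList<>();
        for (String token : segmented.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                // Restore the spaces inside multi-syllable words.
                result.add(token.replace('_', ' '));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Output string copied from the first API example.
        String segmented = "Tốc_độ truyền thông_tin ngày_càng cao .";
        for (String word : words(segmented)) {
            System.out.println(word);
        }
    }
}
```

Note that this helper cannot tell a joining underscore from a literal underscore in the input text, so inputs that contain underscores need separate handling.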

Citation

If you use the toolkit for academic work, please cite:

@INPROCEEDINGS{7800279, 
	author={T. P. Nguyen and A. C. Le}, 
	booktitle={2016 IEEE RIVF International Conference on Computing Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)}, 
	title={A hybrid approach to Vietnamese word segmentation}, 
	year={2016}, 
	pages={114-119},
	doi={10.1109/RIVF.2016.7800279}, 
	month={Nov},
}

The approach used in the toolkit is also explained in the paper.
