Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

Stars: ✭ 151 (+151.67%)

Mutual labels: word-segmentation

vietTTS

Vietnamese Text to Speech library

Stars: ✭ 78 (+30%)

Mutual labels: vietnamese

tudien

Vietnamese dictionary for Kindle

Stars: ✭ 38 (-36.67%)

Mutual labels: vietnamese

ckipnlp

CKIP CoreNLP Toolkits

Stars: ✭ 92 (+53.33%)

Mutual labels: word-segmentation

sylbreak

Syllable segmentation tool for Myanmar language (Burmese) by Ye.

Stars: ✭ 44 (-26.67%)

Mutual labels: word-segmentation

google assistant vietnamese speaking

This project mods a smart speaker running Google Assistant with multilingual support, including Vietnamese. The source code was rewritten by Nguyễn Duy from Google's original source.

Stars: ✭ 19 (-68.33%)

Mutual labels: vietnamese

Vietnamese-Accent-Prediction

A simple/fast/accurate accent prediction for non-accented Vietnamese text

Stars: ✭ 31 (-48.33%)

Mutual labels: vietnamese

customized-symspell

Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

Stars: ✭ 51 (-15%)

Mutual labels: word-segmentation


UETsegmenter

UETsegmenter is a toolkit for Vietnamese word segmentation. It uses a hybrid approach that combines longest matching with logistic regression.

UETsegmenter is written in Java and developed in the Eclipse IDE.
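
To give a feel for the longest-matching half of the hybrid approach, here is a minimal sketch of greedy, dictionary-based longest matching over syllables. It is an illustration under assumptions, not UETsegmenter's actual implementation: the class `LongestMatchSketch`, the toy dictionary, and the `maxLen` parameter are all hypothetical, and the logistic-regression step is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of greedy longest matching; not the toolkit's code.
final class LongestMatchSketch {

    // At each position, joins the longest run of syllables (at most maxLen)
    // found in the dictionary with '_'. Falls back to a single syllable
    // when no multi-syllable entry matches.
    static String segment(String text, Set<String> dictionary, int maxLen) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] syllables = trimmed.split("\\s+");
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < syllables.length) {
            int matched = 1; // default: the syllable stands alone
            for (int len = Math.min(maxLen, syllables.length - i); len > 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(syllables, i, i + len));
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    matched = len;
                    break;
                }
            }
            words.add(String.join("_", Arrays.copyOfRange(syllables, i, i + matched)));
            i += matched;
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Toy dictionary (lower-cased, space-separated syllables); assumed for the demo.
        Set<String> dict = new HashSet<>(Arrays.asList("tốc độ", "thông tin", "ngày càng", "việt nam"));
        System.out.println(segment("Tốc độ truyền thông tin ngày càng cao .", dict, 3));
    }
}
```

Greedy matching alone always prefers the longest dictionary entry starting at the current position, so its result depends heavily on what the dictionary contains. The sketch does not model how the toolkit combines this step with its classifier.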

Note

UETsegmenter has been carried over into UETnlp, a toolkit for Vietnamese text processing that supports both word segmentation and POS tagging. UETnlp is much easier to use than UETsegmenter.

Overview

src : folder of java source code
uetsegmenter.jar : an executable jar file (see How to use)
models : a pre-trained model for Vietnamese word segmentation
dictionary : necessary dictionaries for word segmentation

How to use

Use the following command to run the toolkit. Your PC needs JDK 1.8 or newer:

java -jar uetsegmenter.jar -r <what_to_execute> {additional arguments}

	-r	:	the method you want to execute (required: seg|train|test)

Additional arguments for each method:

-r seg : Method for word segmentation. Needed arguments:

-m <models_path> -i <input_path> [-ie <input_extension>] -o <output_path> [-oe <output_extension>]

	-m	:	path to the folder of segmenter model (required)
	-i	:	path to the input text (file/folder) (required)
	-ie	:	input extension, only use when input_path is a folder (default: *)
	-o	:	path to the output text (file/folder) (required)
	-oe	:	output extension, only use when output_path is a folder (default: seg)
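
To show how the `seg` arguments fit together, the sketch below assembles the command line in Java so that it could be launched with `ProcessBuilder`. The class name `SegCommandSketch` and the paths `models`, `input.txt`, and `output.txt` are hypothetical placeholders. Only the flags documented above are used, and the sketch assumes `uetsegmenter.jar` sits in the working directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that builds the documented "seg" command line.
final class SegCommandSketch {

    // Builds: java -jar uetsegmenter.jar -r seg -m <models> -i <input> [-ie <ext>] -o <output> [-oe <ext>]
    // Pass null for inputExt / outputExt to omit the optional flags.
    static List<String> buildSegCommand(String modelsPath, String inputPath, String inputExt,
                                        String outputPath, String outputExt) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-jar", "uetsegmenter.jar", "-r", "seg",
                "-m", modelsPath,
                "-i", inputPath));
        if (inputExt != null) {
            cmd.addAll(Arrays.asList("-ie", inputExt));
        }
        cmd.addAll(Arrays.asList("-o", outputPath));
        if (outputExt != null) {
            cmd.addAll(Arrays.asList("-oe", outputExt));
        }
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildSegCommand("models", "input.txt", null, "output.txt", null);
        System.out.println(String.join(" ", cmd));
        // To actually run the segmenter (requires the jar and the model folder):
        // int exitCode = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.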

-r train : Method for training a new model. Needed arguments:

-i <training_data> [-e <file_extension>] -m <models_path>

	-i	:	path to the training data (file/folder) (required)
	-e	:	file extension, only use when training_data is a folder (default: *)
	-m	:	path to the folder you want to save model after training (required)

After training, the models_path folder will contain 2 files: model and features.

-r test : Method for testing a model. Needed arguments:

-m <models_path> -t <test_file>

	-m	:	path to the folder of segmenter model (required)
	-t	:	path to the test file (required)

APIs

Three APIs for Vietnamese word segmentation are provided:

Segment a raw text:

	String modelsPath = "models"; // path to the model folder. This folder must contain two files: model, features
	UETSegmenter segmenter = new UETSegmenter(modelsPath); // construct the segmenter
	String raw_text_1 = "Tốc độ truyền thông tin ngày càng cao.";
	String raw_text_2 = "Tôi yêu Việt Nam!";

	String seg_text_1 = segmenter.segment(raw_text_1); // Tốc_độ truyền thông_tin ngày_càng cao .
	String seg_text_2 = segmenter.segment(raw_text_2); // Tôi yêu Việt_Nam !

	// ... You only need to construct the segmenter one time, then you can segment any number of texts.

Segment a tokenized text:

	// ...
	// ... construct the segmenter

	String tokenized = "Tôi , bạn tôi yêu Việt Nam !";
	String segmented = segmenter.segmentTokenizedText(tokenized); // Tôi , bạn tôi yêu Việt_Nam !

Segment a raw text and return list of segmented sentences:

	// ...
	// ... construct the segmenter

	String text = "Tốc độ truyền thông tin ngày càng cao. Tôi, bạn tôi yêu Việt Nam!";
	List<String> segmented_sents = segmenter.segmentSentences(text);
	// [0] : Tốc_độ truyền thông_tin ngày_càng cao .
	// [1] : Tôi , bạn tôi yêu Việt_Nam !
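
The outputs shown above join the syllables of a multi-syllable word with `_` and separate words with spaces. As a minimal sketch of consuming that format, the hypothetical helper below (`SegmentedOutputSketch.toWords`, not part of the toolkit) turns one segmented sentence into a list of words. It assumes the original text contains no literal underscores, since those would be indistinguishable from the joiner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing helper; not part of UETsegmenter.
final class SegmentedOutputSketch {

    // Splits a segmented sentence on whitespace and restores the spaces
    // inside multi-syllable words, e.g. "Việt_Nam" becomes "Việt Nam".
    static List<String> toWords(String segmented) {
        List<String> words = new ArrayList<>();
        for (String token : segmented.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.replace('_', ' '));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Input copied from the segmented output format shown in the API examples.
        for (String word : toWords("Tôi yêu Việt_Nam !")) {
            System.out.println(word);
        }
    }
}
```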

Citation

If you use the toolkit for academic work, please cite:

@INPROCEEDINGS{7800279, 
	author={T. P. Nguyen and A. C. Le}, 
	booktitle={2016 IEEE RIVF International Conference on Computing Communication Technologies, Research, Innovation, and Vision for the Future (RIVF)}, 
	title={A hybrid approach to Vietnamese word segmentation}, 
	year={2016}, 
	pages={114-119},
	doi={10.1109/RIVF.2016.7800279}, 
	month={Nov},
}

The approach used in the toolkit is also explained in the paper.


phongnt570 / UETsegmenter
