rust-han / Han Segment

License: MIT

A Chinese word segmentation system based on a hidden Markov model and forward maximum matching

Labels

word-segmentation

Projects that are alternatives of or similar to Han Segment

dnn-lstm-word-segment

Chinese Word Segmentation Based on Deep Learning and an LSTM Neural Network

Stars: ✭ 24 (+41.18%)

Mutual labels: word-segmentation

A TensorFlow-based Chinese word segmentation model

Stars: ✭ 25 (+47.06%)

Mutual labels: word-segmentation

Python port of SymSpell

Stars: ✭ 420 (+2370.59%)

Mutual labels: word-segmentation

Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.

Stars: ✭ 69 (+305.88%)

Mutual labels: word-segmentation

youtokentome-ruby

High performance unsupervised text tokenization for Ruby

Stars: ✭ 17 (+0%)

Mutual labels: word-segmentation

Juman++ (a Morphological Analyzer Toolkit)

Stars: ✭ 254 (+1394.12%)

Mutual labels: word-segmentation

Syllable segmentation tool for Myanmar language (Burmese) by Ye.

Stars: ✭ 44 (+158.82%)

Mutual labels: word-segmentation

Thai Natural Language Processing in Python.

Stars: ✭ 582 (+3323.53%)

Mutual labels: word-segmentation

rakutenma-python

Rakuten MA (Python version)

Stars: ✭ 15 (-11.76%)

Mutual labels: word-segmentation

Bert Multitask Learning

BERT for Multitask Learning

Stars: ✭ 380 (+2135.29%)

Mutual labels: word-segmentation

customized-symspell

Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

Stars: ✭ 51 (+200%)

Mutual labels: word-segmentation

A toolkit for Vietnamese word segmentation

Stars: ✭ 60 (+252.94%)

Mutual labels: word-segmentation

A Japanese tokenizer based on recurrent neural networks

Stars: ✭ 260 (+1429.41%)

Mutual labels: word-segmentation

Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

Stars: ✭ 151 (+788.24%)

Mutual labels: word-segmentation

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).

Stars: ✭ 433 (+2447.06%)

Mutual labels: word-segmentation

A toolkit for pre-processing large source code corpora

Stars: ✭ 39 (+129.41%)

Mutual labels: word-segmentation

Hashformers is a framework for hashtag segmentation with transformers.

Stars: ✭ 18 (+5.88%)

Mutual labels: word-segmentation

Unsupervised text tokenizer focused on computational efficiency

Stars: ✭ 728 (+4182.35%)

Mutual labels: word-segmentation

Unsupervised text tokenizer for Neural Network-based text generation.

Stars: ✭ 5,540 (+32488.24%)

Mutual labels: word-segmentation

A Vietnamese natural language processing toolkit (NAACL 2018)

Stars: ✭ 354 (+1982.35%)

Mutual labels: word-segmentation

Chinese Word Segmentation System

:Date: 10/07 2018

.. contents::

Introduction

A Chinese word segmentation system implemented in Rust.

Algorithms

Hidden Markov model (HMM)
Dictionary-based forward maximum matching (MMSEG)
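
The dictionary-based side can be illustrated with a minimal sketch of plain forward maximum matching: at each position, take the longest dictionary word that starts there, and fall back to a single character when nothing matches. This is not the project's actual code, and it omits the additional disambiguation rules of the full MMSEG algorithm; the function name ``forward_max_match``, the ``max_word_len`` parameter, and the tiny dictionaries in the usage note are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Greedy forward maximum matching (illustrative sketch, not the crate's API).
/// `max_word_len` is a hypothetical cap on the candidate length, in characters.
fn forward_max_match(text: &str, dict: &HashSet<&str>, max_word_len: usize) -> Vec<String> {
    // Work on characters, not bytes, so multi-byte CJK text is sliced safely.
    let chars: Vec<char> = text.chars().collect();
    let mut words: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        // Longest candidate we are allowed to try at this position.
        let longest = max_word_len.max(1).min(chars.len() - start);
        // Fallback: emit one character if no dictionary word matches.
        let mut matched = 1;

        // Try candidates from longest to shortest and stop at the first hit.
        for len in (2..=longest).rev() {
            let candidate: String = chars[start..start + len].iter().collect();
            if dict.contains(candidate.as_str()) {
                matched = len;
                break;
            }
        }

        let word: String = chars[start..start + matched].iter().collect();
        words.push(word);
        start += matched;
    }
    words
}
```

With a dictionary containing 汉语, 分词 and 系统, calling ``forward_max_match("汉语分词系统", &dict, 4)`` yields 汉语 / 分词 / 系统. The greedy choice can also go wrong: with 研究, 研究生, 生命 and 起源 in the dictionary, 研究生命起源 is split as 研究生 / 命 / 起源, which is one reason a statistical model such as an HMM is often used alongside dictionary matching.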

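Dictionary sources

The MMSEG Chinese word segmentation dictionary comes from `chenlb/mmseg4j-from-googlecode <https://github.com/chenlb/mmseg4j-from-googlecode>`_.
The model data used by the HMM Chinese word segmentation algorithm comes from `yanyiwu/cppjieba <https://github.com/yanyiwu/cppjieba>`_.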
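
To show how HMM model data of this kind is typically used, here is a hedged sketch of Viterbi decoding over the common B/M/E/S character tagging scheme (Begin, Middle, End, Single). It is not taken from this project or from cppjieba: the function names, the ``toy_model`` probabilities, and the ``emit`` callback are all assumptions for illustration, and a real model would supply trained log-probabilities instead.

```rust
// State indices for the B/M/E/S tagging scheme (illustrative convention).
const B: usize = 0; // first character of a multi-character word
const M: usize = 1; // middle character
const E: usize = 2; // last character
const S: usize = 3; // single-character word

/// Toy start and transition log-probabilities. These numbers are made up:
/// every legal move gets ln(0.5), every illegal move gets negative infinity.
fn toy_model() -> ([f64; 4], [[f64; 4]; 4]) {
    let h = 0.5f64.ln();
    let x = f64::NEG_INFINITY;
    let start = [h, x, x, h]; // a sentence starts with B or S
    let trans = [
        [x, h, h, x], // B -> M or E
        [x, h, h, x], // M -> M or E
        [h, x, x, h], // E -> B or S
        [h, x, x, h], // S -> B or S
    ];
    (start, trans)
}

/// Viterbi decoding in log space. `emit(t, s)` returns the log-probability
/// of the t-th character under state s (a hypothetical callback).
fn viterbi(
    n: usize,
    start: &[f64; 4],
    trans: &[[f64; 4]; 4],
    emit: &dyn Fn(usize, usize) -> f64,
) -> Vec<usize> {
    if n == 0 {
        return Vec::new();
    }
    let mut score = vec![[f64::NEG_INFINITY; 4]; n];
    let mut back = vec![[0usize; 4]; n];
    for s in 0..4 {
        score[0][s] = start[s] + emit(0, s);
    }
    for t in 1..n {
        for s in 0..4 {
            for p in 0..4 {
                let cand = score[t - 1][p] + trans[p][s] + emit(t, s);
                if cand > score[t][s] {
                    score[t][s] = cand;
                    back[t][s] = p; // remember the best predecessor
                }
            }
        }
    }
    // A sentence must end on E or S; then follow the back-pointers.
    let mut last = if score[n - 1][E] >= score[n - 1][S] { E } else { S };
    let mut tags = vec![last; n];
    for t in (1..n).rev() {
        last = back[t][last];
        tags[t - 1] = last;
    }
    tags
}

/// Cut the text after every character tagged E or S.
fn tags_to_words(text: &str, tags: &[usize]) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for (ch, &tag) in text.chars().zip(tags) {
        current.push(ch);
        match tag {
            E | S => words.push(std::mem::take(&mut current)), // word ends here
            B | M => {} // word continues
            _ => {}     // unknown tag: ignored in this sketch
        }
    }
    if !current.is_empty() {
        words.push(current); // trailing characters without a closing tag
    }
    words
}
```

Given emission scores that favor the tag sequence B, E, S for a three-character input such as 分词好, ``viterbi`` returns those tags and ``tags_to_words`` cuts the text into 分词 / 好. Working in log space turns products of small probabilities into sums, which avoids floating-point underflow on long sentences.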

Other related projects

`fxsjy/jieba <https://github.com/fxsjy/jieba>`_, Jieba Chinese word segmentation
`chenlb/mmseg4j-from-googlecode <https://github.com/chenlb/mmseg4j-from-googlecode>`_, MMSEG Chinese word segmentation (Java)
`archerhu/scel2mmseg <https://github.com/archerhu/scel2mmseg>`_, a tool that converts Sogou cell lexicons into MMSEG dictionaries
`baidu/lac <https://github.com/baidu/lac>`_, Chinese lexical analysis (LAC)
`baidu/AnyQ <https://github.com/baidu/AnyQ>`_, Baidu's FAQ-based automatic question answering system
`baidu/Senta <https://github.com/baidu/Senta>`_, Baidu's sentiment analysis system

References

`MMSEG <http://technology.chtsai.org/mmseg/>`_, A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm
`State Language Commission Modern Chinese Corpus <http://www.cncorpus.org/index.aspx>`_
`What open Chinese corpora are available on the Internet <https://www.zhihu.com/question/21177095>`_
`Sogou Labs corpus data <https://www.sogou.com/labs/resource/list_yuliao.php>`_

Stars: ✭ 17