rust-han / Han Segment

Licence: MIT
A Chinese word segmentation system based on a hidden Markov model (HMM) and forward maximum matching

Projects that are alternatives of or similar to Han Segment

dnn-lstm-word-segment
Chinese Word Segmentation Based on Deep Learning and an LSTM Neural Network
Stars: ✭ 24 (+41.18%)
Mutual labels:  word-segmentation
cws-tensorflow
A Chinese word segmentation model based on TensorFlow
Stars: ✭ 25 (+47.06%)
Mutual labels:  word-segmentation
Symspellpy
Python port of SymSpell
Stars: ✭ 420 (+2370.59%)
Mutual labels:  word-segmentation
hanzi-tools
Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.
Stars: ✭ 69 (+305.88%)
Mutual labels:  word-segmentation
youtokentome-ruby
High performance unsupervised text tokenization for Ruby
Stars: ✭ 17 (+0%)
Mutual labels:  word-segmentation
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (+1394.12%)
Mutual labels:  word-segmentation
sylbreak
Syllable segmentation tool for Myanmar language (Burmese) by Ye.
Stars: ✭ 44 (+158.82%)
Mutual labels:  word-segmentation
Pythainlp
Thai Natural Language Processing in Python.
Stars: ✭ 582 (+3323.53%)
Mutual labels:  word-segmentation
rakutenma-python
Rakuten MA (Python version)
Stars: ✭ 15 (-11.76%)
Mutual labels:  word-segmentation
Bert Multitask Learning
BERT for Multitask Learning
Stars: ✭ 380 (+2135.29%)
Mutual labels:  word-segmentation
customized-symspell
Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm
Stars: ✭ 51 (+200%)
Mutual labels:  word-segmentation
UETsegmenter
A toolkit for Vietnamese word segmentation
Stars: ✭ 60 (+252.94%)
Mutual labels:  word-segmentation
Nagisa
A Japanese tokenizer based on recurrent neural networks
Stars: ✭ 260 (+1429.41%)
Mutual labels:  word-segmentation
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+788.24%)
Mutual labels:  word-segmentation
Ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).
Stars: ✭ 433 (+2447.06%)
Mutual labels:  word-segmentation
codeprep
A toolkit for pre-processing large source code corpora
Stars: ✭ 39 (+129.41%)
Mutual labels:  word-segmentation
hashformers
Hashformers is a framework for hashtag segmentation with transformers.
Stars: ✭ 18 (+5.88%)
Mutual labels:  word-segmentation
Youtokentome
Unsupervised text tokenizer focused on computational efficiency
Stars: ✭ 728 (+4182.35%)
Mutual labels:  word-segmentation
Sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
Stars: ✭ 5,540 (+32488.24%)
Mutual labels:  word-segmentation
Vncorenlp
A Vietnamese natural language processing toolkit (NAACL 2018)
Stars: ✭ 354 (+1982.35%)
Mutual labels:  word-segmentation

Chinese Word Segmentation System

:Date: 10/07 2018

.. contents::

Introduction

A Chinese word segmentation system implemented in Rust.

Algorithms

  1. Hidden Markov model (HMM)
  2. Dictionary-based forward maximum matching (MMSEG)
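
To illustrate the dictionary-based approach listed above, here is a minimal sketch of plain forward maximum matching. It is not this project's implementation, and it omits the extra disambiguation rules of the full MMSEG algorithm. The function name `forward_max_match`, the `max_word_len` parameter, and the tiny dictionary in the usage note are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Forward maximum matching (hypothetical helper, for illustration only).
/// At each position, take the longest dictionary word that starts there;
/// if no word of two or more characters matches, emit a single character.
fn forward_max_match(text: &str, dict: &HashSet<&str>, max_word_len: usize) -> Vec<String> {
    // Work on chars, not bytes, so multi-byte Chinese characters are handled correctly.
    let chars: Vec<char> = text.chars().collect();
    let mut tokens: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        // Longest candidate we may try without running past the end of the text.
        let longest = max_word_len.max(1).min(chars.len() - start);
        let mut matched = 1; // fallback: a single character

        // Try candidates from longest to shortest and stop at the first hit.
        for len in (2..=longest).rev() {
            let candidate: String = chars[start..start + len].iter().collect();
            if dict.contains(candidate.as_str()) {
                matched = len;
                break;
            }
        }

        tokens.push(chars[start..start + matched].iter().collect());
        start += matched;
    }
    tokens
}
```

With a toy dictionary containing 研究, 研究生, 生命 and 起源 and `max_word_len = 3`, the input 研究生命起源 is split into 研究生 / 命 / 起源. The greedy choice of 研究生 shows the kind of ambiguity that pure maximum matching cannot resolve, which is one reason to combine it with a statistical model.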

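Dictionary Sources

  1. The MMSEG Chinese word segmentation dictionary comes from `chenlb/mmseg4j-from-googlecode <https://github.com/chenlb/mmseg4j-from-googlecode>`_.
  2. The model data used by the HMM Chinese word segmentation algorithm comes from `yanyiwu/cppjieba <https://github.com/yanyiwu/cppjieba>`_.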
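
For readers unfamiliar with how HMM model data is used in segmentation, the following sketch shows Viterbi decoding over the common B/M/E/S character tags (begin, middle, end, single). It is an assumption-laden illustration, not this project's code: the function names `viterbi` and `cut` are hypothetical, the emission scores are passed in as a ready-made table of log-probabilities per character, and no real model file format is parsed.

```rust
// Tag indices: B = begins a word, M = middle of a word,
// E = ends a word, S = single-character word.
const B: usize = 0;
#[allow(dead_code)]
const M: usize = 1;
const E: usize = 2;
const S: usize = 3;
const NEG_INF: f64 = f64::NEG_INFINITY;

/// Viterbi decoding in log space (hypothetical helper).
/// `start[s]`    : log-probability of starting in tag `s`
/// `trans[p][s]` : log-probability of moving from tag `p` to tag `s`
/// `emit[t][s]`  : log-probability of the t-th character under tag `s`
fn viterbi(start: &[f64; 4], trans: &[[f64; 4]; 4], emit: &[[f64; 4]]) -> Vec<usize> {
    let n = emit.len();
    if n == 0 {
        return Vec::new();
    }
    let mut score = vec![[NEG_INF; 4]; n];
    let mut back = vec![[0usize; 4]; n];

    for s in 0..4 {
        score[0][s] = start[s] + emit[0][s];
    }
    for t in 1..n {
        for s in 0..4 {
            for p in 0..4 {
                let cand = score[t - 1][p] + trans[p][s] + emit[t][s];
                if cand > score[t][s] {
                    score[t][s] = cand;
                    back[t][s] = p;
                }
            }
        }
    }

    // A sentence must end on a closed word, so only E or S may be the last tag.
    let mut last = if score[n - 1][E] >= score[n - 1][S] { E } else { S };
    let mut tags = vec![last; n];
    for t in (1..n).rev() {
        last = back[t][last];
        tags[t - 1] = last;
    }
    tags
}

/// Turn a tag sequence into words: a word closes after every E or S tag.
fn cut(text: &str, tags: &[usize]) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for (ch, &tag) in text.chars().zip(tags) {
        current.push(ch);
        if tag == E || tag == S {
            words.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        words.push(current); // unterminated trailing word, kept as-is
    }
    words
}
```

In a real system the `start`, `trans` and `emit` values would be loaded from trained model data; any numbers used to try this sketch are made up. Impossible transitions (for example B followed by B) can be encoded as `NEG_INF` so that they are never selected.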

Other Related Projects

  • `fxsjy/jieba <https://github.com/fxsjy/jieba>`_, Jieba Chinese word segmentation
  • `chenlb/mmseg4j-from-googlecode <https://github.com/chenlb/mmseg4j-from-googlecode>`_, MMSEG Chinese word segmentation (Java)
  • `archerhu/scel2mmseg <https://github.com/archerhu/scel2mmseg>`_, a tool that converts Sogou cell dictionaries into MMSEG dictionaries
  • `baidu/lac <https://github.com/baidu/lac>`_, Chinese lexical analysis (LAC)
  • `baidu/AnyQ <https://github.com/baidu/AnyQ>`_, Baidu's FAQ-based automatic question answering system
  • `baidu/Senta <https://github.com/baidu/Senta>`_, Baidu's sentiment analysis system

References

  • `MMSEG <http://technology.chtsai.org/mmseg/>`_, A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm
  • `Modern Chinese Corpus of the State Language Commission (国家语委现代汉语语料库) <http://www.cncorpus.org/index.aspx>`_
  • `Which open Chinese corpora are available on the Internet? (Zhihu question) <https://www.zhihu.com/question/21177095>`_
  • `Sogou Labs: corpus data <https://www.sogou.com/labs/resource/list_yuliao.php>`_