supercoderhawk / Dnn_cws
Chinese word segmentation implemented with deep learning
Stars: ✭ 58
Programming Languages
python
139335 projects - #7 most used programming language
Projects that are alternatives of or similar to Dnn cws
Nlpcc Wordseg Weibo
Project for the NLPCC 2016 Weibo word segmentation shared task
Stars: ✭ 120 (+106.9%)
Mutual labels: chinese-word-segmentation
Nlp4han
A Chinese NLP toolkit: sentence splitting / word segmentation / POS tagging / chunking / syntactic parsing / semantic analysis / NER / n-gram language models / HMM / pronoun resolution / sentiment analysis / spell checking
Stars: ✭ 206 (+255.17%)
Mutual labels: chinese-word-segmentation
Friso
A high-performance Chinese tokenizer based on the MMSEG algorithm, written in ANSI C, with support for both the GBK and UTF-8 charsets. Its fully modular implementation can be easily embedded in other programs such as MySQL, PostgreSQL, and PHP.
Stars: ✭ 313 (+439.66%)
Mutual labels: chinese-word-segmentation
NLPIR-ICTCLAS
The Java Package of NLPIR-ICTCLAS.
Stars: ✭ 16 (-72.41%)
Mutual labels: chinese-word-segmentation
Chinesenlp
Datasets and SOTA results for every field of Chinese NLP
Stars: ✭ 1,206 (+1979.31%)
Mutual labels: chinese-word-segmentation
Jcseg
Jcseg is a lightweight NLP framework developed in Java. It provides CJK and English segmentation based on the MMSEG algorithm, along with keyword, key-sentence, and summary extraction based on the TEXTRANK algorithm. Jcseg has a built-in HTTP server and search modules for recent versions of Lucene, Solr, and Elasticsearch.
Stars: ✭ 754 (+1200%)
Mutual labels: chinese-word-segmentation
Monpa
MONPA is a multi-task model providing Traditional Chinese word segmentation, POS tagging, and named entity recognition
Stars: ✭ 203 (+250%)
Mutual labels: chinese-word-segmentation
nlpir-analysis-cn-ictclas
Lucene/Solr analyzer plugin. Supports macOS, Linux x86/64, and Windows x86/64. It is a Maven project, which lets you change the Lucene/Solr version for compatibility.
Stars: ✭ 71 (+22.41%)
Mutual labels: chinese-word-segmentation
G2pc
g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese
Stars: ✭ 155 (+167.24%)
Mutual labels: chinese-word-segmentation
Pyhanlp
Chinese word segmentation, POS tagging, named entity recognition, dependency parsing, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional conversion, and other natural language processing tasks
Stars: ✭ 2,564 (+4320.69%)
Mutual labels: chinese-word-segmentation
Cross-Domain-CWS
Code for IJCAI 2018 paper "Neural Networks Incorporating Unlabeled and Partially-labeled Data for Cross-domain Chinese Word Segmentation"
Stars: ✭ 14 (-75.86%)
Mutual labels: chinese-word-segmentation
Symspell
SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm
Stars: ✭ 1,976 (+3306.9%)
Mutual labels: chinese-word-segmentation
Greedycws
Source code for an ACL2017 paper on Chinese word segmentation
Stars: ✭ 88 (+51.72%)
Mutual labels: chinese-word-segmentation
Jieba Rs
The Jieba Chinese Word Segmentation Implemented in Rust
Stars: ✭ 219 (+277.59%)
Mutual labels: chinese-word-segmentation
Pkuseg Python
The pkuseg toolkit for multi-domain Chinese word segmentation
Stars: ✭ 5,692 (+9713.79%)
Mutual labels: chinese-word-segmentation
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-70.69%)
Mutual labels: chinese-word-segmentation
Chinese Word Segmentation Based on Deep Learning
Deep-learning-based Chinese word segmentation implemented with TensorFlow.
This project is written in Python 3; there are no plans to support Python 2.
Note: this project was created mainly for research on Chinese word segmentation and related natural language processing tasks. It is not yet recommended for production use, and it is still under active development.
Usage
Setup
- Install TensorFlow: pip install tensorflow
- Clone this project locally.
- Run init.py to generate the training data.
Getting Started
Create a file in the project folder, add the following code, and run it:
from seg_dnn import SegDNN
import constant
cws = SegDNN(constant.VOCAB_SIZE, 50, constant.DNN_SKIP_WINDOW)
print(cws.seg('我爱北京天安门')[0])
See test.py for a detailed example.
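Segmenters of this kind typically follow the standard character-tagging formulation: each character is labeled B/M/E/S (begin/middle/end of a multi-character word, or single-character word), and the label sequence is decoded into words. A minimal sketch of that decoding step (a generic illustration of the tagging scheme, not this repo's internal API):

```python
def decode_bmes(chars, tags):
    """Merge a BMES tag sequence over characters into a word list.

    B = begin, M = middle, E = end of a multi-char word; S = single-char word.
    """
    words, buf = [], ''
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ('E', 'S'):  # a word boundary has been reached
            words.append(buf)
            buf = ''
    if buf:  # tolerate a sequence that ends mid-word
        words.append(buf)
    return words

print(decode_bmes('我爱北京天安门', list('SSBEBME')))
# → ['我', '爱', '北京', '天安门']
```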
Source File Descriptions
- seg_dnn.py: Chinese word segmentation with a (perceptron-style) feed-forward neural network, corresponding to paper 1
- seg_lstm.py: Chinese word segmentation with an LSTM network, corresponding to paper 2
- seg_mmtnn.py: Chinese word segmentation with an MMTNN network, corresponding to paper 3
- prepare_data.py: preprocesses the corpora, including MSR and PKU
- init.py: script that generates the training and test data
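The window-based (perceptron-style) model classifies each character from a fixed-size window of surrounding characters, padding at the sentence boundaries. A hedged sketch of that feature-extraction idea (the function name and padding token are illustrative and not taken from this repo):

```python
def context_windows(sentence, window=3, pad='<PAD>'):
    """For each character, return the characters in a centered window.

    window is the total window size (odd); sentence boundaries are padded.
    """
    half = window // 2
    padded = [pad] * half + list(sentence) + [pad] * half
    return [padded[i:i + window] for i in range(len(sentence))]

for w in context_windows('北京', window=3):
    print(w)
# prints ['<PAD>', '北', '京'] then ['北', '京', '<PAD>']
```

Each window would then be mapped to character embeddings and fed to the classifier that predicts the character's B/M/E/S tag.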
References:
- Deep Learning for Chinese Word Segmentation and POS Tagging (fully implemented; see seg_dnn.py)
- Long Short-Term Memory Neural Networks for Chinese Word Segmentation (basically implemented and being improved; see seg_lstm.py)
- Max-Margin Tensor Neural Network for Chinese Word Segmentation (implementation in progress; see seg_mmtnn.py)
Todo List
- [ ] pip packaging support
- [ ] More detailed code comments
- [ ] Part-of-speech tagging