
JayYip / cws-tensorflow

Licence: other
A Chinese word segmentation model based on TensorFlow

Programming Languages

python

Projects that are alternatives of or similar to cws-tensorflow

SynThai
Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning
Stars: ✭ 41 (+64%)
Mutual labels:  word-segmentation
pytorch Joint-Word-Segmentation-and-POS-Tagging
Paper: A Simple and Effective Neural Model for Joint Word Segmentation and POS Tagging
Stars: ✭ 37 (+48%)
Mutual labels:  word-segmentation
customized-symspell
Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm
Stars: ✭ 51 (+104%)
Mutual labels:  word-segmentation
spell
Spelling correction and string segmentation written in Go
Stars: ✭ 24 (-4%)
Mutual labels:  word-segmentation
skt
Sanskrit compound segmentation using seq2seq model
Stars: ✭ 21 (-16%)
Mutual labels:  word-segmentation
codeprep
A toolkit for pre-processing large source code corpora
Stars: ✭ 39 (+56%)
Mutual labels:  word-segmentation
Monpa
MONPA is a multi-task model providing Traditional Chinese word segmentation, part-of-speech tagging, and named entity recognition
Stars: ✭ 203 (+712%)
Mutual labels:  word-segmentation
youtokentome-ruby
High performance unsupervised text tokenization for Ruby
Stars: ✭ 17 (-32%)
Mutual labels:  word-segmentation
sentencepiece-jni
Java JNI wrapper for SentencePiece: unsupervised text tokenizer for Neural Network-based text generation.
Stars: ✭ 26 (+4%)
Mutual labels:  word-segmentation
hanzi-tools
Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.
Stars: ✭ 69 (+176%)
Mutual labels:  word-segmentation
sentencepiece
R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece
Stars: ✭ 22 (-12%)
Mutual labels:  word-segmentation
ckipnlp
CKIP CoreNLP Toolkits
Stars: ✭ 92 (+268%)
Mutual labels:  word-segmentation
dnn-lstm-word-segment
Chinese Word Segmentation Based on Deep Learning and LSTM Neural Networks
Stars: ✭ 24 (-4%)
Mutual labels:  word-segmentation
word tokenize
Vietnamese Word Tokenizer
Stars: ✭ 45 (+80%)
Mutual labels:  word-segmentation
SymSpellCppPy
Fast SymSpell written in C++ and exposed to Python via pybind11
Stars: ✭ 28 (+12%)
Mutual labels:  word-segmentation
esapp
An unsupervised Chinese word segmentation tool.
Stars: ✭ 13 (-48%)
Mutual labels:  word-segmentation
sylbreak
Syllable segmentation tool for Myanmar language (Burmese) by Ye.
Stars: ✭ 44 (+76%)
Mutual labels:  word-segmentation
rakutenma-python
Rakuten MA (Python version)
Stars: ✭ 15 (-40%)
Mutual labels:  word-segmentation
UETsegmenter
A toolkit for Vietnamese word segmentation
Stars: ✭ 60 (+140%)
Mutual labels:  word-segmentation
Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence annotation toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, as well as sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.
Stars: ✭ 151 (+504%)
Mutual labels:  word-segmentation

Chinese Word Segmentation Model in TensorFlow

Note: if you need higher segmentation accuracy, please use https://github.com/JayYip/bert-multitask-learning instead.

Parts of the code are adapted from the TensorFlow Model Zoo.

Requirements:

  • Python 3.5 / Python 2.7
  • TensorFlow r1.4
  • Windows / Ubuntu 16.04
  • hanziconv 0.3.2
  • numpy

Training the model

1. Build the training data

Go into the data directory and run the following command:

DATA_OUTPUT="output_dir"

python build_pku_msr_input.py \
    --num_threads=4 \
    --output_dir=${DATA_OUTPUT}
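
Since the code borrows from the TensorFlow Model Zoo input-pipeline style, one plausible shape for the build step's output is files of serialized tf.train.Example records. The sketch below is only a hypothetical illustration under that assumption; the feature names "chars" and "tags" and the helper write_shard are invented here, not taken from build_pku_msr_input.py.

import tensorflow as tf

# Hypothetical sketch: serialize (char_ids, tag_ids) pairs into one
# TFRecord shard. The real script's on-disk format may differ.
def write_shard(examples, path):
    with tf.python_io.TFRecordWriter(path) as writer:
        for char_ids, tag_ids in examples:
            ex = tf.train.Example(features=tf.train.Features(feature={
                "chars": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=char_ids)),
                "tags": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=tag_ids)),
            }))
            writer.write(ex.SerializeToString())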

2. Character embeddings

2.1 Pre-trained character embeddings

  1. In configuration.py, set self.random_embedding in ModelConfig to False.
  2. Download the Chinese character embedding dataset from Polyglot into the project directory, then run process_chr_embedding.py in the project root:
EMBEDDING_DIR=...
VOCAB_DIR=...

python process_chr_embedding.py \
    --chr_embedding_dir=${EMBEDDING_DIR} \
    --vocab_dir=${VOCAB_DIR}

2.2 Randomly initialized character embeddings

In configuration.py, set self.random_embedding in ModelConfig to True.
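
For illustration, here is a minimal TF r1.4-style sketch of how the two modes (2.1 pre-trained, 2.2 random) could hang off the random_embedding flag; vocab_size, embedding_size, and build_char_embedding are assumed names for this sketch, not the repo's actual identifiers.

import tensorflow as tf

vocab_size, embedding_size = 5000, 64  # assumed sizes for the sketch

def build_char_embedding(random_embedding, pretrained=None):
    if random_embedding:
        # 2.2: random initialization, learned from scratch during training
        return tf.get_variable(
            "chr_embedding", [vocab_size, embedding_size],
            initializer=tf.random_uniform_initializer(-0.1, 0.1))
    # 2.1: start from pre-trained vectors (e.g. the Polyglot embeddings);
    # `pretrained` is a [vocab_size, embedding_size] numpy array
    return tf.get_variable(
        "chr_embedding", initializer=tf.constant(pretrained, dtype=tf.float32))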

3. Train the model

Adjust the model and training parameters in configuration.py as needed, then start training. Any of the following parameters that are not provided fall back to their default values.

TRAIN_INPUT="data/${DATA_OUTPUT}"
MODEL="save_model"

python train.py \
    --input_file_dir=${TRAIN_INPUT} \
    --train_dir=${MODEL} \
    --log_every_n_steps=10
    

Segmenting text with the trained model

Input files must be UTF-8 encoded; only files with the extensions 'txt', 'csv', or 'utf8' are picked up (see the sketch after the command below).

INF_INPUT=...
INF_OUTPUT=...

python inference.py \
    --input_file_dir=${INF_INPUT} \
    --train_dir=${MODEL} \
    --vocab_dir=${VOCAB_DIR} \
    --out_dir=${INF_OUTPUT}
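
For reference, a hypothetical sketch of the extension filter described above; the actual file-selection logic in inference.py may differ.

import os

ALLOWED_SUFFIXES = ('.txt', '.csv', '.utf8')  # per the note above

def list_input_files(input_dir):
    # Keep only files whose suffix marks them as segmentation input
    return [os.path.join(input_dir, name)
            for name in os.listdir(input_dir)
            if name.endswith(ALLOWED_SUFFIXES)]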

How to adapt the algorithm to your needs

The model uses a unidirectional LSTM + CRF, but it is written so that the algorithm can be modified. In the file lstm_based_cws_model.py, the …
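
For orientation, here is a minimal sketch of a unidirectional LSTM + CRF tagging core using the TF r1.4 contrib API; the variable names and sizes are illustrative, not the actual code in lstm_based_cws_model.py.

import tensorflow as tf

num_tags = 4  # e.g. a B/M/E/S tagging scheme for segmentation

def lstm_crf(char_embeddings, seq_lengths, gold_tags):
    # Unidirectional LSTM over the character embeddings
    cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
    outputs, _ = tf.nn.dynamic_rnn(
        cell, char_embeddings, sequence_length=seq_lengths, dtype=tf.float32)
    # Project hidden states to per-tag scores
    logits = tf.layers.dense(outputs, num_tags)
    # CRF: log-likelihood for training, transition matrix for Viterbi decoding
    log_lik, transitions = tf.contrib.crf.crf_log_likelihood(
        logits, gold_tags, seq_lengths)
    loss = tf.reduce_mean(-log_lik)
    return loss, logits, transitions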
