
luozhouyang / deepseg

Licence: Apache-2.0 license
Chinese word segmentation in tensorflow 2.x

Programming Languages

python

Projects that are alternatives of or similar to deepseg

ChineseNER
All about Chinese NER
Stars: ✭ 241 (+947.83%)
Mutual labels:  crf, bilstm-crf, bert-bilstm-crf
Slot filling and intent detection of slu
slot filling, intent detection, joint training, ATIS & SNIPS datasets, the Facebook’s multilingual dataset, MIT corpus, E-commerce Shopping Assistant (ECSA) dataset, CoNLL2003 NER, ELMo, BERT, XLNet
Stars: ✭ 298 (+1195.65%)
Mutual labels:  crf, sequence-labeling
Hscrf Pytorch
ACL 2018: Hybrid semi-Markov CRF for Neural Sequence Labeling (http://aclweb.org/anthology/P18-2038)
Stars: ✭ 284 (+1134.78%)
Mutual labels:  crf, sequence-labeling
Lstm Crf Pytorch
LSTM-CRF in PyTorch
Stars: ✭ 364 (+1482.61%)
Mutual labels:  crf, sequence-labeling
BiLSTM-CRF-NER-PyTorch
This repo contains a PyTorch implementation of a BiLSTM-CRF model for named entity recognition task.
Stars: ✭ 109 (+373.91%)
Mutual labels:  crf, bilstm-crf
Ner Pytorch
LSTM+CRF NER
Stars: ✭ 260 (+1030.43%)
Mutual labels:  crf, sequence-labeling
Bert Bilstm Crf Ner
Tensorflow solution of NER task Using BiLSTM-CRF model with Google BERT Fine-tuning And private Server services
Stars: ✭ 3,838 (+16586.96%)
Mutual labels:  crf, bert-bilstm-crf
Sltk
Sequence labeling toolkit implementing the BLSTM-CNN-CRF model in PyTorch; F1 of 91.10% on the CoNLL 2003 English NER test set (word and char features).
Stars: ✭ 338 (+1369.57%)
Mutual labels:  crf, sequence-labeling
Named entity recognition
Chinese named entity recognition (with concrete implementations of several models: HMM, CRF, BiLSTM, BiLSTM+CRF)
Stars: ✭ 995 (+4226.09%)
Mutual labels:  crf, sequence-labeling
Ntagger
reference pytorch code for named entity tagging
Stars: ✭ 58 (+152.17%)
Mutual labels:  crf, sequence-labeling
Ncrfpp
NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.
Stars: ✭ 1,767 (+7582.61%)
Mutual labels:  crf, sequence-labeling
A Pytorch Tutorial To Sequence Labeling
Empower Sequence Labeling with Task-Aware Neural Language Model | a PyTorch Tutorial to Sequence Labeling
Stars: ✭ 257 (+1017.39%)
Mutual labels:  crf, sequence-labeling
BERT-BiLSTM-CRF
Keras implementation of BERT-BiLSTM-CRF
Stars: ✭ 40 (+73.91%)
Mutual labels:  sequence-labeling, bilstm-crf
Rnnsharp
RNNSharp is a toolkit of deep recurrent neural network which is widely used for many different kinds of tasks, such as sequence labeling, sequence-to-sequence and so on. It's written by C# language and based on .NET framework 4.6 or above versions. RNNSharp supports many different types of networks, such as forward and bi-directional network, sequence-to-sequence network, and different types of layers, such as LSTM, Softmax, sampled Softmax and others.
Stars: ✭ 277 (+1104.35%)
Mutual labels:  crf, sequence-labeling
Lm Lstm Crf
Empower Sequence Labeling with Task-Aware Language Model
Stars: ✭ 778 (+3282.61%)
Mutual labels:  crf, sequence-labeling
Pytorch ner bilstm cnn crf
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, implemented in PyTorch
Stars: ✭ 249 (+982.61%)
Mutual labels:  crf, sequence-labeling
xinlp
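Java implementations of the algorithms from the later chapters of Li Hang's "Statistical Learning Methods": the EM algorithm for the boxes-and-balls problem, extended to GMM training; then HMM word segmentation (including parameter training) and CRF word segmentation (using a parameter model trained with CRF++); finally BiLSTM+CRF implemented in TensorFlow, plus a XinAnalyzer wrapper for Lucene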
Stars: ✭ 21 (-8.7%)
Mutual labels:  crf, bilstm-crf
ImcSegmentationPipeline
A pixel classification based multiplexed image segmentation pipeline
Stars: ✭ 62 (+169.57%)
Mutual labels:  segmentation
airs
Road Segmentation in Satellite Aerial Images
Stars: ✭ 51 (+121.74%)
Mutual labels:  segmentation
eta
ETA: Extensible Toolkit for Analytics
Stars: ✭ 22 (-4.35%)
Mutual labels:  segmentation

deepseg

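A neural network word segmentation model implemented in TensorFlow 2.x. One-command training and one-command deployment!

For the TensorFlow 1.x implementation, switch to the tf1 branch.

Two libraries used by this project are recommended:

Development environment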

conda create -n deepseg python=3.6
conda activate deepseg 
pip install -r requirements.txt

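Dataset download

Training the model

You can train your model with the deepseg/run_deepseg.py script. The following arguments are required:

  • --model: the model to train; one of bilstm-crf, bigru-crf, bert-crf, albert-crf, bert-bilstm-crf, albert-bilstm-crf
    • For bert-based or albert-based models, also specify the path to the pretrained model with the --pretrained_model_dir argument.
  • --model_dir: directory where the model is saved
  • --vocab_file: path to the vocabulary file. Note that this is a character-level vocabulary; see testdata/vocab_small.txt
  • --train_input_files: training files, i.e. text files that are already segmented into words; see testdata/train_small.txt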
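
To make the relationship between a word-segmented training file and a character-level vocabulary concrete, here is a minimal sketch of how one segmented line can be turned into per-character labels. It assumes words are separated by whitespace and uses the common B/M/E/S tagging scheme; neither is confirmed by this README, and `to_char_labels` is a hypothetical helper, not part of deepseg.

```python
def to_char_labels(segmented_line):
    """Convert a whitespace-segmented line into (characters, labels).

    Assumption: B/M/E/S tagging (Begin, Middle, End, Single), which is
    a common scheme for word segmentation but may differ from the one
    deepseg's LabelMapper actually uses.
    """
    tokens, labels = [], []
    for word in segmented_line.split():
        chars = list(word)
        if len(chars) == 1:
            # A one-character word is tagged as Single.
            labels.append("S")
        else:
            # Longer words: Begin, zero or more Middle, then End.
            labels.extend(["B"] + ["M"] * (len(chars) - 2) + ["E"])
        tokens.extend(chars)
    return tokens, labels


if __name__ == "__main__":
    chars, tags = to_char_labels("我 喜欢 北京大学")
    print(chars)  # ['我', '喜', '欢', '北', '京', '大', '学']
    print(tags)   # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
```

Under this scheme the model only ever sees single characters as input, which is why the vocabulary is character-level even though the training text is segmented into words.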

For the bilstm-crf and bigru-crf models, the following arguments are also required:

  • --vocab_size: size of the vocabulary
  • --embedding_size: dimension of the embedding layer
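
One way to obtain a value for --vocab_size is to count the entries in the vocabulary file. The sketch below assumes the file holds one token per line and that the library adds no extra special tokens on top of it; both are assumptions, and `count_vocab` is a hypothetical helper rather than a deepseg function.

```python
def count_vocab(vocab_path):
    """Count non-empty lines in a vocabulary file.

    Assumption: one token per line. Only completely empty lines are
    skipped, so a line holding a single space still counts as a token.
    """
    with open(vocab_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.rstrip("\n"))


if __name__ == "__main__":
    # Path taken from the README; adjust to your own vocabulary file.
    print(count_vocab("testdata/vocab_small.txt"))
```

If the model reserves additional ids (for padding or unknown characters, for example), the value passed to --vocab_size would need to account for them.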

An example using the bert-crf model:

python -m deepseg.run_deepseg \
    --model=bert-crf \
    --model_dir=models/bert-crf-model \
    --pretrained_model_dir=/home/zhouyang.lzy/pretrain-models/chinese_roberta_wwm_ext_L-12_H-768_A-12 \
    --train_input_files=testdata/train_small.txt \
    --vocab_file=/home/zhouyang.lzy/pretrain-models/chinese_roberta_wwm_ext_L-12_H-768_A-12/vocab.txt \
    --epochs=2 

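What? You think my training script is terrible and want to write the training process yourself?

That's totally fine!

Writing your own training script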

import os

import tensorflow as tf

from deepseg.dataset import DatasetBuilder, LabelMapper, TokenMapper
from deepseg.models import BiGRUCRFModel, BiLSTMCRFModel
from deepseg.models import AlbertBiLSTMCRFModel, AlbertCRFModel
from deepseg.models import BertBiLSTMCRFModel, BertCRFModel

token_mapper = TokenMapper(vocab_file='testdata/vocab_small.txt')
label_mapper = LabelMapper()

builder = DatasetBuilder(token_mapper, label_mapper)
train_dataset = builder.build_train_dataset('testdata/train_small.txt', batch_size=20, buffer_size=100)
valid_dataset = None

model_dir = 'model/bilstm-crf'
tensorboard_logdir = os.path.join(model_dir, 'logs')
saved_model_dir = os.path.join(model_dir, 'export', '{epoch}')

# Swap in whichever model you want, or simply build any model you like!
model = BiLSTMCRFModel(100, 128, 3)
model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=10,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(monitor='val_loss' if valid_dataset is not None else 'loss'),
        tf.keras.callbacks.TensorBoard(tensorboard_logdir),
        tf.keras.callbacks.ModelCheckpoint(
            saved_model_dir,
            save_best_only=False,
            save_weights_only=False)
    ]
)

Deploying the model

During the training above, a model in SavedModel format is saved at every epoch; it can be deployed directly with tensorflow-serving.
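
As a rough illustration of what a client call could look like, the sketch below builds a request for the TensorFlow Serving REST predict endpoint (`/v1/models/<name>:predict`) using only the standard library. The host, the model name `deepseg`, port 8501 (the REST port commonly used with the TensorFlow Serving Docker image), and above all the input format (a batch of token-id lists) are assumptions; the real input signature depends on the exported model.

```python
import json
from urllib import request


def build_predict_request(host, model_name, token_ids_batch, port=8501):
    """Build (but do not send) a TensorFlow Serving REST predict request.

    Assumption: the exported model accepts a batch of token-id lists
    under the "instances" key. Inspect the SavedModel signature to
    confirm the actual input names and shapes.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": token_ids_batch}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host, model name, and token ids, for illustration only.
    req = build_predict_request("localhost", "deepseg", [[12, 57, 903]])
    print(req.full_url)
    # To actually send it against a running server:
    # with request.urlopen(req) as resp:
    #     print(json.loads(resp.read()))
```

The training script saves each epoch under export/{epoch}, which yields numerically named subdirectories; this appears to match the versioned layout that tensorflow-serving expects under a model's base path.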

  • TODO: add deployment documentation and client usage documentation