SLTK - Sequence Labeling Toolkit

A sequence labeling toolkit that implements the BLSTM-CNN-CRF model in PyTorch. It reaches an F1 score of 91.10% on the CoNLL 2003 English NER test set (word and char features).

1. Quick start

1.1 Install dependencies

$ sudo pip3 install -r requirements.txt --upgrade  # for all users
$ pip3 install -r requirements.txt --upgrade --user  # for the current user

1.2 Preprocess and train

$ CUDA_VISIBLE_DEVICES=0 python3 main.py --config ./configs/word.yml -p --train

1.3 Train

If preprocessing has already been completed, you can start training directly:

$ CUDA_VISIBLE_DEVICES=0 python3 main.py --config ./configs/word.yml --train

1.4 Test

$ CUDA_VISIBLE_DEVICES=0 python3 main.py --config ./configs/word.yml --test

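2. Configuration file

Edits to the configuration file must follow YAML syntax.

2.1 Training | development | test data

The data uses the conllu format: columns are separated by tabs or spaces, sentences are separated by blank lines, and the label, if present, is in the last column.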
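As a rough illustration of the format described above, the following sketch groups column-formatted lines into sentences. The function name `read_column_sentences` and the sample tokens are hypothetical and are not taken from the SLTK code base.

```python
from typing import Iterable, List


def read_column_sentences(lines: Iterable[str]) -> List[List[List[str]]]:
    """Group column-formatted lines into sentences of token rows."""
    sentences: List[List[List[str]]] = []
    current: List[List[str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # str.split() with no argument handles both tabs and spaces.
        current.append(line.split())
    if current:
        # The input may not end with a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample: two sentences, label in the last column.
sample = "EU\tNNP\tS-ORG\nrejects\tVBZ\tO\n\nPeter NNP B-PER\nBlackburn NNP E-PER\n"
sentences = read_column_sentences(sample.splitlines())
tags = [[row[-1] for row in sent] for sent in sentences]
```

Each sentence is a list of rows, and each row is the list of column values for one token, so the label is always `row[-1]` when labels are present.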

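Set the path_train, path_dev, and path_test parameters under data_params in the configuration file. If path_dev is empty, a portion of the training set is split off as the development set during training, according to the model_params.dev_size parameter.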
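The fallback split can be pictured with the sketch below. It assumes that dev_size is a fraction between 0 and 1 and that sentences are shuffled before splitting. Neither detail is confirmed by the text above, and `split_train_dev` is a hypothetical helper, not the toolkit's function.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(
    sentences: Sequence[T], dev_size: float, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Hold out a fraction of the training sentences as a development set."""
    # Assumption: dev_size is a fraction, e.g. 0.1 for 10%.
    if not 0.0 < dev_size < 1.0:
        raise ValueError("dev_size is assumed to be a fraction in (0, 1)")
    indices = list(range(len(sentences)))
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_dev = int(len(sentences) * dev_size)
    dev_idx = set(indices[:n_dev])
    train = [s for i, s in enumerate(sentences) if i not in dev_idx]
    dev = [s for i, s in enumerate(sentences) if i in dev_idx]
    return train, dev


train, dev = split_train_dev(list(range(100)), dev_size=0.1)
```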

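2.2 Features

If the training data contains multiple feature columns, set data_params.feature_cols in the configuration file to select which columns to use. data_params.feature_names gives the aliases of those features and must have the same length as data_params.feature_cols.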
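The pairing of column indices with aliases could look like the sketch below. It assumes zero-based column indices, which the text does not state, and `select_features` is a hypothetical name used only for illustration.

```python
from typing import Dict, List, Sequence


def select_features(
    rows: Sequence[Sequence[str]],
    feature_cols: Sequence[int],
    feature_names: Sequence[str],
) -> Dict[str, List[str]]:
    """Pick the configured columns of one sentence and key them by alias."""
    if len(feature_cols) != len(feature_names):
        raise ValueError("feature_cols and feature_names must have equal length")
    # Assumption: column indices are zero-based.
    return {
        name: [row[col] for row in rows]
        for col, name in zip(feature_cols, feature_names)
    }


rows = [["EU", "NNP", "S-ORG"], ["rejects", "VBZ", "O"]]
features = select_features(rows, feature_cols=[0, 1], feature_names=["word", "pos"])
```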

data_params.alphabet_params.min_counts: when building a feature's vocabulary, this parameter filters out feature values whose frequency is below the given threshold.
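A minimal sketch of frequency-based vocabulary filtering, for a single feature, is shown below. The special tokens `<pad>` and `<unk>` and the function name `build_alphabet` are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def build_alphabet(
    tokens: Iterable[str],
    min_count: int,
    specials: Sequence[str] = ("<pad>", "<unk>"),
) -> Dict[str, int]:
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tokens)
    # Reserve the first indices for special tokens (hypothetical choice).
    alphabet = {tok: i for i, tok in enumerate(specials)}
    for tok, freq in counts.most_common():
        if freq >= min_count:
            # Tokens seen fewer than min_count times are dropped.
            alphabet[tok] = len(alphabet)
    return alphabet


alphabet = build_alphabet(["the", "cat", "the", "dog", "the", "cat"], min_count=2)
```

Here "dog" occurs only once, so it is left out of the vocabulary and would be mapped to the unknown-token index at lookup time.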

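model_params.embed_sizes: the embedding dimension of each feature. If pretrained feature vectors are provided, the dimension of the pretrained vectors takes precedence.

model_params.require_grads: whether each feature's embedding table is updated during training.

model_params.use_char: whether to use character-level features.

2.3 Pretrained feature vectors

data_params.path_pretrain: the pretrained feature vectors. The elements of this parameter must follow the same order as data_params.feature_names (null is allowed).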
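The notes in section 3 require pretrained vectors in word2vec text format, meaning the file starts with a "vocabulary size, vector dimension" line. The sketch below adds that header to a GloVe-style text file. `add_word2vec_header` and the sample file names are hypothetical, and the code assumes one token per line followed by space-separated values.

```python
import os
import tempfile
from typing import Tuple


def add_word2vec_header(glove_path: str, out_path: str) -> Tuple[int, int]:
    """Copy a GloVe text file and prepend a 'vocab_size dim' header line."""
    n_words, dim = 0, 0
    # First pass: count the vectors and read the dimension from line one.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if n_words == 0:
                dim = len(line.rstrip("\n").split(" ")) - 1
            n_words += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{n_words} {dim}\n")
        for line in src:
            dst.write(line)
    return n_words, dim


# Demonstration on a tiny made-up file with two 3-dimensional vectors.
with tempfile.TemporaryDirectory() as tmp:
    glove = os.path.join(tmp, "glove.sample.txt")
    w2v = os.path.join(tmp, "glove.sample.w2v.txt")
    with open(glove, "w", encoding="utf-8") as f:
        f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
    shape = add_word2vec_header(glove, w2v)
    with open(w2v, encoding="utf-8") as f:
        header = f.readline().strip()
```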

2.4 Other settings

word_norm: whether to normalize digits in words (digits are only converted to 0);
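One plausible reading of this setting is that every digit character is replaced by "0" while the rest of the word is left alone. The sketch below follows that reading, and `normalize_word` is a hypothetical name.

```python
import re

# Matches a single ASCII digit.
_DIGIT = re.compile(r"[0-9]")


def normalize_word(word: str) -> str:
    """Replace every digit in the word with '0'; other characters are kept."""
    return _DIGIT.sub("0", word)


normalized = [normalize_word(w) for w in ["1996-08-22", "Boeing747", "EU"]]
```

Normalizing digits this way lets different numbers that share a shape, such as dates, map to one vocabulary entry.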

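max_len_limit: the length limit of a batch. During training, the length of a batch is determined by its longest sentence; if that sentence exceeds this limit, the batch length is forced to this value;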
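The rule above amounts to taking the minimum of the longest sentence length and the limit. The sketch below assumes that longer sentences are truncated and that shorter ones are padded with a `pad_id` of 0. Both details, like the name `pad_batch`, are assumptions and not taken from the toolkit.

```python
from typing import List, Sequence


def pad_batch(
    batch: Sequence[Sequence[int]], max_len_limit: int, pad_id: int = 0
) -> List[List[int]]:
    """Pad or truncate every sentence to one common batch length."""
    # Batch length is the longest sentence, capped at max_len_limit.
    batch_len = min(max(len(sent) for sent in batch), max_len_limit)
    padded = []
    for sent in batch:
        kept = list(sent[:batch_len])  # assumption: overflow is truncated
        padded.append(kept + [pad_id] * (batch_len - len(kept)))
    return padded


capped = pad_batch([[1, 2, 3, 4, 5], [6, 7]], max_len_limit=4)
uncapped = pad_batch([[1, 2, 3, 4, 5], [6, 7]], max_len_limit=10)
```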

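all_in_memory: after preprocessing, the data is stored in an HDF5 file. By default the data object stays on disk and items are loaded on demand by index; to speed up data reading, set this value to true (suitable for small datasets).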

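3. Performance

The table below lists performance on the CoNLL 2003 NER test set. Features and parameter settings follow Ma and Hovy (2016).

Table. Experimental results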

Model                  P (%)   R (%)   F1 (%)
Lample et al. (2016)   -       -       90.94
Ma and Hovy (2016)     91.35   91.06   91.21
BGRU                   85.50   85.89   85.69
BLSTM                  88.05   87.19   87.62
BLSTM-CNN              89.21   90.48   89.84
BLSTM-CNN-CRF          91.01   91.19   91.10
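As a consistency check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it for the four SLTK rows. Recomputing from rounded P and R can differ from a published F1 in the last digit, so the cited baselines are left out.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs in percent, copied from the table above.
sltk_rows = {
    "BGRU": (85.50, 85.89),
    "BLSTM": (88.05, 87.19),
    "BLSTM-CNN": (89.21, 90.48),
    "BLSTM-CNN-CRF": (91.01, 91.19),
}
recomputed = {name: round(f1_score(p, r), 2) for name, (p, r) in sltk_rows.items()}
```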

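Notes:

  • CoNLL 2003 corpus download: CoNLL 2003 NER. The tagging scheme must be converted to BIESO.
  • Word vectors download: glove.6B.zip. The vectors must be converted to the word2vec text format, i.e. the first line of the txt file must contain the 'vocabulary size, vector dimension' information.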
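The tag conversion mentioned in the first note can be sketched as follows. The code assumes the input is already in the BIO (IOB2) scheme, where every entity starts with a B- tag. Some copies of the corpus use IOB1 instead and would need an extra conversion step first. `bio_to_bieso` is a hypothetical helper, not part of SLTK.

```python
from typing import List, Sequence


def bio_to_bieso(tags: Sequence[str]) -> List[str]:
    """Convert one sentence of BIO (IOB2) tags to the BIESO scheme."""
    out: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = nxt == f"I-{label}"
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        elif prefix == "I":
            out.append(("I-" if continues else "E-") + label)
        else:
            raise ValueError(f"unexpected tag: {tag}")
    return out


converted = bio_to_bieso(["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"])
```

Single-token entities become S- tags, and the last token of a multi-token entity becomes an E- tag, which gives the model explicit boundary information.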

4. Requirements

  • python3
    • gensim
    • h5py
    • numpy
    • torch==0.4.0
    • pyyaml

5. References

  1. Lample G, Ballesteros M, et al. Neural Architectures for Named Entity Recognition. NAACL, 2016.

  2. Ma X, Hovy E. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. ACL, 2016.

Updating...

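  • clip: gradient clipping for the RNN layer;

  • deterministic: model reproducibility;

  • one-hot encoded character vectors;

  • LSTM-based extraction of character-level features;

  • single-machine multi-GPU training.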