
EricLingRui / Nlp Tools

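😋 This project uses TensorFlow to implement Chinese word segmentation, part-of-speech tagging, and named entity recognition (NER) with a BiLSTM+CRF model.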



NLP-tools

This project uses TensorFlow to implement a character-level sequence labeling model based on BiLSTM+CRF.

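Features:

1. Handles out-of-vocabulary characters (words)

2. HTTP interface

3. Quickly build sequence labeling models for word segmentation, POS tagging, NER, SRL, and similar tasks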
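Treating word segmentation as character-level sequence labeling means each character receives a tag, and words are recovered by grouping the tagged characters. The sketch below illustrates that idea only. The B/M/E/S tag scheme (begin, middle, end, single) and the function name `tags_to_words` are assumptions for illustration and are not taken from this project's code.

```python
def tags_to_words(chars, tags):
    """Group characters into words using B/M/E/S tags (an assumed tag scheme)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:           # flush an unfinished word, if any
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # a new word begins
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # middle of the current word
            current += ch
        else:                     # "E": the current word ends here
            words.append(current + ch)
            current = ""
    if current:                   # tolerate a sequence that ends mid-word
        words.append(current)
    return words


sentence = "人民网1月4日讯"
tags = ["B", "M", "E", "B", "M", "M", "E", "S"]
print(tags_to_words(sentence, tags))
```

POS tagging and NER fit the same pattern if the tag set is extended with a category suffix, so one model architecture can serve several tasks.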

Feedback and criticism are welcome.

Instructions

Environment setup: create a new conda environment

 $ conda env create -f environment.yaml

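Corpus processing

Annotated corpora come in different formats, so each needs its own preprocessing. example/DataPreprocessing.ipynb shows the preprocessing for the People's Daily 2014 corpus. The full corpus is not uploaded to GitHub and only a few samples are included in the corpus directory; it can be found online, and if you cannot find it, email me. A sample of the corpus format: 人民网/nz 1月4日/t 讯/ng 据/p [法国/nsf 国际/n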
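As a rough illustration of what such preprocessing involves, the sketch below splits a line of `word/pos` tokens and expands it into per-character tags. It is not the code from example/DataPreprocessing.ipynb. The function names, the `B-`/`M-`/`E-`/`S-` tag prefixes, and the decision to simply drop the square-bracket markers are all assumptions made for this example.

```python
import re


def parse_line(line):
    """Split a 'word/pos word/pos ...' line into (word, pos) pairs.

    Assumption: square brackets only group compound units, so they are dropped.
    """
    pairs = []
    for token in line.split():
        token = token.lstrip("[")                 # opening bracket of a group
        token = re.sub(r"\]/?\w*$", "", token)    # closing bracket plus optional group tag
        word, sep, pos = token.rpartition("/")
        if sep and word:                          # skip tokens without a 'word/pos' shape
            pairs.append((word, pos))
    return pairs


def to_char_tags(pairs):
    """Expand (word, pos) pairs into characters and per-character tags."""
    chars, tags = [], []
    for word, pos in pairs:
        if len(word) == 1:
            marks = ["S"]
        else:
            marks = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        chars.extend(word)
        tags.extend(f"{mark}-{pos}" for mark in marks)  # hypothetical tag format
    return chars, tags


sample = "人民网/nz 1月4日/t 讯/ng 据/p [法国/nsf 国际/n"
chars, tags = to_char_tags(parse_line(sample))
for ch, tag in zip(chars, tags):
    print(ch, tag)
```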

Preprocessing generates the word2id dictionary and the training data and saves them to data/xx.pkl.
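A word2id dictionary of this kind maps each character to an integer id so that sentences can be fed to an embedding layer. The following is a minimal sketch under stated assumptions: the `<PAD>` and `<UNK>` special tokens, the function names, the layout of the pickled object, and the file name `demo.pkl` are hypothetical and may differ from what the project actually stores.

```python
import os
import pickle
import tempfile
from collections import Counter


def build_word2id(sentences, specials=("<PAD>", "<UNK>")):
    """Assign ids to characters, most frequent first, after the special tokens."""
    counts = Counter(ch for sent in sentences for ch in sent)
    word2id = {tok: i for i, tok in enumerate(specials)}
    for ch, _ in counts.most_common():
        word2id[ch] = len(word2id)
    return word2id


def encode(sentence, word2id):
    """Map characters to ids; unseen characters fall back to <UNK>."""
    unk = word2id["<UNK>"]
    return [word2id.get(ch, unk) for ch in sentence]


sentences = ["人民网", "人民日报"]
word2id = build_word2id(sentences)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump({"word2id": word2id,
                     "train": [encode(s, word2id) for s in sentences]}, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored["train"])
```

Reserving an id for unknown characters is one common way to keep inference working on characters that never appeared in the training data.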

Model training

 $ python train.py 
 [-h] [--dict_path DICT_PATH] [--train_data TRAIN_DATA]
      [--ckpt_path CKPT_PATH] [--embed_size EMBED_SIZE]
      [--hidden_size HIDDEN_SIZE] [--batch_size BATCH_SIZE] 
      [--epoch EPOCH] [--lr LR]
      [--save_path SAVE_PATH]

Training writes checkpoints to SAVE_PATH; CKPT_PATH is used when fine-tuning a model.

Default model hyperparameters

  • Embedding vector size: 256

  • Number of BiLSTM layers: 2

  • Hidden units: 512

  • Batch size: 128

  • Initial learning rate: 1e-3 (should be tuned for each task)
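To show how the flags in the usage text could line up with these defaults, here is a hedged `argparse` sketch. It is not the project's train.py. The mapping of each default to a flag (for example 256 to `--embed_size`), the argument types, and the `None` defaults for paths and epoch count are assumptions, since the README does not state them.

```python
import argparse


def build_parser():
    """Hypothetical reconstruction of the train.py command line."""
    p = argparse.ArgumentParser(prog="train.py")
    # Paths: the README gives no defaults, so None is a placeholder here.
    p.add_argument("--dict_path", default=None, help="word2id dictionary file")
    p.add_argument("--train_data", default=None, help="preprocessed training data")
    p.add_argument("--ckpt_path", default=None, help="checkpoint used for fine-tuning")
    p.add_argument("--save_path", default=None, help="where new checkpoints are written")
    # Hyperparameters: values taken from the defaults listed above.
    p.add_argument("--embed_size", type=int, default=256)
    p.add_argument("--hidden_size", type=int, default=512)
    p.add_argument("--batch_size", type=int, default=128)
    p.add_argument("--lr", type=float, default=1e-3)
    # The default number of epochs is not documented.
    p.add_argument("--epoch", type=int, default=None)
    return p


# Override only the learning rate, as one might when tuning for a new task.
args = build_parser().parse_args(["--lr", "5e-4"])
print(args.embed_size, args.hidden_size, args.batch_size, args.lr)
```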

Model testing

Model test examples are in Modeltest.ipynb.

HTTP interface

A simple web server

 $ python app.py

With the server running, test it locally with the command below (the quoting differs between Linux and Windows):

 $ curl -i -H "Content-Type: application/json" -X POST -d '{"text":"\u5f20\u51cc\u745e\u3002"}' http://localhost:7777/cws
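The same request can be sent from Python, which avoids the shell quoting differences between platforms. This sketch uses only the standard library and mirrors the curl call: a JSON body with a `text` field, posted to `http://localhost:7777/cws`. The helper names are hypothetical, and because the README does not document the response format, the body is returned as raw text.

```python
import json
import urllib.request


def build_cws_request(text, url="http://localhost:7777/cws"):
    """Build the POST request that the curl example sends."""
    body = json.dumps({"text": text}).encode("utf-8")  # non-ASCII becomes \uXXXX escapes
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def segment(text, url="http://localhost:7777/cws"):
    """Send the request and return the raw response body (schema not documented)."""
    with urllib.request.urlopen(build_cws_request(text, url)) as resp:
        return resp.read().decode("utf-8")


# Requires the server started with `python app.py`:
# print(segment("张凌瑞。"))
```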

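Current status

On the People's Daily corpus, word segmentation reaches 97% accuracy and POS tagging reaches 96% accuracy.

In tests of this model trained on hundreds of millions of sentences, with the CWS, POS, and NER tags fused into a single end-to-end label, overall accuracy reaches 96%. The model also handles out-of-vocabulary characters (words) well and captures semantics.

(Some examples are listed in Modeltest.ipynb.)

I have been reading about Google's remarkable BERT lately. A BERT-based sequence labeling training module will be added later so that the model can transfer across domains.

References

The BiLSTM+CRF model in this project follows the paper: http://www.aclweb.org/anthology/N16-1030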
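In a BiLSTM+CRF tagger, the BiLSTM produces a score for every tag at every position, and the CRF layer adds tag-to-tag transition scores; the best tag sequence is then usually found with Viterbi decoding. The pure-Python sketch below illustrates that decoding step only. It is not this project's TensorFlow code, it omits start and end transitions for brevity, and the scores in the example are made up.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][j]   : score of tag j at position t (from the BiLSTM in a real model)
    transitions[i][j] : score of moving from tag i to tag j (learned by the CRF)
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])       # best score of any path ending in each tag
    backptrs = []                    # best previous tag for each position and tag
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for ptr in reversed(backptrs):   # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[best_last]


# Two tags; switching tags is heavily penalised by the transition scores.
emissions = [[2, 0], [0, 1], [0, 2]]
transitions = [[0, -5], [-5, 0]]
print(viterbi(emissions, transitions))
```

Picking the best tag independently at each position would choose tag 0 first and tag 1 afterwards; with the transition penalty, the jointly best path stays on tag 1 throughout. This is the kind of consistency a CRF layer is meant to enforce.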
