
xiangking / ark-nlp

License: Apache-2.0
A private NLP coding package for quickly implementing SOTA solutions.

Programming Languages

Python
Jupyter Notebook


ark-nlp

ark-nlp mainly collects and reproduces NLP models commonly used in academic research and industry work.

Requirements

  • python 3
  • torch >= 1.0.0, <1.10.0
  • tqdm >= 4.56.0
  • jieba >= 0.42.1
  • transformers >= 3.0.0
  • zhon >= 1.1.5
  • scipy >= 1.2.0
  • scikit-learn >= 0.17.0
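
The version bounds above can be checked against an installed environment. The following is a minimal sketch and not part of ark-nlp: the helper names `parse_version`, `satisfies`, and `check_environment` are hypothetical, and the parser only reads the leading dotted numeric part of a version string, so suffixes such as `+cu111` or `rc1` are ignored.

```python
import re
from importlib import metadata

# Bounds copied from the list above: (inclusive lower, exclusive upper or None).
REQUIREMENTS = {
    "torch": ("1.0.0", "1.10.0"),
    "tqdm": ("4.56.0", None),
    "jieba": ("0.42.1", None),
    "transformers": ("3.0.0", None),
    "zhon": ("1.1.5", None),
    "scipy": ("1.2.0", None),
    "scikit-learn": ("0.17.0", None),
}


def parse_version(version):
    """Turn '1.9.1+cu111' into (1, 9, 1), padding short versions with zeros."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    return parts + (0,) * (3 - len(parts))


def satisfies(installed, lower, upper=None):
    """Return True if lower <= installed < upper, compared numerically."""
    current = parse_version(installed)
    if current < parse_version(lower):
        return False
    return upper is None or current < parse_version(upper)


def check_environment():
    """Map each requirement to True, False, or None (package not installed)."""
    report = {}
    for name, (lower, upper) in REQUIREMENTS.items():
        try:
            report[name] = satisfies(metadata.version(name), lower, upper)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(name, ok)
```

Comparing integer tuples rather than raw strings matters for the torch bound: as strings, "1.9.1" sorts after "1.10.0", which would wrongly reject a torch 1.9 installation.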

Install with pip

pip install --upgrade ark-nlp

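Project structure

ark_nlp            Open-source natural language processing library
ark_nlp.dataset    Wraps data loading, processing, and conversion
ark_nlp.nn         Wraps complete neural network models
ark_nlp.processor  Wraps tokenizers, vocabularies, and graph builders
ark_nlp.factory    Wraps loss functions, optimizers, training, and prediction
ark_nlp.model      Wraps commonly used models by NLP task for easy calling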

Implemented models

Pre-trained models

Model     Reference
BERT      BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ERNIE1.0  ERNIE: Enhanced Representation through Knowledge Integration
NEZHA     NEZHA: Neural Contextualized Representation for Chinese Language Understanding
Roformer  RoFormer: Enhanced Transformer with Rotary Position Embedding

Text Classification

Model             Description
RNN/CNN/GRU/LSTM  Classic text classification architectures such as RNN, CNN, GRU, and LSTM
BERT/ERNIE        Classification with commonly used pre-trained models

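Text Matching

Model               Description
BERT/ERNIE          Matching as classification with commonly used pre-trained models
UnsupervisedSimcse  Unsupervised SimCSE matching algorithm
CoSENT              CoSENT: a sentence embedding scheme more effective than Sentence-BERT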

Named Entity Recognition

Model                          Reference  Source code
CRF BERT
Biaffine BERT
Span BERT
Global Pointer BERT            GlobalPointer: handling nested and non-nested NER in a unified way
Efficient Global Pointer BERT  Efficient GlobalPointer: fewer parameters, better results
W2NER BERT                     Unified Named Entity Recognition as Word-Word Relation Classification  github

Relation Extraction

Model   Reference  Source code
Casrel  A Novel Cascade Binary Tagging Framework for Relational Triple Extraction  github
PRGC    PRGC: Potential Relation and Global Correspondence Based Joint Relational Triple Extraction  github

Applications

Usage examples

For complete code, see the test folder.

  • Text classification

    import torch
    import pandas as pd
    
    from ark_nlp.model.tc.bert import Bert
    from ark_nlp.model.tc.bert import BertConfig
    from ark_nlp.model.tc.bert import Dataset
    from ark_nlp.model.tc.bert import Task
    from ark_nlp.model.tc.bert import get_default_model_optimizer
    from ark_nlp.model.tc.bert import Tokenizer
    
    # Load the datasets
    # train_data_df and dev_data_df are pandas DataFrames that you supply;
    # each must contain the columns "text" and "label"
    # The text column holds the text, the label column holds the class label
    tc_train_dataset = Dataset(train_data_df)
    tc_dev_dataset = Dataset(dev_data_df)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=30)
    
    # Tokenize the text and convert it to IDs
    tc_train_dataset.convert_to_ids(tokenizer)
    tc_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pre-trained model
    config = BertConfig.from_pretrained('nghuyong/ernie-1.0',
                                        num_labels=len(tc_train_dataset.cat2id))
    dl_module = Bert.from_pretrained('nghuyong/ernie-1.0',
                                     config=config)
    
    # Build the task
    num_epoches = 10  # not used below: fit() is called with epochs=5
    batch_size = 32
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, 'ce', cuda_device=0)
    
    # Train
    model.fit(tc_train_dataset,
              tc_dev_dataset,
              lr=2e-5,
              epochs=5,
              batch_size=batch_size)
    
    # Inference
    from ark_nlp.model.tc.bert import Predictor
    
    tc_predictor_instance = Predictor(model.module, tokenizer, tc_train_dataset.cat2id)
    
    # text_to_predict is a placeholder for the string to classify
    tc_predictor_instance.predict_one_sample(text_to_predict)
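
The comments in the example above require `text` and `label` columns in the training data. The following is a minimal sketch of preparing and checking such rows before building the DataFrames. The helpers `missing_columns` and `label_counts`, the sample sentences, and the label names are illustrative assumptions and not part of ark-nlp.

```python
REQUIRED_COLUMNS = ("text", "label")

# Hypothetical sample rows: one dict per row, keyed by column name.
train_records = [
    {"text": "The battery lasts two full days.", "label": "positive"},
    {"text": "The screen cracked within a week.", "label": "negative"},
    {"text": "Delivery was fast and the box was intact.", "label": "positive"},
]


def missing_columns(records, required=REQUIRED_COLUMNS):
    """Return the sorted required column names absent from at least one row."""
    missing = set()
    for record in records:
        missing.update(name for name in required if name not in record)
    return sorted(missing)


def label_counts(records):
    """Count rows per label, which helps spot heavily imbalanced classes."""
    counts = {}
    for record in records:
        counts[record["label"]] = counts.get(record["label"], 0) + 1
    return counts


print(missing_columns(train_records))
print(label_counts(train_records))
```

If the check returns an empty list, `pd.DataFrame(train_records)` produces a frame with the `text` and `label` columns that the example passes to `Dataset`.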
  • Text matching

    import torch
    import pandas as pd
    
    from ark_nlp.model.tm.bert import Bert
    from ark_nlp.model.tm.bert import BertConfig
    from ark_nlp.model.tm.bert import Dataset
    from ark_nlp.model.tm.bert import Task
    from ark_nlp.model.tm.bert import get_default_model_optimizer
    from ark_nlp.model.tm.bert import Tokenizer
    
    # Load the datasets
    # train_data_df and dev_data_df are pandas DataFrames that you supply;
    # each must contain the columns "text_a", "text_b", and "label"
    # The text_a and text_b columns hold the texts, the label column holds the match label
    tm_train_dataset = Dataset(train_data_df)
    tm_dev_dataset = Dataset(dev_data_df)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=30)
    
    # Tokenize the text and convert it to IDs
    tm_train_dataset.convert_to_ids(tokenizer)
    tm_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pre-trained model
    config = BertConfig.from_pretrained('nghuyong/ernie-1.0',
                                        num_labels=len(tm_train_dataset.cat2id))
    dl_module = Bert.from_pretrained('nghuyong/ernie-1.0',
                                     config=config)
    
    # Build the task
    num_epoches = 10  # not used below: fit() is called with epochs=5
    batch_size = 32
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, 'ce', cuda_device=0)
    
    # Train
    model.fit(tm_train_dataset,
              tm_dev_dataset,
              lr=2e-5,
              epochs=5,
              batch_size=batch_size)
    
    # Inference
    from ark_nlp.model.tm.bert import Predictor
    
    tm_predictor_instance = Predictor(model.module, tokenizer, tm_train_dataset.cat2id)
    
    # text_a and text_b are placeholders for the two strings to compare
    tm_predictor_instance.predict_one_sample([text_a, text_b])
  • Named entity recognition

    import torch
    import pandas as pd
    
    from ark_nlp.model.ner.crf_bert import CRFBert
    from ark_nlp.model.ner.crf_bert import CRFBertConfig
    from ark_nlp.model.ner.crf_bert import Dataset
    from ark_nlp.model.ner.crf_bert import Task
    from ark_nlp.model.ner.crf_bert import get_default_model_optimizer
    from ark_nlp.model.ner.crf_bert import Tokenizer
    
    # Load the datasets
    # train_data_df and dev_data_df are pandas DataFrames that you supply;
    # each must contain the columns "text" and "label"
    # The text column holds the text
    # The label column holds a list; each element is a dict organized as follows
    # {'start_idx': position of the entity's first character in the text, 'end_idx': position of the entity's last character in the text, 'type': entity type label, 'entity': entity text}
    ner_train_dataset = Dataset(train_data_df)
    ner_dev_dataset = Dataset(dev_data_df)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=30)
    
    # Tokenize the text and convert it to IDs
    ner_train_dataset.convert_to_ids(tokenizer)
    ner_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pre-trained model
    config = CRFBertConfig.from_pretrained('nghuyong/ernie-1.0',
                                           num_labels=len(ner_train_dataset.cat2id))
    dl_module = CRFBert.from_pretrained('nghuyong/ernie-1.0',
                                        config=config)
    
    # Build the task
    num_epoches = 10  # not used below: fit() is called with epochs=5
    batch_size = 32
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, 'ce', cuda_device=0)
    
    # Train
    model.fit(ner_train_dataset,
              ner_dev_dataset,
              lr=2e-5,
              epochs=5,
              batch_size=batch_size)
    
    # Inference
    from ark_nlp.model.ner.crf_bert import Predictor
    
    ner_predictor_instance = Predictor(model.module, tokenizer, ner_train_dataset.cat2id)
    
    # text_to_extract is a placeholder for the string to extract entities from
    ner_predictor_instance.predict_one_sample(text_to_extract)
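
The label layout described in the comments of the example above can be produced with a small helper. This is a sketch under stated assumptions: the function name `make_entity_label` and the type names `PER` and `ORG` are hypothetical, offsets are assumed to be 0-based, and `end_idx` is taken to be inclusive because the comment defines it as the position of the entity's last character.

```python
def make_entity_label(text, entity, entity_type, search_from=0):
    """Build one label dict for the first occurrence of `entity` in `text`.

    Assumes 0-based character offsets, with `end_idx` pointing at the
    entity's last character (inclusive).
    """
    if not entity:
        raise ValueError("entity must be a non-empty string")
    start_idx = text.find(entity, search_from)
    if start_idx == -1:
        raise ValueError(f"{entity!r} not found in text")
    return {
        "start_idx": start_idx,
        "end_idx": start_idx + len(entity) - 1,
        "type": entity_type,
        "entity": entity,
    }


text = "Tim Cook leads Apple"
labels = [
    make_entity_label(text, "Tim Cook", "PER"),
    make_entity_label(text, "Apple", "ORG"),
]
# One row of the training data: the text plus its list of entity dicts.
row = {"text": text, "label": labels}

for label in labels:
    # Under the inclusive convention, this slice recovers the entity string.
    assert text[label["start_idx"]: label["end_idx"] + 1] == label["entity"]
```

The slice check at the end is a cheap way to catch off-by-one errors in hand-written annotations before training.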
  • Casrel relation extraction

    import torch
    import pandas as pd
    
    from ark_nlp.model.re.casrel_bert import CasRelBert
    from ark_nlp.model.re.casrel_bert import CasRelBertConfig
    from ark_nlp.model.re.casrel_bert import Dataset
    from ark_nlp.model.re.casrel_bert import Task
    from ark_nlp.model.re.casrel_bert import get_default_model_optimizer
    from ark_nlp.model.re.casrel_bert import Tokenizer
    from ark_nlp.factory.loss_function import CasrelLoss
    
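    # Load the datasets
    # train_data_df and dev_data_df are pandas DataFrames that you supply;
    # each must contain the columns "text" and "label"
    # The text column holds the text
    # The label column holds a list; each element is a list organized as follows
    # [head entity, position of the head entity's first character in the text, position of the head entity's last character in the text, relation type, tail entity, position of the tail entity's first character in the text, position of the tail entity's last character in the text]
    re_train_dataset = Dataset(train_data_df)
    re_dev_dataset = Dataset(dev_data_df,
                             categories=re_train_dataset.categories,
                             is_train=False)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=100)
    
    # Tokenize the text and convert it to IDs
    # Note: for Casrel this step does not actually tokenize or convert to IDs;
    # it only attaches the tokenizer to the dataset object
    re_train_dataset.convert_to_ids(tokenizer)
    re_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pre-trained model
    config = CasRelBertConfig.from_pretrained('nghuyong/ernie-1.0',
                                              num_labels=len(re_train_dataset.cat2id))
    dl_module = CasRelBert.from_pretrained('nghuyong/ernie-1.0',
                                           config=config)
    
    # Build the task
    num_epoches = 40  # not used below: fit() is called with epochs=5
    batch_size = 16
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, CasrelLoss(), cuda_device=0)
    
    # Train
    model.fit(re_train_dataset,
              re_dev_dataset,
              lr=2e-5,
              epochs=5,
              batch_size=batch_size)
    
    # Inference
    from ark_nlp.model.re.casrel_bert import Predictor
    
    casrel_re_predictor_instance = Predictor(model.module, tokenizer, re_train_dataset.cat2id)
    
    # text_to_extract is a placeholder for the string to extract relations from
    casrel_re_predictor_instance.predict_one_sample(text_to_extract)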
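
The seven-element relation label described in the comments of the example above can be assembled from a sentence and the two entity strings. This is a sketch, not ark-nlp code: `find_span` and `make_triple` are hypothetical helper names, `born_in` is an invented relation type, offsets are assumed to be 0-based, and end positions are inclusive because the comments define them as the position of the entity's last character.

```python
def find_span(text, mention):
    """Return (start, end) of the first occurrence of `mention`, end inclusive."""
    if not mention:
        raise ValueError("mention must be a non-empty string")
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"{mention!r} not found in text")
    return start, start + len(mention) - 1


def make_triple(text, head, relation, tail):
    """Build [head, head_start, head_end, relation, tail, tail_start, tail_end]."""
    head_start, head_end = find_span(text, head)
    tail_start, tail_end = find_span(text, tail)
    return [head, head_start, head_end, relation, tail, tail_start, tail_end]


text = "Marie Curie was born in Warsaw"
# One row of the training data: the text plus its list of relation triples.
row = {
    "text": text,
    "label": [make_triple(text, "Marie Curie", "born_in", "Warsaw")],
}
print(row["label"][0])
```

Because `find_span` always takes the first occurrence, sentences that mention the same entity string more than once need the offsets set by hand.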
  • PRGC relation extraction

    import torch
    import pandas as pd
    
    from ark_nlp.model.re.prgc_bert import PRGCBert
    from ark_nlp.model.re.prgc_bert import PRGCBertConfig
    from ark_nlp.model.re.prgc_bert import Dataset
    from ark_nlp.model.re.prgc_bert import Task
    from ark_nlp.model.re.prgc_bert import get_default_model_optimizer
    from ark_nlp.model.re.prgc_bert import Tokenizer
    
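    # Load the datasets
    # train_df and dev_df are pandas DataFrames that you supply;
    # each must contain the columns "text" and "label"
    # The text column holds the text
    # The label column holds a list; each element is a list organized as follows
    # [head entity, position of the head entity's first character in the text, position of the head entity's last character in the text, relation type, tail entity, position of the tail entity's first character in the text, position of the tail entity's last character in the text]
    re_train_dataset = Dataset(train_df, is_retain_dataset=True)
    re_dev_dataset = Dataset(dev_df,
                             categories=re_train_dataset.categories,
                             is_train=False)
    
    # Load the tokenizer
    tokenizer = Tokenizer(vocab='nghuyong/ernie-1.0', max_seq_len=100)
    
    # Tokenize the text and convert it to IDs
    re_train_dataset.convert_to_ids(tokenizer)
    re_dev_dataset.convert_to_ids(tokenizer)
    
    # Load the pre-trained model
    config = PRGCBertConfig.from_pretrained('nghuyong/ernie-1.0',
                                            num_labels=len(re_train_dataset.cat2id))
    dl_module = PRGCBert.from_pretrained('nghuyong/ernie-1.0',
                                         config=config)
    
    # Build the task
    num_epoches = 40  # not used below: fit() is called with epochs=5
    batch_size = 16
    optimizer = get_default_model_optimizer(dl_module)
    model = Task(dl_module, optimizer, None, cuda_device=0)
    
    # Train
    model.fit(re_train_dataset,
              re_dev_dataset,
              lr=2e-5,
              epochs=5,
              batch_size=batch_size)
    
    # Inference
    from ark_nlp.model.re.prgc_bert import Predictor
    
    prgc_re_predictor_instance = Predictor(model.module, tokenizer, re_train_dataset.cat2id)
    
    # text_to_extract is a placeholder for the string to extract relations from
    prgc_re_predictor_instance.predict_one_sample(text_to_extract)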

Discussion Group

  • WeChat official account: DataArk
  • WeChat ID: fk95624

Main contributors

  • xiangking
  • Jimme
  • Zrealshadow

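Acknowledgements

This project collects and reproduces NLP models commonly used in academic research and industry work and packages them in an easy-to-call form, so it draws on many open-source implementations found online. If anything here is inappropriate, please get in touch with your corrections. Many thanks to the authors of those open-source implementations.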
