All Projects → CLUEbenchmark → Cluepretrainedmodels

CLUEbenchmark / Cluepretrainedmodels

高质量中文预训练模型集合:最先进大模型、最快小模型、相似度专门模型

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Cluepretrainedmodels

Clue
中文语言理解测评基准 Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+391.89%)
Mutual labels:  chinese, dataset, corpus, pretrained-models
Nlp chinese corpus
大规模中文自然语言处理语料 Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+1250.1%)
Mutual labels:  chinese, dataset, corpus, text-classification
Cluedatasetsearch
搜索所有中文NLP数据集,附常用英文NLP数据集
Stars: ✭ 2,112 (+328.4%)
Mutual labels:  chinese, corpus, text-classification
CBLUE
中文医疗信息处理基准CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Stars: ✭ 379 (-23.12%)
Mutual labels:  corpus, chinese
BERT-chinese-text-classification-pytorch
This repo contains a PyTorch implementation of a pretrained BERT model for text classification.
Stars: ✭ 92 (-81.34%)
Mutual labels:  text-classification, chinese
text-classification-cn
中文文本分类实践,基于搜狗新闻语料库,采用传统机器学习方法以及预训练模型等方法
Stars: ✭ 81 (-83.57%)
Mutual labels:  text-classification, corpus
Awesome Pretrained Chinese Nlp Models
Awesome Pretrained Chinese NLP Models,高质量中文预训练模型集合
Stars: ✭ 195 (-60.45%)
Mutual labels:  chinese, pretrained-models
OpenDialog
An Open-Source Package for Chinese Open-domain Conversational Chatbot (中文闲聊对话系统,一键部署微信闲聊机器人)
Stars: ✭ 94 (-80.93%)
Mutual labels:  corpus, chinese
Pytorch-NLU
Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech ta…
Stars: ✭ 151 (-69.37%)
Mutual labels:  text-classification, pretrained-models
Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Stars: ✭ 22 (-95.54%)
Mutual labels:  text-classification, corpus
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (-48.28%)
Mutual labels:  dataset, corpus
AiSpace
AiSpace: Better practices for deep learning model development and deployment For Tensorflow 2.0
Stars: ✭ 28 (-94.32%)
Mutual labels:  chinese, pretrained-models
Cnn Text Classification Tf Chinese
CNN for Chinese Text Classification in Tensorflow
Stars: ✭ 237 (-51.93%)
Mutual labels:  chinese, text-classification
TV4Dialog
No description or website provided.
Stars: ✭ 33 (-93.31%)
Mutual labels:  corpus, chinese
Weibo terminater
Final Weibo Crawler Scrap Anything From Weibo, comments, weibo contents, followers, anything. The Terminator
Stars: ✭ 2,295 (+365.52%)
Mutual labels:  chinese, corpus
CLUEmotionAnalysis2020
CLUE Emotion Analysis Dataset 细粒度情感分析数据集
Stars: ✭ 3 (-99.39%)
Mutual labels:  corpus, chinese
Medical-Names-Corpus
医疗语料库。医疗机构名语料库。药品本位码。
Stars: ✭ 26 (-94.73%)
Mutual labels:  corpus, dataset
Chinese Text Classification
Chinese-Text-Classification,Tensorflow CNN(卷积神经网络)实现的中文文本分类。QQ群:522785813,微信群二维码:http://www.tensorflownews.com/
Stars: ✭ 284 (-42.39%)
Mutual labels:  chinese, text-classification
Text Classification Cnn Rnn
CNN-RNN中文文本分类,基于TensorFlow
Stars: ✭ 3,613 (+632.86%)
Mutual labels:  chinese, text-classification
Species-Names-Corpus
物种名称语料库。植物名,动物名。
Stars: ✭ 23 (-95.33%)
Mutual labels:  corpus, dataset

CLUE Pretrained Models

高质量中文预训练模型集合:

最先进的大模型、速度最快和效果好的小模型、面向相似性或句子对任务优化的专门模型。

更多细节请参考我们的技术报告 https://arxiv.org/pdf/2003.01355

./pics/corpus.png

We introduce the Chinese corpus from CLUE organization, CLUECorpus2020, a large-scale corpus that can be used directly for self-supervised learning such as pretraining of a language model, or language generation. It has 100G raw corpus with 35 billion Chinese characters, which is retrieved from Common Crawl.

To better understand this corpus, we conduct language understanding experiments on both small and large scale, and results show that the models trained on this corpus can achieve excellent performance on Chinese. We release a new Chinese vocabulary(vocab clue) with a size of 8K, which is only one-third of the vocabulary size used in Chinese Bert released by Google. It saves computational cost and memory while works as good as original vocabulary. We also release both large and tiny versions of the pre-trained model on this corpus. The former achieves the state-of-the-art result, and the latter retains most precision while accelerating training and prediction speed for eight times compared to Bert-base.

介绍

本项目是与CLUECorpus2020的姊妹项目,通过使用前者的预训练语料库和新版的词汇表,来做模型的预训练。详细报告见,技术报告

项目亮点:

1.提供了大模型、小模型和语义相似度模型。大模型取得与当前中文上效果最佳的模型一致的效果,在一些任务上效果更好。

2.小模型速度比Bert-base提升8倍左右,与albert_tiny速度一致,但效果更佳;

3.语义相似度模型,用于处理语义相似度或句子对问题,有很大概率比直接用预训练模型效果要好;

4.一期支持6个分类和句子对任务,会支持CLUE benchmark所有任务;

模型下载

模型简称 参数量 存储大小 语料 词汇表 直接下载
RoBERTa-tiny-clue
超小模型
7.5M 28.3M CLUECorpus2020 CLUEVocab TensorFlow
PyTorch (提取码:8qvb)
RoBERTa-tiny-pair
超小句子对模型
7.5M 28.3M CLUECorpus2020 CLUEVocab TensorFlow
PyTorch (提取码:8qvb)
RoBERTa-tiny3L768-clue
小模型
38M 110M CLUECorpus2020 CLUEVocab TensorFlow
RoBERTa-tiny3L312-clue
小模型
<7.5M 24M CLUECorpus2020 CLUEVocab TensorFlow
RoBERTa-large-clue
大模型
290M 1.20G CLUECorpus2020 CLUEVocab TensorFlow
PyTorch (提取码:8qvb)
RoBERTa-large-pair
大句子对模型
290M 1.20G CLUECorpus2020 CLUEVocab TensorFlow
PyTorch (提取码:8qvb)

快速加载

依托于Huggingface-Transformers 2.5.1,可轻松调用以上模型。

tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")

其中MODEL_NAME对应列表如下:

模型名 MODEL_NAME
RoBERTa-tiny-clue clue/roberta_chinese_clue_tiny
RoBERTa-tiny-pair clue/roberta_chinese_pair_tiny
RoBERTa-tiny3L768-clue clue/roberta_chinese_3L768_clue_tiny
RoBERTa-tiny3L312-clue clue/roberta_chinese_3L312_clue_tiny
RoBERTa-large-clue clue/roberta_chinese_clue_large
RoBERTa-large-pair clue/roberta_chinese_pair_large

中文任务基准测评.分类与句子对任务

AFQMC:语义相似度任务
TNEWS':中文新闻(短文本)分类。包含15个类别的新闻,包括旅游,教育,金融,军事等。
IFLYTEK':关于app应用描述的长文本数据,包含和日常生活相关的各类应用主题,共119个类别,如:打车、地图导航、免费WIFI、经营等
CMNLI:自然语言推理任务,判断给定的两个句子之间的关系,如蕴涵、中立、矛盾。

效果对比-小模型

模型 Score 参数 AFQMC TNEWS' IFLYTEK' CMNLI
BERT-base (baseline) 67.57% 108M 73.70% 56.58% 60.29% 79.69%
ALBERT-tiny 60.65% 4M 69.92% 53.35% 48.71% 70.61%
RoBERTa-tiny-clue 63.60% 7.5M 69.52% 54.57% 57.31% 73.10%

效果对比-大模型

模型 Score 参数 AFQMC TNEWS' IFLYTEK' CMNLI
BERT-base (baseline) 67.57% 108M 73.70% 56.58% 60.29% 79.69%
RoBERTa-wwm-large 69.46% 325M 74.44% 58.41% 62.77% 82.20%
RoBERTa-large-clue 69.68% 290M 74.41% 58.38% 63.58% 82.36%

效果对比-句子对模型

模型 Score 参数 AFQMC
RoBERTa-tiny-clue 69.52% 7.5M 69.52%
RoBERTa-tiny-pair 69.93%(↑0.41) 7.5M 69.93 %
RoBERTa-large-clue 74.00% 92M 74.00%
RoBERTa-large-pair 74.41%(↑0.41) 92M 74.41%

速度对比

模型 词汇表 词汇表大小 参数量 训练设备 训练速度
BERT-base google_vocab 21128 102M TPU V3-8 1k steps/404s
BERT-base clue_vocab 8021(↓62.04%) 92M(↓9.80%) TPU V3-8 1k steps/350s(↑15.43%)
RoBERTa-tiny-clue clue_vocab 8021(↓62.05%) 7.5M(↓92.6%) TPU V3-8 1k steps/50s(↑708.0%)

小模型使用建议

1.学习率:稍微大一点的学习率,如{1e4, 4e-4 1e-5} 默认:1e-4

2.训练轮次:5-8。 使用验证集上效果最好的模型,用于测试集上测试或在线预测

3.相似性或句子对任务,优先使用专门的RoBERTa-xxx-pair模型,如RoBERTa-tiny-pair(小号)或 RoBERTa-large-pair(大号)

模型结构

为方便调用,所有模型都保持和Bert-base一致的结构,并可以直接使用Bert加载。
RoBERTa-xxx-clue.zip
    |- bert_model.ckpt      # 模型权重
    |- bert_model.meta      # 模型meta信息
    |- bert_model.index     # 模型index信息
    |- bert_config.json     # 模型参数
    |- vocab.txt            # 词表

一键运行.基线模型与代码 Baseline with codes

使用方式:
1、克隆项目 
   git clone https://github.com/CLUEbenchmark/CLUEPretrainedModels.git
2、进入到相应的目录
   分类任务  
       例如:
       cd CLUEPretrainedModels/baselines/models/bert
       ### cd CLUEPretrainedModels/baselines/models_pytorch/classifier_pytorch
3、运行一键运行脚本(GPU方式): 会自动下载模型和所有任务数据并开始运行。
   bash run_classifier_clue.sh
   执行该一键运行脚本将会自动下载所有任务数据,并为所有任务找到最优模型,然后测试得到提交结果
4、tpu使用方式(可选)  
    cd CLUEPretrainedModels/baselines/models/bert/tpu  
    sh run_classifier_tnews.sh即可测试tnews任务(注意更换里面的gs路径和tpu ip)。数据和模型会自动下载和上传。
    
    cd CLUEPretrainedModels/baselines/models/roberta/tpu  
    sh run_classifier_tiny.sh即可运行所有分类任务(注意更换里面的路径,模型地址和tpu ip)  

预训练

小文件预训练示例 需要下载一个模型文件,并将模型文件的地址修改到以下的bash文件中。最后训练结果会保存在当前目录下的test文件夹中.

bash create_sample_zh.sh
bash run_sample.sh

结果在MLM上应该是1.0的 acc。(NSP任务没有训练)

问题反馈和支持

如有问题请提交issue,加入讨论群(QQ:836811304)

或发送邮件[email protected]

Computation Power

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

TODO LIST:

  1. PyTorch版本
  2. 在线预测代码
  3. 一键运行6大分类任务
  4. 大模型去掉无效参数

Reference:

  1. 关于使用的任务的介绍:AFQMC(句子对), TNEWS'(情感分析), IFLYTEK'(100+类别的分类), CMNLI(自然语言推理)
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].