
CLUEbenchmark / Cluecorpus2020

Large-scale Pre-training Corpus for Chinese: a 100 GB Chinese pre-training corpus

Projects that are alternatives to or similar to Cluecorpus2020

Cluedatasetsearch
Search across all Chinese NLP datasets, with commonly used English NLP datasets included
Stars: ✭ 2,112 (+659.71%)
Mutual labels:  chinese, datasets, corpus
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+2294.24%)
Mutual labels:  chinese, corpus
Cluepretrainedmodels
A collection of high-quality Chinese pre-trained models: state-of-the-art large models, the fastest small models, and dedicated similarity models
Stars: ✭ 493 (+77.34%)
Mutual labels:  chinese, corpus
Datasets
Poetry-related datasets developed by THUAIPoet (Jiuge) group.
Stars: ✭ 111 (-60.07%)
Mutual labels:  chinese, corpus
Chinese Nlp Corpus
Collections of Chinese NLP corpora
Stars: ✭ 438 (+57.55%)
Mutual labels:  datasets, corpus
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+772.3%)
Mutual labels:  chinese, corpus
CLUEmotionAnalysis2020
CLUE Emotion Analysis Dataset: a fine-grained emotion analysis dataset
Stars: ✭ 3 (-98.92%)
Mutual labels:  corpus, chinese
TV4Dialog
No description or website provided.
Stars: ✭ 33 (-88.13%)
Mutual labels:  corpus, chinese
open2ch-dialogue-corpus
A dialogue corpus built by crawling the Open 2channel bulletin board
Stars: ✭ 65 (-76.62%)
Mutual labels:  corpus, datasets
Weibo terminater
The final Weibo crawler: scrape anything from Weibo (comments, post contents, followers, anything). The Terminator
Stars: ✭ 2,295 (+725.54%)
Mutual labels:  chinese, corpus
CBLUE
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Stars: ✭ 379 (+36.33%)
Mutual labels:  corpus, chinese
OpenDialog
An Open-Source Package for Chinese Open-domain Conversational Chatbot (a Chinese chit-chat dialogue system with one-click deployment as a WeChat chatbot)
Stars: ✭ 94 (-66.19%)
Mutual labels:  corpus, chinese
dbcollection
A collection of popular datasets for deep learning.
Stars: ✭ 26 (-90.65%)
Mutual labels:  datasets
Xmorse
🌞 ~1.5 KB Morse code library for all. A JavaScript library that supports Unicode Chinese Morse code encoding.
Stars: ✭ 266 (-4.32%)
Mutual labels:  chinese
sqlmap-wiki-zhcn
Possibly the most complete Chinese documentation for sqlmap.
Stars: ✭ 51 (-81.65%)
Mutual labels:  chinese
awesome-hokchew
A curated list of resources about the Hokchew / Foochow language.
Stars: ✭ 16 (-94.24%)
Mutual labels:  chinese
Meglass
An eyeglass face dataset collected and cleaned for face recognition evaluation, CCBR 2018.
Stars: ✭ 281 (+1.08%)
Mutual labels:  datasets
Hub
Dataset format for AI. Build, manage, & visualize datasets for deep learning. Stream data real-time to PyTorch/TensorFlow & version-control it. https://activeloop.ai
Stars: ✭ 4,003 (+1339.93%)
Mutual labels:  datasets
newsletter-archive
Markdown archive & RSS/Atom feeds for Data Is Plural.
Stars: ✭ 65 (-76.62%)
Mutual labels:  datasets
Medical-Names-Corpus
Medical corpora: a corpus of medical institution names and national drug codes.
Stars: ✭ 26 (-90.65%)
Mutual labels:  corpus

CLUECorpus2020

Corpus Introduction

By cleaning the Chinese portion of Common Crawl, we obtained 100 GB of high-quality Chinese pre-training corpus. For the models produced in our experiments, see: high-quality Chinese pre-trained models, including large, tiny, and similarity pre-trained models.

For more details, please refer to our technical report: https://arxiv.org/pdf/2003.01355

[Figure: corpus overview (./pics/corpus.png)]

Data characteristics:

  1. Can be used directly for pre-training, language modeling, or language generation tasks.
  2. We release a small vocabulary dedicated to simplified Chinese NLP tasks.

Vocabulary

Statistics of Google's original Chinese vocabulary and our released small vocabulary are as follows:

Token Type             Google    CLUE
Simplified Chinese     11378     5689
Traditional Chinese    3264      —
English                3529      1320
Japanese               573       —
Korean                 84        —
Emoji                  56        —
Numbers                1179      140
Special Tokens         106       106
Other Tokens           959       766
Total                  21128     8021
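
As a minimal sketch, the small vocabulary can be plugged into a standard BERT-style tokenizer, assuming it is distributed as a plain vocab.txt file (one token per line); the file name below is a hypothetical placeholder, not fixed by this repository.

    # A minimal sketch, assuming the CLUE small vocabulary ships as a
    # BERT-style vocab.txt with one token per line. The file path below
    # is a hypothetical placeholder.
    from transformers import BertTokenizer

    # Build a tokenizer over the 8,021-token CLUE vocabulary instead of
    # Google's original 21,128-token Chinese vocabulary.
    clue_tokenizer = BertTokenizer(vocab_file="clue_vocab.txt")  # assumed path

    print(clue_tokenizer.vocab_size)  # expected to be around 8021
    print(clue_tokenizer.tokenize("通过对Common Crawl的中文部分进行语料清洗"))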

Experimental Results

Comparison of results with BERT-base pre-trained on small datasets:

Model         Vocab   Data         Steps   AFQMC    TNEWS'   IFLYTEK'  CMNLI    AVG
BERT-base     Google  Wiki (1 GB)  125K    69.93%   54.77%   57.54%    75.64%   64.47%
BERT-base     Google  C5 (1 GB)    125K    69.63%   55.72%   58.87%    75.75%   64.99%
BERT-base     CLUE    C5 (1 GB)    125K    69.00%   55.04%   59.07%    75.84%   64.74%
BERT-base mm  Google  C5 (1 GB)    125K    69.57%   55.17%   59.69%    75.86%   65.07%
BERT-base     Google  C5 (1 GB)    375K    69.85%   55.97%   59.62%    76.41%   65.46%
BERT-base     CLUE    C5 (1 GB)    375K    69.93%   56.38%   59.35%    76.58%   65.56%
BERT-base     Google  C5 (3 GB)    375K    70.22%   56.41%   59.58%    76.70%   65.73%
BERT-base     CLUE    C5 (3 GB)    375K    69.49%   55.97%   60.12%    77.66%   65.81%

For more experimental results and analysis, see: CLUEPretrainedModels
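
A minimal sketch of using one of the resulting pre-trained models with the Hugging Face transformers library, assuming a checkpoint from CLUEPretrainedModels has been downloaded locally in the standard BERT format (config.json, vocab.txt, model weights); the local directory name is a placeholder.

    # A minimal sketch, assuming a CLUE pre-trained checkpoint has been
    # downloaded into a local directory in standard BERT format. The path
    # below is a hypothetical placeholder.
    from transformers import BertTokenizer, BertModel

    local_dir = "./clue_pretrained_model"  # assumed download location
    tokenizer = BertTokenizer.from_pretrained(local_dir)
    model = BertModel.from_pretrained(local_dir)

    # Encode a short Chinese sentence and run a forward pass.
    inputs = tokenizer("中文预训练语料", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)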

Data Download

How to apply: send an email describing the research purpose and intended use of the corpus, your plan, your research institution, and an introduction of the applicant, together with a pledge not to provide the corpus to any third party.

Email: [email protected], subject line: CLUECorpus2020 100G语料库 (CLUECorpus2020 100 GB corpus)

CLUECorpusSmall (14 GB)

Suitable for language modeling, pre-training, generation tasks, and more. It contains over 14 GB of data in nearly 4,000 well-defined txt files, about 5 billion Chinese characters in total. The main part comes from the nlp_chinese_corpus project.

The current corpus has been processed into the pre-training format and contains multiple folders; each folder holds many small files no larger than 4 MB. The file format follows the pre-training convention: one sentence per line, with a blank line between documents (see the sketch after the list of sub-corpora below).

It includes the following sub-corpora (14 GB in total):

1. News corpus news2016zh_corpus: 8 GB of text, split into two parts, 2,000 small files in total. Password: mzlk

2. Community interaction corpus webText2019zh_corpus: 3 GB of text, more than 900 small files in total. Password: qvlq

3. Wikipedia corpus wiki2019zh_corpus: about 1.1 GB of text in roughly 300 small files. Password: rja4

4. Review corpus comments2019zh_corpus: about 2.3 GB of text in 784 small files, including 547 files of Dianping reviews and 227 files of Amazon reviews; multiple review datasets from ChineseNLPCorpus were merged, cleaned, converted, and split into small files. Password: gc3m
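
A minimal sketch of reading files in the pre-training format described above (one sentence per line, documents separated by a blank line); the directory path is a placeholder and the .txt suffix is assumed.

    # A minimal sketch for reading corpus files in the pre-training format:
    # one sentence per line, documents separated by a blank line.
    # The directory path is a hypothetical placeholder.
    from pathlib import Path

    def read_documents(corpus_dir):
        """Yield each document as a list of sentences."""
        for path in sorted(Path(corpus_dir).rglob("*.txt")):
            sentences = []
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:                 # a sentence line
                        sentences.append(line)
                    elif sentences:          # a blank line closes a document
                        yield sentences
                        sentences = []
            if sentences:                    # last document without trailing blank line
                yield sentences

    # Example usage with a placeholder path:
    # for doc in read_documents("./CLUECorpusSmall"):
    #     print(len(doc), "sentences")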

Feedback and Support

You can open an issue or join the discussion group (QQ: 836811304).

Or send an email to [email protected]

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

Citation

@article{CLUECorpus2020,
  title={CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model},
  author={Liang Xu and Xuanwei Zhang and Qianqian Dong},
  journal={ArXiv},
  year={2020},
  volume={abs/2003.01355}
}

Donation

CLUE is an open-source organization dedicated to Chinese natural language processing. If you find our work helpful for your study or business, we would appreciate your sponsorship so that we can continue to provide more and better open-source work; let us contribute together to the development and progress of Chinese NLP.

Please include the donor's institution and name in the payment remark. Thank you very much!

Alipay | WeChat Pay