NLP_PEMDC
NLP Pretrained Embeddings, Models and Datasets Collections (NLP_PEMDC)
A collection of the pretrained word embeddings, models, and datasets for NLP that I come across. The collection is updated continuously. These pretrained word vectors and datasets are intended for learning and research purposes only.
Entries are not ranked; they appear in the order I added them. All datasets belong to their original authors, thanks! If anything here infringes your rights, please email me and I will remove it.
Pretrained Chinese Word Vectors (embeddings):
Word2vec
- 100+ Chinese Word Vectors (over a hundred sets of pretrained Chinese word vectors)
- Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
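Vector files like the ones above are commonly shipped in the word2vec text format: a header line with the vocabulary size and dimension, then one line per word followed by its values. The following is a minimal sketch of a loader under that assumption; check each project's own README for the actual file layout. The function names `load_text_vectors` and `cosine` and the sample data are made up for illustration.

```python
import io
import math
from typing import Dict, List, Optional, TextIO


def load_text_vectors(handle: TextIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... v_dim' per line."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so that a token containing spaces stays intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Tiny made-up sample standing in for a real vector file.
sample = io.StringIO("3 2\n中国 1.0 0.0\n北京 0.8 0.6\n苹果 0.0 1.0\n")
vectors = load_text_vectors(sample)
print(len(vectors), round(cosine(vectors["中国"], vectors["北京"]), 2))
```

For a real file, pass `open(path, encoding="utf-8")` instead of the in-memory sample, and use `limit` to load only the first few thousand words while experimenting. If gensim is installed, `KeyedVectors.load_word2vec_format(path, binary=False)` reads the same format and is likely the more practical choice for full-size files.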
GloVe
TODO
Chinese Pre-trained Models
- Chinese-BERT
- Chinese-BERT-wwm
- Chinese-XLNet
- Chinese-RoBERTa
- Chinese-ALBERT
Chinese Corpus:
- [collections] Large Scale Chinese Corpus for NLP
- [collections] Sogou Labs corpus collection
- [collections] ChineseNlpCorpus
- [collections] ChineseGLUE
  Currently includes:
  - LCQMC (COLING 2018): Semantic Similarity Task on colloquial descriptions
  - XNLI (EMNLP 2015): Natural Language Inference
  - TNEWS: Short Text Classification for News (Toutiao Chinese news)
  - INEWS: Sentiment Analysis for Internet News
  - THUCNEWS: Long Text Classification
  - iFLYTEK: Long Text Classification
  - DRCD: Reading Comprehension for Traditional Chinese
  - CMRC2018: Reading Comprehension for Simplified Chinese
  - BQ (EMNLP 2018): Question Matching for Customer Service
  - MSRANER: Named Entity Recognition
  - CHID: Chinese IDiom Dataset for Cloze Test
  - CMNLI: Chinese Multi-Genre NLI, a language inference task
- LCSTS: A Large Scale Chinese Short Text Summarization Dataset
- chinese-poetry: the most comprehensive database of classical Chinese poetry and literature
- SentiBridge: a Chinese entity sentiment knowledge base
English Corpus:
- [collections] GLUE
  Including:
  - The Corpus of Linguistic Acceptability
  - The Stanford Sentiment Treebank
  - Microsoft Research Paraphrase Corpus
  - Semantic Textual Similarity Benchmark
  - Quora Question Pairs
  - MultiNLI Matched
  - MultiNLI Mismatched
  - Question NLI
  - Recognizing Textual Entailment
  - Winograd NLI
  - Diagnostics Main
- [collections] SuperGLUE
  Including:
  - Broad-coverage Diagnostics
  - CommitmentBank
  - Choice of Plausible Alternatives
  - Multi-Sentence Reading Comprehension
  - Recognizing Textual Entailment
  - Words in Context
  - The Winograd Schema Challenge
  - BoolQ
  - Reading Comprehension with Commonsense Reasoning
  - Winogender Schema Diagnostics
- IMDB Large Movie Review Dataset
  A dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data.
- SQuAD2.0: The Stanford Question Answering Dataset
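As a rough illustration of how the IMDB Large Movie Review Dataset listed above can be read for binary sentiment classification, here is a minimal sketch. It assumes the archive unpacks to an `aclImdb/` folder with `train` and `test` splits, each holding `pos` and `neg` subfolders of one-review-per-file `.txt` files; verify that layout against your download. The helper name `read_split` is made up for this example, and the tiny directory tree is fake data so the sketch runs without the real archive.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_split(root: Path, split: str) -> List[Tuple[str, int]]:
    """Return (review_text, label) pairs; label 1 = positive, 0 = negative."""
    examples: List[Tuple[str, int]] = []
    for label_name, label in (("neg", 0), ("pos", 1)):
        # Assumed layout: <root>/<split>/<pos|neg>/*.txt
        for path in sorted((root / split / label_name).glob("*.txt")):
            examples.append((path.read_text(encoding="utf-8"), label))
    return examples


# Build a tiny fake tree so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "aclImdb"
    for label_name, text in (("pos", "A wonderful film."), ("neg", "A dull film.")):
        folder = root / "train" / label_name
        folder.mkdir(parents=True)
        (folder / "0_1.txt").write_text(text, encoding="utf-8")
    train = read_split(root, "train")

print(train)
```

With the real data, point `root` at the unpacked folder and call `read_split(root, "train")` and `read_split(root, "test")`; the unlabeled reviews are not handled by this sketch.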