imcaspar / Gpt2 Ml
Licence: apache-2.0
GPT2 for Multiple Languages, including pretrained models. Multilingual GPT2 support, with a 1.5-billion-parameter Chinese pretrained model
Stars: ✭ 1,066
Projects that are alternatives of or similar to Gpt2 Ml
Gpt2 Chinese
Chinese version of GPT2 training code, using BERT tokenizer.
Stars: ✭ 4,592 (+330.77%)
Mutual labels: chinese, text-generation
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+127.49%)
Mutual labels: chinese, pretrained-models
Crslab
CRSLab is an open-source toolkit for building Conversational Recommender System (CRS).
Stars: ✭ 183 (-82.83%)
Mutual labels: pretrained-models, text-generation
Gpt2 Newstitle
Chinese NewsTitle Generation Project by GPT2. A Chinese GPT2 news-title generation project with extremely detailed comments.
Stars: ✭ 235 (-77.95%)
Mutual labels: chinese, text-generation
gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Stars: ✭ 83 (-92.21%)
Mutual labels: text-generation, pretrained-models
AiSpace
AiSpace: Better practices for deep learning model development and deployment For Tensorflow 2.0
Stars: ✭ 28 (-97.37%)
Mutual labels: chinese, pretrained-models
Awesome Pretrained Chinese Nlp Models
Awesome Pretrained Chinese NLP Models, a collection of high-quality Chinese pretrained models
Stars: ✭ 195 (-81.71%)
Mutual labels: chinese, pretrained-models
Cluepretrainedmodels
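A collection of high-quality Chinese pretrained models: state-of-the-art large models, fastest small models, and specialized similarity models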
Stars: ✭ 493 (-53.75%)
Mutual labels: chinese, pretrained-models
Cnn Question Classification Keras
Chinese Question Classifier (Keras Implementation) on BQuLD
Stars: ✭ 28 (-97.37%)
Mutual labels: chinese
Cv Pretrained Model
A collection of computer vision pre-trained models.
Stars: ✭ 995 (-6.66%)
Mutual labels: pretrained-models
Rssbot
Lightweight Telegram RSS bot for notifications only.
Stars: ✭ 952 (-10.69%)
Mutual labels: chinese
Describing a knowledge base
Code for Describing a Knowledge Base
Stars: ✭ 42 (-96.06%)
Mutual labels: text-generation
Chinese Poetry
The most comprehensive database of Chinese poetry 🧶: nearly 14,000 poets of the Tang and Song dynasties, close to 55,000 Tang poems plus 260,000 Song poems, and 21,050 ci by 1,564 ci poets of the two Song periods.
Stars: ✭ 34,881 (+3172.14%)
Mutual labels: chinese
Asteroid
The PyTorch-based audio source separation toolkit for researchers
Stars: ✭ 862 (-19.14%)
Mutual labels: pretrained-models
Classification models
Classification models trained on ImageNet. Keras.
Stars: ✭ 938 (-12.01%)
Mutual labels: pretrained-models
Pime
Develop input methods for Windows easily with Python and node.js
Stars: ✭ 1,051 (-1.41%)
Mutual labels: chinese
Trime
Tongwen Android input method platform 3.x / Android-rime / Rime Input Method Engine for Android
Stars: ✭ 1,032 (-3.19%)
Mutual labels: chinese
GPT2 for Multiple Languages
- [x] Simplified GPT2 training scripts (based on Grover, supporting TPUs)
- [x] Ported BERT tokenizer, compatible with multilingual corpora
- [x] 1.5B GPT2 pretrained Chinese model (~15G corpus, 100k steps)
- [x] Batteries-included Colab demo
- [x] 1.5B GPT2 pretrained Chinese model (~30G corpus, 220k steps)
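To make the "ported BERT tokenizer" item more concrete, here is a minimal sketch of the greedy longest-match-first WordPiece scheme that BERT-style tokenizers use, with each CJK character first split off as its own word. It is an illustration only, not the repository's code: `TOY_VOCAB`, `is_cjk`, `wordpiece`, and `tokenize` are hypothetical names, the vocabulary is invented, lowercasing is an assumption, and only the main CJK Unified Ideographs block is checked.

```python
# Hypothetical toy vocabulary; the real vocab files are far larger.
TOY_VOCAB = {"gpt", "##2", "中", "文", "模", "型"}


def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block (simplified check)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into vocabulary pieces, always taking the longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Isolate CJK characters, lowercase, split on whitespace, then apply WordPiece."""
    spaced = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)
    tokens = []
    for word in spaced.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


print(tokenize("GPT2中文模型", TOY_VOCAB))
# ['gpt', '##2', '中', '文', '模', '型']
```

Because every CJK character becomes its own token, a relatively small vocabulary can cover Chinese text, while Latin-script words fall back to subword pieces.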
Pretrained Model
| Size | Language | Corpus | Vocab | Link1 | Link2 | SHA256 |
|---|---|---|---|---|---|---|
| 1.5B Params | Chinese | ~30G | CLUE ( 8021 tokens ) | Google Drive | Baidu Pan (ffz6) | e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482 |
| 1.5B Params | Chinese | ~15G | Bert ( 21128 tokens ) | Google Drive | Baidu Pan (q9vr) | 4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e |
Corpus from THUCNews and nlp_chinese_corpus
Trained for 220k steps on a Cloud TPU Pod v3-256
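The SHA256 column can be used to check a download before loading it. The following is a minimal sketch using only the Python standard library. The digests are copied from the table and written as single 64-character strings; the dictionary keys, the function names, and the example file path are hypothetical, and which downloaded file each digest covers is an assumption you should confirm against the download itself.

```python
import hashlib

# Digests from the Pretrained Model table; the keys are hypothetical labels.
EXPECTED_SHA256 = {
    "chinese_30g_clue": "e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482",
    "chinese_15g_bert": "4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e",
}


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte checkpoints never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_hex):
    """Return True when the file's SHA256 matches the expected hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.lower()


# Example usage (the path is a placeholder for wherever the download was saved):
# ok = verify_checkpoint("path/to/downloaded/checkpoint", EXPECTED_SHA256["chinese_30g_clue"])
# print("checksum OK" if ok else "checksum mismatch, re-download the file")
```

A mismatch usually points to an interrupted or corrupted download rather than a problem with the model itself.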
Google Colab
With just 2 clicks (not including Colab auth process), the 1.5B pretrained Chinese model demo is ready to go:

Train
Disclaimer
The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.
Citation
@misc{GPT2-ML,
author = {Zhibo Zhang},
title = {GPT2-ML: GPT-2 for Multiple Languages},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}
Reference
https://github.com/google-research/bert
https://github.com/rowanz/grover
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)
Press