imcaspar / Gpt2 Ml

Licence: apache-2.0
GPT2 for Multiple Languages, including pretrained models. Multilingual GPT2 support, with a 1.5-billion-parameter pretrained Chinese model.

Programming Languages

python

Projects that are alternatives of or similar to Gpt2 Ml

Gpt2 Chinese
Chinese version of GPT2 training code, using BERT tokenizer.
Stars: ✭ 4,592 (+330.77%)
Mutual labels:  chinese, text-generation
Clue
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Stars: ✭ 2,425 (+127.49%)
Mutual labels:  chinese, pretrained-models
Crslab
CRSLab is an open-source toolkit for building Conversational Recommender System (CRS).
Stars: ✭ 183 (-82.83%)
Mutual labels:  pretrained-models, text-generation
Gpt2 Newstitle
Chinese NewsTitle Generation Project by GPT2. A Chinese GPT2 news title generation project with extremely detailed comments.
Stars: ✭ 235 (-77.95%)
Mutual labels:  chinese, text-generation
gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Stars: ✭ 83 (-92.21%)
Mutual labels:  text-generation, pretrained-models
AiSpace
AiSpace: Better practices for deep learning model development and deployment For Tensorflow 2.0
Stars: ✭ 28 (-97.37%)
Mutual labels:  chinese, pretrained-models
Awesome Pretrained Chinese Nlp Models
Awesome Pretrained Chinese NLP Models: a collection of high-quality pretrained Chinese models
Stars: ✭ 195 (-81.71%)
Mutual labels:  chinese, pretrained-models
Text-Generate-RNN
Classical Chinese poetry generation (text generation)
Stars: ✭ 106 (-90.06%)
Mutual labels:  text-generation, chinese
Cluepretrainedmodels
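A collection of high-quality pretrained Chinese models: state-of-the-art large models, fastest small models, and specialized similarity models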
Stars: ✭ 493 (-53.75%)
Mutual labels:  chinese, pretrained-models
Cnn Question Classification Keras
Chinese Question Classifier (Keras Implementation) on BQuLD
Stars: ✭ 28 (-97.37%)
Mutual labels:  chinese
Cv Pretrained Model
A collection of computer vision pre-trained models.
Stars: ✭ 995 (-6.66%)
Mutual labels:  pretrained-models
Awesome Go Zh
📚 Curated Go resources, Chinese edition (including a comprehensive list of Chinese-language books)
Stars: ✭ 887 (-16.79%)
Mutual labels:  chinese
Rssbot
Lightweight Telegram RSS bot for notifications only.
Stars: ✭ 952 (-10.69%)
Mutual labels:  chinese
Describing a knowledge base
Code for Describing a Knowledge Base
Stars: ✭ 42 (-96.06%)
Mutual labels:  text-generation
Chinese Poetry
The most comprehensive database of Chinese poetry 🧶: nearly 14,000 poets of the Tang and Song dynasties, close to 55,000 Tang poems plus 260,000 Song poems, and 21,050 ci by 1,564 ci poets of the Song period.
Stars: ✭ 34,881 (+3172.14%)
Mutual labels:  chinese
Awesome Gameserver Cn
A comprehensive collection of Chinese-language game server resources
Stars: ✭ 1,038 (-2.63%)
Mutual labels:  chinese
Asteroid
The PyTorch-based audio source separation toolkit for researchers
Stars: ✭ 862 (-19.14%)
Mutual labels:  pretrained-models
Classification models
Classification models trained on ImageNet. Keras.
Stars: ✭ 938 (-12.01%)
Mutual labels:  pretrained-models
Pime
Develop input methods for Windows easily with Python and node.js
Stars: ✭ 1,051 (-1.41%)
Mutual labels:  chinese
Trime
Tongwen Android input method platform 3.x / Android-rime / Rime Input Method Engine for Android
Stars: ✭ 1,032 (-3.19%)
Mutual labels:  chinese

GPT2 for Multiple Languages


  • [x] Simplified GPT2 training scripts (based on Grover, supporting TPUs)
  • [x] Ported BERT tokenizer, compatible with multilingual corpora
  • [x] 1.5B GPT2 pretrained Chinese model (~15G corpus, 100k steps)
  • [x] Batteries-included Colab demo
  • [x] 1.5B GPT2 pretrained Chinese model (~30G corpus, 220k steps)
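
To illustrate what a BERT-style tokenizer does with mixed Chinese and Latin text, here is a minimal sketch of greedy longest-match-first WordPiece splitting. It is not the repository's tokenizer code: the function names, the toy vocabulary, and the simplified CJK range check are all assumptions made for illustration, and the real BERT tokenizer also handles lowercasing, punctuation, and more Unicode ranges.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Split one whitespace-delimited word into vocabulary pieces,
    always taking the longest piece that matches at the current position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def is_cjk(ch):
    # Simplified check (assumption): only the main CJK Unified Ideographs block.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def tokenize(text, vocab):
    """Isolate each CJK character as its own word, then apply WordPiece."""
    spaced = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)
    tokens = []
    for word in spaced.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens


# Hypothetical toy vocabulary, far smaller than the 8021- or 21128-token vocabularies.
toy_vocab = {"我", "爱", "play", "##ing", "gpt", "##2"}
print(tokenize("我爱 playing gpt2", toy_vocab))
# ['我', '爱', 'play', '##ing', 'gpt', '##2']
```

Because every Chinese character is split off as a separate token, a character-level vocabulary covers Chinese text, while the WordPiece step still handles words from other languages.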

Pretrained Model

| Size | Language | Corpus | Vocab | Link1 | Link2 | SHA256 |
| 1.5B Params | Chinese | ~30G | CLUE (8021 tokens) | Google Drive | Baidu Pan (ffz6) | e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482 |
| 1.5B Params | Chinese | ~15G | Bert (21128 tokens) | Google Drive | Baidu Pan (q9vr) | 4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e |
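
The SHA256 values listed above can be used to check a download before loading it. The following is a minimal sketch using only the Python standard library. It assumes each digest is computed over the single downloaded file as a whole, which the source does not state explicitly. The helper names and the file path in the usage comment are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (hypothetical file name; the digest is the one listed for the ~30G-corpus model):
# ok = matches_published_digest(
#     "downloaded_model_file",
#     "e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482",
# )
```

If the comparison fails, the download is likely incomplete or corrupted and should be retried before any further debugging.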

Corpus from THUCNews and nlp_chinese_corpus

Trained for 220k steps on a Cloud TPU Pod v3-256

[Figure: training loss curve]

Google Colab

With just 2 clicks (not including Colab auth process), the 1.5B pretrained Chinese model demo is ready to go:

[Colab Notebook]

Train

Disclaimer

The contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.

Citation

@misc{GPT2-ML,
  author = {Zhibo Zhang},
  title = {GPT2-ML: GPT-2 for Multiple Languages},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}

Reference

https://github.com/google-research/bert

https://github.com/rowanz/grover

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

Press

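[机器之心 (Synced)] Just three clicks to have Chinese GPT-2 generate a custom story for you

[科学空间 (Scientific Spaces)] You can now play with Chinese GPT2 in Keras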
