KoGPT2 (Korean GPT-2)
Why?
- To address the limited Korean-language performance of the OpenAI GPT-2 model
Model
- GPT-2 base model

```python
GPT2Model(units=768,
          max_length=1024,
          num_heads=12,
          num_layers=12,
          dropout=0.1,
          vocab_size=50000)
```
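As a quick sanity check, this configuration implies roughly the parameter count of the original GPT-2 base model. A back-of-the-envelope sketch (not from the repo; bias and layer-norm parameters, which add under 1%, are ignored):

```python
# Rough parameter count implied by the GPT2Model configuration above.
# Illustrative only: bias and layer-norm parameters are ignored.
units, max_length, num_layers, vocab_size = 768, 1024, 12, 50000

embeddings = vocab_size * units + max_length * units     # token + position tables
per_layer = 4 * units * units + 2 * units * (4 * units)  # attention projections + MLP
total = embeddings + num_layers * per_layer
print(f'~{total / 1e6:.0f}M parameters')                 # ~124M, i.e. GPT-2 base scale
```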
- A Fused GELU implementation improves training and inference speed by more than 10%
Tokenizer
- Trained on more than 25 million sentences (wiki + news)
- BPE (Byte Pair Encoding)
- 50,000 tokens
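A minimal sketch of the tokenizer in use, assuming the package from this repo is installed (the token split in the comment is illustrative, not an exact output):

```python
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import get_tokenizer

# Download (if needed) and load the 50,000-token BPE vocabulary.
tok = SentencepieceTokenizer(get_tokenizer())
print(tok('안녕하세요. 반갑습니다.'))
# e.g. ['▁안녕', '하세요.', '▁반갑', '습니다.']; '▁' marks a word boundary
```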
Data
| Data | # of Sentences | # of Words |
|---|---|---|
| Korean Wiki | 5M | 54M |
| Korean News | 120M | 1.6B |
| Other corpus | 9.4M, 18M | 88M, 82M |
- About 20GB of data in total, measured as raw text
Training
- This work was carried out in collaboration with the three teams below.
- SKT Conv.AI team: implemented the large-scale language model training logic
- Amazon Machine Learning Solutions Lab team: built the large-scale distributed training infrastructure
- GluonNLP team: improved training performance
How to install
```sh
git clone https://github.com/SKT-AI/KoGPT2.git
cd KoGPT2
pip install -r requirements.txt
pip install .
```
Requirements
- Python >= 3.6
- PyTorch == 1.5.0
- MXNet == 1.6.0
- onnxruntime == 1.5.2
- gluonnlp == 0.9.1
- sentencepiece >= 0.1.85
- transformers == 2.11.0
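If you would rather not go through requirements.txt, the pinned versions above can be installed directly; this is just a convenience sketch (choose torch/mxnet builds that match your CUDA setup):

```sh
pip install torch==1.5.0 mxnet==1.6.0 onnxruntime==1.5.2 \
    gluonnlp==0.9.1 "sentencepiece>=0.1.85" transformers==2.11.0
```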
How to use
PyTorch
- A simple example that generates a sentence from the prompt '2019년 한해를 보내며,' ("Sending off the year 2019,")
```python
import torch
from kogpt2.pytorch_kogpt2 import get_pytorch_kogpt2_model
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import get_tokenizer

tok_path = get_tokenizer()
model, vocab = get_pytorch_kogpt2_model()
tok = SentencepieceTokenizer(tok_path, num_best=0, alpha=0)

sent = '2019년 한해를 보내며,'
toked = tok(sent)
while True:
    # prepend <s> and map tokens to ids
    input_ids = torch.tensor([vocab[vocab.bos_token]] + vocab[toked]).unsqueeze(0)
    pred = model(input_ids)[0]
    # greedy decoding: take the most likely next token
    gen = vocab.to_tokens(torch.argmax(pred, axis=-1).squeeze().tolist())[-1]
    if gen == '</s>':
        break
    sent += gen.replace('▁', ' ')  # '▁' is the SentencePiece word-boundary marker
    toked = tok(sent)

print(sent)
# '2019년 한해를 보내며, 새해에는 더 많은 사람들이 새해에 이루고자 하는 소망과 희망을 되새겨보는 시간이 되었으면 좋겠다.'
```
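The loop above decodes greedily (always the argmax token), so it produces the same output on every run. A variation that is not part of the repo, swapping in top-k sampling for more varied generations (the helper name and k are arbitrary choices):

```python
import torch

def sample_next(logits, k=5):
    # Sample the next token id from the k most likely logits at the last position.
    topk = torch.topk(logits[0, -1], k)
    probs = torch.softmax(topk.values, dim=-1)
    return topk.indices[torch.multinomial(probs, 1)].item()

# In the generation loop, replace the argmax line with:
#   gen = vocab.to_tokens([sample_next(pred)])[0]
```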
The model is returned in eval() mode by default, so switch it to training mode with model.train() before using it for training.
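Building on that note, here is a minimal sketch of a single fine-tuning step, reusing model, vocab, and tok from the example above. The optimizer, learning rate, and loss wiring are illustrative assumptions, not the repo's training code:

```python
import torch

model.train()  # switch out of the default eval() mode
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

toked = tok('맛있는 밥을 먹었다.')  # any training sentence
input_ids = torch.tensor([vocab[vocab.bos_token]] + vocab[toked]).unsqueeze(0)

logits = model(input_ids)[0]
# Language-model loss: predict token t+1 from positions up to t.
loss = torch.nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    input_ids[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```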
MXNet-Gluon
```python
import mxnet as mx
from kogpt2.mxnet_kogpt2 import get_mxnet_kogpt2_model
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import get_tokenizer

# Run on GPU when available, otherwise CPU.
if mx.context.num_gpus() > 0:
    ctx = mx.gpu()
else:
    ctx = mx.cpu()

tok_path = get_tokenizer()
model, vocab = get_mxnet_kogpt2_model(ctx=ctx)
tok = SentencepieceTokenizer(tok_path, num_best=0, alpha=0)

sent = '2019년 한해를 보내며,'
toked = tok(sent)
while True:
    input_ids = mx.nd.array([vocab[vocab.bos_token]] + vocab[toked]).expand_dims(axis=0)
    pred = model(input_ids.as_in_context(ctx))[0]
    # greedy decoding: take the most likely next token
    gen = vocab.to_tokens(mx.nd.argmax(pred, axis=-1).squeeze().astype('int').asnumpy().tolist())[-1]
    if gen == '</s>':
        break
    sent += gen.replace('▁', ' ')
    toked = tok(sent)

print(sent)
# '2019년 한해를 보내며, 새해에는 더 많은 사람들이 새해에 이루고자 하는 소망과 희망을 되새겨보는 시간이 되었으면 좋겠다.'
```
ONNX
Export the pre-trained model to ONNX first:

```sh
python onnx/export_onnx_kogpt2.py
```
```python
import numpy as np
import torch
import onnxruntime
from kogpt2.pytorch_kogpt2 import get_pytorch_kogpt2_model
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import get_tokenizer

tok_path = get_tokenizer()
_, vocab = get_pytorch_kogpt2_model()  # only the vocab is needed here
model = onnxruntime.InferenceSession("./onnx/pytorch_kogpt2_676e9bcfa7.onnx")
tok = SentencepieceTokenizer(tok_path, num_best=0, alpha=0)

sent = '2019년 한해를 보내며,'
toked = tok(sent)
while True:
    input_ids = torch.tensor([vocab[vocab.bos_token]] + vocab[toked]).unsqueeze(0)
    pred = model.run(None, {'input_ids': np.array(input_ids)})[0]
    gen = vocab.to_tokens(torch.argmax(torch.tensor(pred), axis=-1).squeeze().tolist())[-1]
    if gen == '</s>':
        break
    sent += gen.replace('▁', ' ')
    toked = tok(sent)

print(sent)
# '2019년 한해를 보내며, 새해에는 더 많은 사람들이 새해에 이루고자 하는 소망과 희망을 되새겨보는 시간이 되었으면 좋겠다.'
```
How to deploy the pre-trained KoGPT-2 model to Amazon SageMaker
If you want to build an inference API for the pre-trained or fine-tuned KoGPT2 model, see the AWS Korea blog post "Deploying a KoGPT2 model with the Amazon SageMaker MXNet inference container" or the aws-samples Git repo.
Demo
KoGPT2-Explorer
Subtask Evaluations
Sentiment Analysis
| Model | Test Accuracy |
|---|---|
| BERT base multilingual cased | 0.875 |
| KoBERT | 0.901 |
| KoGPT2 | 0.899 |
Paraphrase Detection
Korean Paraphrase Detection data
| Model | Test Accuracy |
|---|---|
| KoBERT | 0.911 |
| KoGPT2 | 0.943 |
Examples
- Korean chatbot (chit-chat) model
- Novel-writing model (NarrativeKoGPT2)
- Lyrics-writing model
Contacts
Please post KoGPT2-related issues here.
License
KoGPT2 is released under a modified MIT license. Please comply with the license terms when using the model or code. The full license text is available in the LICENSE file.