
SKT-AI / Kogpt2

Licence: other
Korean GPT-2 pretrained cased (KoGPT2)

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Kogpt2

CodeT5
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.
Stars: ✭ 390 (+5.98%)
Mutual labels:  language-model
python-arpa
🐍 Python library for n-gram models in ARPA format
Stars: ✭ 35 (-90.49%)
Mutual labels:  language-model
Transfer Nlp
NLP library designed for reproducible experimentation management
Stars: ✭ 287 (-22.01%)
Mutual labels:  language-model
MinTL
MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems
Stars: ✭ 61 (-83.42%)
Mutual labels:  language-model
SDLM-pytorch
Code accompanying EMNLP 2018 paper Language Modeling with Sparse Product of Sememe Experts
Stars: ✭ 27 (-92.66%)
Mutual labels:  language-model
few-shot-lm
The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)
Stars: ✭ 32 (-91.3%)
Mutual labels:  language-model
Word-Prediction-Ngram
Next Word Prediction using n-gram Probabilistic Model with various Smoothing Techniques
Stars: ✭ 25 (-93.21%)
Mutual labels:  language-model
Trankit
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Stars: ✭ 311 (-15.49%)
Mutual labels:  language-model
DataAugmentationNMT
Data Augmentation for Neural Machine Translation
Stars: ✭ 26 (-92.93%)
Mutual labels:  language-model
Bertweet
BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Stars: ✭ 282 (-23.37%)
Mutual labels:  language-model
pyVHDLParser
Streaming based VHDL parser.
Stars: ✭ 51 (-86.14%)
Mutual labels:  language-model
minicons
Utility for analyzing Transformer based representations of language.
Stars: ✭ 28 (-92.39%)
Mutual labels:  language-model
A Pytorch Tutorial To Sequence Labeling
Empower Sequence Labeling with Task-Aware Neural Language Model | a PyTorch Tutorial to Sequence Labeling
Stars: ✭ 257 (-30.16%)
Mutual labels:  language-model
tying-wv-and-wc
Implementation for "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling"
Stars: ✭ 39 (-89.4%)
Mutual labels:  language-model
Xlnet Pytorch
An implementation of Google Brain's 2019 XLNet in PyTorch
Stars: ✭ 304 (-17.39%)
Mutual labels:  language-model
gpt-j
A GPT-J API to use with python3 to generate text, blogs, code, and more
Stars: ✭ 101 (-72.55%)
Mutual labels:  language-model
Chinese-Word-Segmentation-in-NLP
State of the art Chinese Word Segmentation with Bi-LSTMs
Stars: ✭ 23 (-93.75%)
Mutual labels:  language-model
Azureml Bert
End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
Stars: ✭ 342 (-7.07%)
Mutual labels:  language-model
Gpt Neox
An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.
Stars: ✭ 303 (-17.66%)
Mutual labels:  language-model
Bluebert
BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).
Stars: ✭ 273 (-25.82%)
Mutual labels:  language-model

KoGPT2 (Korean GPT-2)

Why?

  • The OpenAI GPT-2 model has limited performance on Korean

Model

  • GPT-2 base model (a roughly equivalent transformers-style configuration is sketched after this list)
GPT2Model(units=768,
    max_length=1024,
    num_heads=12,
    num_layers=12,
    dropout=0.1,
    vocab_size=50000)
  • Training and inference are more than 10% faster thanks to a fused GELU implementation
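
For reference, the configuration above maps roughly onto a Hugging Face transformers GPT2Config. The sketch below only illustrates that mapping; it uses transformers 2.11-era parameter names and is not part of the KoGPT2 package itself:
from transformers import GPT2Config

# Roughly equivalent settings expressed with transformers' GPT2Config
config = GPT2Config(vocab_size=50000,   # vocab_size
                    n_positions=1024,   # max_length
                    n_ctx=1024,
                    n_embd=768,         # units
                    n_layer=12,         # num_layers
                    n_head=12,          # num_heads
                    resid_pdrop=0.1,    # dropout
                    embd_pdrop=0.1,
                    attn_pdrop=0.1)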

Tokenizer

  • Trained on more than 25 million sentences (wiki + news)
  • BPE (Byte Pair Encoding)
  • 50,000 tokens (see the short tokenization example below)
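
For a quick look at the tokenizer, it can be loaded with the same helpers used in the usage examples further below; the sample sentence is arbitrary and the exact sub-word split may vary by version:
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import get_tokenizer

tok_path = get_tokenizer()                                  # fetches the SentencePiece model
tok = SentencepieceTokenizer(tok_path, num_best=0, alpha=0)
print(tok('2019λ…„ ν•œν•΄λ₯Ό 보내며,'))                          # prints the BPE sub-word tokens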

Data

Data           # of Sentences   # of Words
Korean Wiki    5M               54M
Korean News    120M             1.6B
Other corpus   9.4M, 18M        88M, 82M
  • Approximately 20 GB of data was used, measured as raw text

Training

  • This work was carried out in collaboration with the following three teams:
    • SKT Conv.AI team: implemented the large-scale language model training logic
    • Amazon Machine Learning Solutions Lab team: built the large-scale distributed training infrastructure
    • GluonNLP team: improved training performance

How to install

git clone https://github.com/SKT-AI/KoGPT2.git
cd KoGPT2
pip install -r requirements.txt
pip install .

Requirements
  • Python >= 3.6
  • PyTorch == 1.5.0
  • MXNet == 1.6.0
  • onnxruntime == 1.5.2
  • gluonnlp == 0.9.1
  • sentencepiece >= 0.1.85
  • transformers == 2.11.0

How to use

PyTorch

  • A simple example that generates a sentence from the prompt '2019λ…„ ν•œν•΄λ₯Ό 보내며,'
import torch
from kogpt2.pytorch_kogpt2 import get_pytorch_kogpt2_model
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import get_tokenizer

tok_path = get_tokenizer()
model, vocab = get_pytorch_kogpt2_model()
tok = SentencepieceTokenizer(tok_path,  num_best=0, alpha=0)
sent = '2019λ…„ ν•œν•΄λ₯Ό 보내며,'
toked = tok(sent)
while 1:
  input_ids = torch.tensor([vocab[vocab.bos_token],]  + vocab[toked]).unsqueeze(0)
  pred = model(input_ids)[0]
  gen = vocab.to_tokens(torch.argmax(pred, axis=-1).squeeze().tolist())[-1]
  if gen == '</s>':
      break
  sent += gen.replace('▁', ' ')
  toked = tok(sent)
sent
'2019λ…„ ν•œν•΄λ₯Ό 보내며, μƒˆν•΄μ—λŠ” 더 λ§Žμ€ μ‚¬λžŒλ“€μ΄ μƒˆν•΄μ— 이루고자 ν•˜λŠ” μ†Œλ§κ³Ό 희망을 λ˜μƒˆκ²¨λ³΄λŠ” μ‹œκ°„μ΄ λ˜μ—ˆμœΌλ©΄ μ’‹κ² λ‹€.'

The model is returned in eval() mode by default, so if you want to use it for training, switch it to training mode with model.train().
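
For example, a minimal training step could look like the sketch below. It assumes the returned model behaves like a standard Hugging Face GPT2LMHeadModel (which returns the language-modeling loss as the first output when labels are passed); the optimizer and learning rate are placeholders rather than recommended settings:
import torch
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.pytorch_kogpt2 import get_pytorch_kogpt2_model
from kogpt2.utils import get_tokenizer

tok_path = get_tokenizer()
model, vocab = get_pytorch_kogpt2_model()
tok = SentencepieceTokenizer(tok_path, num_best=0, alpha=0)

model.train()                                              # switch from the default eval() mode
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # placeholder optimizer / learning rate

toked = tok('2019λ…„ ν•œν•΄λ₯Ό 보내며,')
input_ids = torch.tensor([vocab[vocab.bos_token]] + vocab[toked]).unsqueeze(0)
loss = model(input_ids, labels=input_ids)[0]               # LM loss comes first when labels are given
loss.backward()
optimizer.step()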

MXNet-Gluon

import mxnet as mx
from kogpt2.mxnet_kogpt2 import get_mxnet_kogpt2_model
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import get_tokenizer

if mx.context.num_gpus() > 0:
  ctx = mx.gpu()
else:
  ctx = mx.cpu()

tok_path = get_tokenizer()
model, vocab = get_mxnet_kogpt2_model(ctx=ctx)
tok = SentencepieceTokenizer(tok_path, num_best=0, alpha=0)
sent = '2019λ…„ ν•œν•΄λ₯Ό 보내며,'
toked = tok(sent)
while 1:
  input_ids = mx.nd.array([vocab[vocab.bos_token]]  + vocab[toked]).expand_dims(axis=0)
  pred = model(input_ids.as_in_context(ctx))[0]
  gen = vocab.to_tokens(mx.nd.argmax(pred, axis=-1).squeeze().astype('int').asnumpy().tolist())[-1]
  if gen == '</s>':
    break
  sent += gen.replace('▁', ' ')
  toked = tok(sent)
sent
'2019λ…„ ν•œν•΄λ₯Ό 보내며, μƒˆν•΄μ—λŠ” 더 λ§Žμ€ μ‚¬λžŒλ“€μ΄ μƒˆν•΄μ— 이루고자 ν•˜λŠ” μ†Œλ§κ³Ό 희망을 λ˜μƒˆκ²¨λ³΄λŠ” μ‹œκ°„μ΄ λ˜μ—ˆμœΌλ©΄ μ’‹κ² λ‹€.'

ONNX

python onnx/export_onnx_kogpt2.py

import torch
import numpy as np
from kogpt2.pytorch_kogpt2 import get_pytorch_kogpt2_model
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.utils import get_tokenizer
import onnxruntime

tok_path = get_tokenizer()
_, vocab = get_pytorch_kogpt2_model()
model = onnxruntime.InferenceSession("./onnx/pytorch_kogpt2_676e9bcfa7.onnx")
tok = SentencepieceTokenizer(tok_path,  num_best=0, alpha=0)
sent = '2019λ…„ ν•œν•΄λ₯Ό 보내며,'
toked = tok(sent)

while 1:
  input_ids = torch.tensor([vocab[vocab.bos_token],]  + vocab[toked]).unsqueeze(0)
  pred = model.run(None, {'input_ids': np.array(input_ids)})[0]
  gen = vocab.to_tokens(torch.argmax(torch.tensor(pred), axis=-1).squeeze().tolist())[-1]
  if gen == '</s>':
      break
  sent += gen.replace('▁', ' ')
  toked = tok(sent)
sent
'2019λ…„ ν•œν•΄λ₯Ό 보내며, μƒˆν•΄μ—λŠ” 더 λ§Žμ€ μ‚¬λžŒλ“€μ΄ μƒˆν•΄μ— 이루고자 ν•˜λŠ” μ†Œλ§κ³Ό 희망을 λ˜μƒˆκ²¨λ³΄λŠ” μ‹œκ°„μ΄ λ˜μ—ˆμœΌλ©΄ μ’‹κ² λ‹€.'

How to deploy the pre-trained KoGPT-2 model to Amazon SageMaker

If you would like to know how to build an inference API for a pre-trained or fine-tuned KoGPT2 model, see the AWS Korea blog post "Deploying a KoGPT2 model using the Amazon SageMaker MXNet inference container" or the aws-samples Git repo.

Demo

KoGPT2-Explorer

Link

Subtask Evaluations

Sentiment Analysis

NSMC dataset

Model                          Test Accuracy
BERT base multilingual cased   0.875
KoBERT                         0.901
KoGPT2                         0.899
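
The fine-tuning setup behind these numbers is not included here. The sketch below shows one possible, purely illustrative way to attach a binary sentiment head to the KoGPT2 backbone for NSMC-style data; the linear head, last-token pooling, optimizer, and placeholder review are all assumptions, not the configuration used for the table above. It also assumes the PyTorch model exposes its underlying transformer as model.transformer, as a standard GPT2LMHeadModel does:
import torch
import torch.nn.functional as F
from gluonnlp.data import SentencepieceTokenizer
from kogpt2.pytorch_kogpt2 import get_pytorch_kogpt2_model
from kogpt2.utils import get_tokenizer

tok_path = get_tokenizer()
tok = SentencepieceTokenizer(tok_path, num_best=0, alpha=0)
model, vocab = get_pytorch_kogpt2_model()
model.train()

head = torch.nn.Linear(768, 2)   # hypothetical binary sentiment head on the 768-d hidden states
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=5e-5)

sent, label = 'REVIEW TEXT HERE', 1                        # placeholder (review, label) pair
input_ids = torch.tensor([vocab[tok(sent)]])               # (1, seq_len) token ids
hidden = model.transformer(input_ids)[0]                   # (1, seq_len, 768) hidden states
logits = head(hidden[:, -1])                               # classify from the last token's state
loss = F.cross_entropy(logits, torch.tensor([label]))
loss.backward()
optimizer.step()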

Paraphrase Detection

Korean Paraphrase Detection dataset

Model    Test Accuracy
KoBERT   0.911
KoGPT2   0.943

Examples

Contacts

Please post any KoGPT2-related issues here.

License

KoGPT2 is released under a modified MIT license. Please comply with the license terms when using the model or the code. The full license text is available in the LICENSE file.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].