
KR-HappyFace / KoDALLE

License: MIT
πŸ‡°πŸ‡· Text to Image in Korean

Programming Languages

Python

Projects that are alternatives to or similar to KoDALLE

VQGAN-CLIP-Docker
Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized
Stars: ✭ 58 (+5.45%)
Mutual labels:  text-to-image, vqgan
ru-dalle
Generate images from texts. In Russian
Stars: ✭ 1,606 (+2820%)
Mutual labels:  text-to-image, dalle
feed forward vqgan clip
Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt
Stars: ✭ 135 (+145.45%)
Mutual labels:  text-to-image, vqgan
korean-dev-books
πŸ“š A curated list of Korean development/CS books
Stars: ✭ 51 (-7.27%)
Mutual labels:  korean
NavilIME
Windows Hangul (Korean) Input Method Editor based on TSF
Stars: ✭ 79 (+43.64%)
Mutual labels:  korean
hangul-online
Hangul fonts storage and viewer
Stars: ✭ 16 (-70.91%)
Mutual labels:  korean
CppKoreaSeminar6th
Presentation materials and sample code from the 6th C++ Korea seminar, held on September 29, 2019
Stars: ✭ 43 (-21.82%)
Mutual labels:  korean
National-Petition
Gauging public opinion by analyzing Blue House national petitions πŸ“ˆπŸ”¬
Stars: ✭ 45 (-18.18%)
Mutual labels:  korean
korean-romanizer
A Java library that converts Korean text into its romanized form
Stars: ✭ 38 (-30.91%)
Mutual labels:  korean
BERT-embedding
A simple wrapper class for extracting features(embedding) and comparing them using BERT in TensorFlow
Stars: ✭ 24 (-56.36%)
Mutual labels:  korean
tutorials-kr
πŸ‡°πŸ‡·νŒŒμ΄ν† μΉ˜μ—μ„œ μ œκ³΅ν•˜λŠ” νŠœν† λ¦¬μ–Όμ˜ ν•œκ΅­μ–΄ λ²ˆμ—­μ„ μœ„ν•œ μ €μž₯μ†Œμž…λ‹ˆλ‹€. (Translate PyTorch tutorials in KoreanπŸ‡°πŸ‡·)
Stars: ✭ 271 (+392.73%)
Mutual labels:  korean
keras-text-to-image
Translate text to image in Keras using GAN and Word2Vec as well as recurrent neural networks
Stars: ✭ 60 (+9.09%)
Mutual labels:  text-to-image
kss
Kss: A Toolkit for Korean sentence segmentation
Stars: ✭ 198 (+260%)
Mutual labels:  korean
text-to-image
Text to Image Synthesis using Generative Adversarial Networks
Stars: ✭ 72 (+30.91%)
Mutual labels:  text-to-image
redux-saga-in-korean
A Korean translation project for the official redux-saga documentation.
Stars: ✭ 73 (+32.73%)
Mutual labels:  korean
KoreanTextMatcher
A library for approximate Hangul syllable matching and initial-consonant (choseong) search
Stars: ✭ 39 (-29.09%)
Mutual labels:  korean
im2txt2im
I2T2I: Text-to-Image Synthesis with textual data augmentation
Stars: ✭ 29 (-47.27%)
Mutual labels:  text-to-image
iOS-Programming-Documents
iOS Programming Documents in Korean
Stars: ✭ 64 (+16.36%)
Mutual labels:  korean
autocorr kr
A crowdsourced data repository for the LibreOffice Autocorrect feature
Stars: ✭ 15 (-72.73%)
Mutual labels:  korean
hama-py
πŸ¦› A Python library for processing Korean text. Python Korean Morphological Analyzer
Stars: ✭ 16 (-70.91%)
Mutual labels:  korean

KoDALLE

[Badges: Wandb training log]


KoDALLE trains DALLE from scratch, reusing the token embedding layer and position embedding layer of a pretrained language model (PLM) in the target language as the text encoder.
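
The idea, in a minimal sketch (assuming the Hugging Face transformers API; this is not the repository's exact code), is to pull the pretrained embedding layers out of the PLM and use them in place of DALLE's randomly initialized text embeddings:

```python
# Minimal sketch: extract a Korean PLM's embedding layers for reuse as the
# DALLE text encoder. Assumes the Hugging Face transformers API.
from transformers import AutoModel

plm = AutoModel.from_pretrained("klue/roberta-large")

wte = plm.embeddings.word_embeddings      # token embeddings: 32000 x 1024
wpe = plm.embeddings.position_embeddings  # learned position embeddings
```

Because these layers already encode Korean token semantics, the DALLE transformer does not have to learn a language representation from a small domain-specific caption set.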

Background

πŸ“‚ For full project details, please refer to README.pdf

  • Training a DALLE model from scratch demands a large paired dataset of images and captions. OpenAI's DALLE, for example, was trained on more than 250 million text-image pairs.
  • If the dataset is not large enough or is limited to a specific domain, the vocabulary of the trained DALLE model is insufficient. For instance, the 1 million text captions of the K-Fashion dataset contain only around 300 unique tokens, as the sketch below illustrates.
  • Inference with such DALLE models is therefore problematic whenever the input query strays from the text distribution of the original training captions.
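
The vocabulary limitation is easy to verify empirically. Below is a minimal sketch that counts unique whitespace-separated tokens in a caption file; the file name captions.txt and the tokenization scheme are illustrative assumptions, not the project's actual pipeline.

```python
# Illustrative vocabulary check; file name and whitespace tokenization are
# assumptions, not the project's pipeline.
from collections import Counter

with open("captions.txt", encoding="utf-8") as f:
    vocab = Counter(tok for line in f for tok in line.split())

print(len(vocab), "unique tokens")  # a narrow domain corpus may yield only a few hundred
```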

KoDALLE's Results on a Small Fashion Dataset

|                    | OpenAI's DALLE            | KoDALLE of HappyFace                      |
|--------------------|---------------------------|-------------------------------------------|
| Train Dataset Size | 250 million pairs         | 0.8 million pairs                         |
| # Params           | 12 billion                | 428 million                               |
| # Layers           | 64 layers                 | 16 layers                                 |
| Computing Resource | 1024 × V100 16GB          | 1 × V100 32GB                             |
| Text Encoder       | 16384 vocab × 512 dim BPE | 32000 vocab × 1024 dim klue/roberta-large |
| Image Encoder      | VQVAE                     | VQGAN                                     |
| Optimizer          | AdamW                     | AdamW                                     |
| Learning Rate      | 4.5e-5                    | 3.0e-5                                    |
| Weight Decay       | 4.5e-3                    | 3.0e-3                                    |
| LR Scheduler       | ReduceLROnPlateau         | -                                         |

The team built a Korean text-to-fashion-design DALLE model with fewer than 100k sampled text-image pairs.

Caption ν•˜μ˜μ—μ„œ 색상은 μŠ€μΉ΄μ΄λΈ”λ£¨μ΄λ‹€. μƒμ˜μ—μ„œ κΈ°μž₯은 둱이닀. 색상은 ν™”μ΄νŠΈμ΄λ‹€. μΉ΄ν…Œκ³ λ¦¬λŠ” λΈ”λΌμš°μŠ€μ΄λ‹€. λ””ν…ŒμΌμ—λŠ” 셔링이닀. μ†Œλ§€κΈ°μž₯은 λ°˜νŒ”μ΄λ‹€. μ†Œμž¬μ—λŠ” 싀크이닀. ν”„λ¦°νŠΈμ—λŠ” 무지이닀. λ„₯라인은 브이λ„₯이닀. 핏은 λ…Έλ©€
Generated Image image
Caption μ•„μš°ν„°λŠ” 색상이 μΉ΄ν‚€ μ†Œμž¬κ°€ 우븐 핏이 루즈인 μ½”νŠΈμ΄λ‹€. ν•˜μ˜λŠ” 색상이 넀이비 μ†Œμž¬κ°€ λ°λ‹˜ 핏이 μŠ€ν‚€λ‹ˆμΈ 청바지이닀.
Generated Image image
Caption ν•˜μ˜μ—μ„œ κΈ°μž₯은 발λͺ©μ΄λ‹€. 색상은 블루이닀. μΉ΄ν…Œκ³ λ¦¬λŠ” μŠ€μ»€νŠΈμ΄λ‹€. μ†Œμž¬μ—λŠ” λ°λ‹˜μ΄λ‹€. 핏은 μ™€μ΄λ“œμ΄λ‹€. μƒμ˜μ—μ„œ 색상은 ν™”μ΄νŠΈμ΄λ‹€. μΉ΄ν…Œκ³ λ¦¬λŠ” λΈ”λΌμš°μŠ€μ΄λ‹€. λ””ν…ŒμΌμ—λŠ” 셔링이닀. μ†Œλ§€κΈ°μž₯은 λ°˜νŒ”μ΄λ‹€. μ†Œμž¬μ—λŠ” μš°λΈμ΄λ‹€.
Generated Image image
Caption μƒμ˜μ—μ„œ κΈ°μž₯은 노멀이닀. μƒμ˜μ—μ„œ 색상은 ν™”μ΄νŠΈμ΄λ‹€. μƒμ˜μ—μ„œ μ„œλΈŒμƒ‰μƒμ€ λΈ”λž™μ΄λ‹€. μƒμ˜μ—μ„œ μΉ΄ν…Œκ³ λ¦¬λŠ” 티셔츠이닀. μƒμ˜μ—μ„œ μ†Œλ§€κΈ°μž₯은 λ°˜νŒ”μ΄λ‹€. μƒμ˜μ—μ„œ μ†Œμž¬μ—λŠ” 저지이닀. μƒμ˜μ—μ„œ ν”„λ¦°νŠΈμ—λŠ” λ ˆν„°λ§μ΄λ‹€. μƒμ˜μ—μ„œ λ„₯라인은 λΌμš΄λ“œλ„₯이닀. μƒμ˜μ—μ„œ 핏은 λ£¨μ¦ˆμ΄λ‹€.
Generated Image image
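
Captions like these are tokenized into the 32000-token klue/roberta-large vocabulary before being fed to the model. A quick illustration, assuming the Hugging Face tokenizer API (not the repository's exact preprocessing code):

```python
# Illustration: a Korean caption mapped into the klue/roberta-large
# vocabulary that KoDALLE reuses, padded to the model's text length.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
encoded = tokenizer(
    "ν•˜μ˜μ—μ„œ 색상은 μŠ€μΉ΄μ΄λΈ”λ£¨μ΄λ‹€. μΉ΄ν…Œκ³ λ¦¬λŠ” λΈ”λΌμš°μŠ€μ΄λ‹€.",
    padding="max_length",
    max_length=128,  # TEXT_SEQ_LEN
    truncation=True,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 128])
```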

Methodology

Experiments were conducted with the embedding layers of several Korean Transformer models. The team selected klue/roberta-large as the baseline in this repository, considering the size of the model.

KoDALLE with klue/roberta-large's wpe (position embeddings) and wte (token embeddings) was trained on a single 32GB V100 GPU. The hyperparameters that determine the DALLE model size are as follows.

BATCH_SIZE: 40
DEPTH: 16
TEXT_SEQ_LEN: 128
VOCAB_SIZE: 32000
MODEL_DIM: 1024
ATTN_TYPES: full
DIM_HEAD: 64
HEADS: 8
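
Assuming the lucidrains/dalle-pytorch API, whose parameter names these hyperparameters mirror, the configuration maps onto a model roughly as follows; the batch tensors below are placeholders, not the repository's training script:

```python
# Sketch of how the hyperparameters above map onto a DALLE instance,
# assuming the lucidrains/dalle-pytorch API.
import torch
from dalle_pytorch import DALLE, VQGanVAE

# VQGAN image encoder/decoder; defaults to a pretrained checkpoint, and
# custom checkpoint/config paths can be passed instead.
vae = VQGanVAE()

dalle = DALLE(
    dim=1024,               # MODEL_DIM
    vae=vae,
    num_text_tokens=32000,  # VOCAB_SIZE (klue/roberta-large)
    text_seq_len=128,       # TEXT_SEQ_LEN
    depth=16,               # DEPTH
    heads=8,                # HEADS
    dim_head=64,            # DIM_HEAD
    attn_types=("full",),   # ATTN_TYPES
)

# One training step on placeholder tensors (BATCH_SIZE = 40).
text = torch.randint(0, 32000, (40, 128))  # token ids, length TEXT_SEQ_LEN
images = torch.rand(40, 3, 256, 256)       # pixel values in [0, 1]
loss = dalle(text, images, return_loss=True)
loss.backward()

# After training, sampling is autoregressive:
# generated = dalle.generate_images(text[:1])
```

With AdamW at the 3.0e-5 learning rate and 3.0e-3 weight decay listed in the comparison table above, this configuration fits on a single 32GB V100.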

Significance

  • Offers promising results for training from scratch on a specific domain with a small dataset.
  • Introduces a solution for making domain-specific DALLE & CLIP models robust to free-form input sentences.
  • Recommends an adequate text-to-image model size for a given computing resource.
  • Suggests a low-effort method for building DALLE & CLIP models in other languages whenever a pretrained language model is available.

WIP

  • Add an image-caption reranker (EfficientNet + klue/roberta-large).
  • Train the model with 500k text-image pairs.
  • Modularize the Python code.
  • Update the inference code.
  • Report FID and IS metrics on the test and validation datasets.

Citations

@misc{ramesh2021zeroshot,
    title   = {Zero-Shot Text-to-Image Generation},
    author  = {Aditya Ramesh and Mikhail Pavlov and Gabriel Goh and Scott Gray and Chelsea Voss and Alec Radford and Mark Chen and Ilya Sutskever},
    year    = {2021},
    eprint  = {2102.12092},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{esser2021taming,
    title   = {Taming Transformers for High-Resolution Image Synthesis},
    author  = {Patrick Esser and Robin Rombach and BjΓΆrn Ommer},
    year    = {2021},
    eprint  = {2012.09841},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}