
mehdidc / feed_forward_vqgan_clip

License: MIT
Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to, or similar to, feed_forward_vqgan_clip

clip-guided-diffusion
A CLI tool/python module for generating images from text using guided diffusion and CLIP from OpenAI.
Stars: ✭ 260 (+92.59%)
Mutual labels:  text-to-image, openai-clip
KoDALLE
🇰🇷 Text to Image in Korean
Stars: ✭ 55 (-59.26%)
Mutual labels:  text-to-image, vqgan
CLIP-Guided-Diffusion
Just playing with getting CLIP Guided Diffusion running locally, rather than having to use colab.
Stars: ✭ 328 (+142.96%)
Mutual labels:  text-to-image, openai-clip
VQGAN-CLIP-Docker
Zero-Shot Text-to-Image Generation VQGAN+CLIP Dockerized
Stars: ✭ 58 (-57.04%)
Mutual labels:  text-to-image, vqgan
ru-dalle
Generate images from texts. In Russian
Stars: ✭ 1,606 (+1089.63%)
Mutual labels:  text-to-image
Triple Gan
See Triple-GAN-V2 in PyTorch: https://github.com/taufikxu/Triple-GAN
Stars: ✭ 203 (+50.37%)
Mutual labels:  generative-model
Variational Ladder Autoencoder
Implementation of VLAE
Stars: ✭ 196 (+45.19%)
Mutual labels:  generative-model
Dragan
A stable algorithm for GAN training
Stars: ✭ 189 (+40%)
Mutual labels:  generative-model
AC-VRNN
PyTorch code for CVIU paper "AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction"
Stars: ✭ 21 (-84.44%)
Mutual labels:  generative-model
trVAE
Conditional out-of-distribution prediction
Stars: ✭ 47 (-65.19%)
Mutual labels:  generative-model
universum-contracts
text-to-image generation gems / libraries incl. moonbirds, cyberpunks, coolcats, shiba inu doge, nouns & more
Stars: ✭ 17 (-87.41%)
Mutual labels:  text-to-image
Tf Vqvae
Tensorflow Implementation of the paper [Neural Discrete Representation Learning](https://arxiv.org/abs/1711.00937) (VQ-VAE).
Stars: ✭ 226 (+67.41%)
Mutual labels:  generative-model
idg
Document image generator
Stars: ✭ 40 (-70.37%)
Mutual labels:  text-to-image
Neuralnetworks.thought Experiments
Observations and notes to understand the workings of neural network models and other thought experiments using Tensorflow
Stars: ✭ 199 (+47.41%)
Mutual labels:  generative-model
caffe-simnets
The SimNets Architecture's Implementation in Caffe
Stars: ✭ 13 (-90.37%)
Mutual labels:  generative-model
Voxel Flow
Video Frame Synthesis using Deep Voxel Flow (ICCV 2017 Oral)
Stars: ✭ 191 (+41.48%)
Mutual labels:  generative-model
VQGAN-CLIP
Just playing with getting VQGAN+CLIP running locally, rather than having to use colab.
Stars: ✭ 2,369 (+1654.81%)
Mutual labels:  text-to-image
InpaintNet
Code accompanying ISMIR'19 paper titled "Learning to Traverse Latent Spaces for Musical Score Inpainting"
Stars: ✭ 48 (-64.44%)
Mutual labels:  generative-model
glico-learning-small-sample
Generative Latent Implicit Conditional Optimization when Learning from Small Sample ICPR 20'
Stars: ✭ 20 (-85.19%)
Mutual labels:  generative-model
Sgan
Stacked Generative Adversarial Networks
Stars: ✭ 240 (+77.78%)
Mutual labels:  generative-model

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt. This is done by training a model that takes a text prompt as input and returns the corresponding VQGAN latent code as output, which is then decoded into an RGB image. The model is trained on a dataset of text prompts and can be applied to unseen prompts. The loss minimizes the distance between the CLIP features of the generated image and the CLIP features of the input text. Additionally, a diversity loss can be used to increase the diversity of the images generated for the same prompt.
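
Concretely, training amounts to one feed-forward pass from text to latents, a VQGAN decode, and a CLIP comparison. Below is a minimal sketch of that objective in PyTorch; the component names (model, vqgan, perceptor), the resizing step, and the zeroed-out diversity term are illustrative assumptions rather than the repository's actual code:

import torch
import torch.nn.functional as F

def training_step(model, vqgan, perceptor, tokenized_texts):
    # Illustrative sketch, not the repo's implementation.
    # Encode the text prompts with CLIP (target side of the loss).
    text_features = F.normalize(perceptor.encode_text(tokenized_texts).float(), dim=-1)

    # Predict VQGAN latents directly from the text features -- no per-prompt optimization.
    z = model(text_features)

    # Decode the latents into RGB images with the pre-trained VQGAN decoder.
    images = vqgan.decode(z)

    # CLIP expects 224x224 inputs; resize (CLIP's input normalization omitted for brevity).
    images_224 = F.interpolate(images, size=(224, 224), mode="bilinear", align_corners=False)
    image_features = F.normalize(perceptor.encode_image(images_224).float(), dim=-1)

    # Main loss: cosine distance between generated-image features and input-text features.
    clip_loss = (1.0 - (image_features * text_features).sum(dim=-1)).mean()

    # Optional diversity term (kept at zero in this sketch): penalize pairs of images
    # generated from the same prompt for being too similar to each other.
    diversity_loss = torch.tensor(0.0, device=clip_loss.device)

    return clip_loss + diversity_loss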

Open In Colab

Run it on Replicate

News

How to install?

Download the ImageNet VQGAN with the 16384-entry codebook (f=16)

Links:

Install dependencies.

conda

conda create -n ff_vqgan_clip_env python=3.8
conda activate ff_vqgan_clip_env
# Install pytorch/torchvision - See https://pytorch.org/get-started/locally/ for more info.
(ff_vqgan_clip_env) conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
(ff_vqgan_clip_env) pip install -r requirements.txt

pip/venv

conda deactivate # Make sure to use your global python3
# venv ships with Python 3, so no extra install is needed.
python3 -m venv ./ff_vqgan_clip_venv
source ./ff_vqgan_clip_venv/bin/activate
$ (ff_vqgan_clip_venv) python -m pip install -r requirements.txt

Optional requirements

  • If you want to use priors (see 09 July 2022 release), please install Net2Net, e.g. with pip install git+https://github.com/CompVis/net2net

How to use?

(Optional) Pre-tokenize Text

$ (ff_vqgan_clip_venv) python main.py tokenize data/list_of_captions.txt cembeds 128
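
For reference, pre-tokenization simply converts each caption into CLIP token ids once, so the training loop does not have to re-tokenize text on every epoch. A minimal sketch of the idea, assuming OpenAI's clip package; the output file name is illustrative and the exact on-disk format written by main.py may differ:

import torch
import clip  # https://github.com/openai/CLIP

# Read the raw captions, one per line.
with open("data/list_of_captions.txt") as f:
    captions = [line.strip() for line in f if line.strip()]

# Convert each caption to a fixed-length tensor of CLIP token ids.
# truncate=True keeps captions longer than CLIP's 77-token context from raising an error.
tokens = clip.tokenize(captions, truncate=True)  # shape: (num_captions, 77)

# Save the pre-computed tokens for the training loop to load directly (illustrative path).
torch.save(tokens, "cembeds/captions_tokenized.pt")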

Train

Modify configs/example.yaml as needed.

$ (ff_vqgan_clip_venv) python main.py train configs/example.yaml

Tensorboard:

Loss values are logged for TensorBoard.

# in a new terminal/session
(ff_vqgan_clip_venv) pip install tensorboard
(ff_vqgan_clip_venv) tensorboard --logdir results

Generate images

After downloading a model (see Pre-trained models available below) or finishing training your own model, you can test it with new prompts, e.g.:

  • wget https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.2/cc12m_32x1024_vitgan.th
  • python -u main.py test cc12m_32x1024_vitgan.th "Picture of a futuristic snowy city during the night, the tree is lit with a lantern"

You can also use the priors to generate multiple images for the same text prompt, e.g.:

  • wget https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.4/cc12m_32x1024_mlp_mixer_openclip_laion2b_ViTB32_256x256_v0.4.th
  • wget https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.4/prior_cc12m_2x1024_openclip_laion2b_ViTB32_v0.4.th
  • python main.py test cc12m_32x1024_mlp_mixer_openclip_laion2b_ViTB32_256x256_v0.4.th "bedroom from 1700" --prior-path=prior_cc12m_2x1024_openclip_laion2b_ViTB32_v0.4.th --nb-repeats=4 --images-per-row=4
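
If you would rather call a checkpoint from Python than through the CLI, the pipeline behind main.py test can be approximated as below. This is a rough sketch under assumptions: the network, the VQGAN decoder, and CLIP are passed in already loaded, and the released .th files are treated as plain torch checkpoints; the actual loading helpers in main.py may differ.

import torch
import clip
from torchvision.utils import save_image

@torch.no_grad()
def generate(net, vqgan, perceptor, prompt, out_path="out.png", device="cuda"):
    # Encode the prompt with CLIP, exactly as at training time.
    tokens = clip.tokenize([prompt]).to(device)
    text_features = perceptor.encode_text(tokens).float()

    # A single forward pass predicts the VQGAN latents -- no per-prompt optimization.
    z = net(text_features)

    # Decode the latents into an RGB image and save it.
    image = vqgan.decode(z)
    save_image(image.clamp(0, 1), out_path)

# Assumed usage (loading details may differ from main.py):
# net = torch.load("cc12m_32x1024_vitgan.th", map_location="cuda").eval()
# generate(net, vqgan, perceptor, "bedroom from 1700")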

You can also try all the models in the Colab Notebook and in Replicate. Using the notebook, you can generate images from pre-trained models and do interpolations between text prompts to create videos, see for instance video 1 or video 2 or video 3.

Pre-trained models

Version 0.4

Name | Type | Size | Dataset | Link | Author
cc12m_32x1024_mlp_mixer_clip_ViTB32_pixelrecons_256x256 | MLPMixer | 1.2GB | Conceptual captions 12M | Download | @mehdidc
cc12m_32x1024_mlp_mixer_openclip_laion2b_ViTB32_256x256 | MLPMixer | 1.2GB | Conceptual captions 12M | Download | @mehdidc
cc12m_32x1024_mlp_mixer_openclip_laion2b_imgEmb_ViTB32_256x256 | MLPMixer | 1.2GB | Conceptual captions 12M | Download | @mehdidc
cc12m_1x1024_mlp_mixer_openclip_laion2b_ViTB32_512x512 | MLPMixer | 580MB | Conceptual captions 12M | Download | @mehdidc
prior_cc12m_2x1024_openclip_laion2b_ViTB32 | Net2Net | 964MB | Conceptual captions 12M | Download | @mehdidc
prior_cc12m_2x1024_clip_ViTB32 | Net2Net | 964MB | Conceptual captions 12M | Download | @mehdidc

Version 0.3

Name | Type | Size | Dataset | Link | Author
cc12m_32x1024_mlp_mixer_clip_ViTB32_256x256 | MLPMixer | 1.19GB | Conceptual captions 12M | Download | @mehdidc
cc12m_32x1024_mlp_mixer_cloob_rn50_256x256 | MLPMixer | 1.32GB | Conceptual captions 12M | Download | @mehdidc
cc12m_256x16_xtransformer_clip_ViTB32_512x512 | Transformer | 571MB | Conceptual captions 12M | Download | @mehdidc

Version 0.2

Name | Type | Size | Dataset | Link | Author
cc12m_8x128 | MLPMixer | 12.1MB | Conceptual captions 12M | Download | @mehdidc
cc12m_32x1024 | MLPMixer | 1.19GB | Conceptual captions 12M | Download | @mehdidc
cc12m_32x1024 | VitGAN | 1.55GB | Conceptual captions 12M | Download | @mehdidc

Version 0.1

Name | Type | Size | Dataset | Link | Author
cc12m_8x128 | VitGAN | 12.1MB | Conceptual captions 12M | Download | @mehdidc
cc12m_16x256 | VitGAN | 60.1MB | Conceptual captions 12M | Download | @mehdidc
cc12m_32x512 | VitGAN | 408.4MB | Conceptual captions 12M | Download | @mehdidc
cc12m_32x1024 | VitGAN | 1.55GB | Conceptual captions 12M | Download | @mehdidc
cc12m_64x1024 | VitGAN | 3.05GB | Conceptual captions 12M | Download | @mehdidc
bcaptmod_8x128 | VitGAN | 11.2MB | Modified blog captions | Download | @afiaka87
bcapt_16x128 | MLPMixer | 168.8MB | Blog captions | Download | @mehdidc

NB: cc12m_AxB means a model trained on Conceptual Captions 12M with depth A and hidden state dimension B; for example, cc12m_32x1024 has depth 32 and hidden dimension 1024.

Acknowledgements
