Web-Hostable DALLE Checkpoints
Implementation / replication of DALL-E (paper), OpenAI's Text to Image Transformer, in Pytorch. It will also contain CLIP for ranking the generations.
Deep Daze or Big Sleep are great alternatives!
For generating video and audio, please see NÜWA
This library could not have been possible without the contributions of janEbert, Clay, robvanvolt, Romaine, and Alexander! 🙏


(3/15/21) afiaka87 has managed one epoch using a reversible DALL-E and the dVaE here
TheodoreGalanos has trained on 150k layouts with the following results


Thanks to the amazing "mega b#6696" you can generate from this checkpoint in colab -
(5/2/21) First 1.3B DALL-E from 🇷🇺 has been trained and released to the public! 🎉
(4/8/22) Moving onwards to DALLE-2!
$ pip install dalle-pytorch
Train VAE
import torch
from dalle_pytorch import DiscreteVAE
vae = DiscreteVAE(
image_size = 256,
num_layers = 3, # number of downsamples - ex. 256 / (2 ** 3) = (32 x 32 feature map)
num_tokens = 8192, # number of visual tokens. in the paper, they used 8192, but could be smaller for downsized projects
codebook_dim = 512, # codebook dimension
hidden_dim = 64, # hidden dimension
num_resnet_blocks = 1, # number of resnet blocks
temperature = 0.9, # gumbel softmax temperature, the lower this is, the harder the discretization
straight_through = False, # straight-through for gumbel softmax. unclear if it is better one way or the other
)
images = torch.randn(4, 3, 256, 256)
loss = vae(images, return_loss = True)
loss.backward()
# train with a lot of data to learn a good codebook
Train DALL-E with pretrained VAE from above
import torch
from dalle_pytorch import DiscreteVAE, DALLE
vae = DiscreteVAE(
image_size = 256,
num_layers = 3,
num_tokens = 8192,
codebook_dim = 1024,
hidden_dim = 64,
num_resnet_blocks = 1,
temperature = 0.9
)
dalle = DALLE(
dim = 1024,
vae = vae, # automatically infer (1) image sequence length and (2) number of image tokens
num_text_tokens = 10000, # vocab size for text
text_seq_len = 256, # text sequence length
depth = 12, # should aim to be 64
heads = 16, # attention heads
dim_head = 64, # attention head dimension
attn_dropout = 0.1, # attention dropout
ff_dropout = 0.1 # feedforward dropout
)
text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
loss = dalle(text, images, return_loss = True)
loss.backward()
# do the above for a long time with a lot of data ... then
images = dalle.generate_images(text)
images.shape # (4, 3, 256, 256)
To prime with a starting crop of an image, simply pass two more arguments
img_prime = torch.randn(4, 3, 256, 256)
images = dalle.generate_images(
text,
img = img_prime,
num_init_img_tokens = (14 * 32) # you can set the size of the initial crop, defaults to a little less than ~1/2 of the tokens, as done in the paper
)
images.shape # (4, 3, 256, 256)
You may also want to generate text using DALL-E. For that call this function:
text_tokens, texts = dalle.generate_texts(tokenizer, text)
You can also skip the training of the VAE altogether, using the pretrained model released by OpenAI! The wrapper class should take care of downloading and caching the model for you auto-magically.
import torch
from dalle_pytorch import OpenAIDiscreteVAE, DALLE
vae = OpenAIDiscreteVAE() # loads pretrained OpenAI VAE
dalle = DALLE(
dim = 1024,
vae = vae, # automatically infer (1) image sequence length and (2) number of image tokens
num_text_tokens = 10000, # vocab size for text
text_seq_len = 256, # text sequence length
depth = 1, # should aim to be 64
heads = 16, # attention heads
dim_head = 64, # attention head dimension
attn_dropout = 0.1, # attention dropout
ff_dropout = 0.1 # feedforward dropout
)
text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
loss = dalle(text, images, return_loss = True)
loss.backward()
You can also use the pretrained VAE offered by the authors of Taming Transformers! Currently only the VAE with a codebook size of 1024 is offered, with the hope that it may train a little faster than OpenAI's, which has a size of 8192.
In contrast to OpenAI's VAE, it also has an extra layer of downsampling, so the image sequence length is 256 instead of 1024 (this will lead to a 16 reduction in training costs, when you do the math). Whether it will generalize as well as the original DALL-E is up to the citizen scientists out there to discover.
Update - it works!
from dalle_pytorch import VQGanVAE
vae = VQGanVAE()
# the rest is the same as the above example
The default VQGan is the codebook size 1024 one trained on imagenet. If you wish to use a different one, you can use the vqgan_model_path and vqgan_config_path to pass the .ckpt file and the .yaml file. These options can be used both in train-dalle script or as argument of VQGanVAE class. Other pretrained VQGAN can be found in taming transformers readme. If you want to train a custom one you can follow this guide
Recently there has surfaced a new technique for guiding diffusion models without a classifier. The gist of the technique involves randomly dropping out the text condition during training, and at inference time, deriving the rough direction from unconditional to conditional distributions.
Katherine Crowson outlined in a tweet how this could work for autoregressive attention models. I have decided to include her idea in this repository for further exploration. One only has to account for two extra keyword arguments on training (null_cond_prob) and generation (cond_scale).
import torch
from dalle_pytorch import DiscreteVAE, DALLE
vae = DiscreteVAE(
image_size = 256,
num_layers = 3,
num_tokens = 8192,
codebook_dim = 1024,
hidden_dim = 64,
num_resnet_blocks = 1,
temperature = 0.9
)
dalle = DALLE(
dim = 1024,
vae = vae,
num_text_tokens = 10000,
text_seq_len = 256,
depth = 12,
heads = 16,
dim_head = 64,
attn_dropout = 0.1,
ff_dropout = 0.1
)
text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
loss = dalle(
text,
images,
return_loss = True,
null_cond_prob = 0.2 # firstly, set this to the probability of dropping out the condition, 20% is recommended as a default
)
loss.backward()
# do the above for a long time with a lot of data ... then
images = dalle.generate_images(
text,
cond_scale = 3. # secondly, set this to a value greater than 1 to increase the conditioning beyond average
)
images.shape # (4, 3, 256, 256)
That's it!
Train CLIP
import torch
from dalle_pytorch import CLIP
clip = CLIP(
dim_text = 512,
dim_image = 512,
dim_latent = 512,
num_text_tokens = 10000,
text_enc_depth = 6,
text_seq_len = 256,
text_heads = 8,
num_visual_tokens = 512,
visual_enc_depth = 6,
visual_image_size = 256,
visual_patch_size = 32,
visual_heads = 8
)
text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask = torch.ones_like(text).bool()
loss = clip(text, images, text_mask = mask, return_loss = True)
loss.backward()
To get the similarity scores from your trained Clipper, just do
images, scores = dalle.generate_images(text, mask = mask, clip = clip)
scores.shape # (2,)
images.shape # (2, 3, 256, 256)
# do your topk here, in paper they sampled 512 and chose top 32
Or you can just use the official CLIP model to rank the images from DALL-E
In the blog post, they used 64 layers to achieve their results. I added reversible networks, from the Reformer paper, in order for users to attempt to scale depth at the cost of compute. Reversible networks allow you to scale to any depth at no memory cost, but a little over 2x compute cost (each layer is rerun on the backward pass).
Simply set the reversible keyword to True for the DALLE class
dalle = DALLE(
dim = 1024,
vae = vae,
num_text_tokens = 10000,
text_seq_len = 256,
depth = 64,
heads = 16,
reversible = True # <-- reversible networks https://arxiv.org/abs/2001.04451
)
The blogpost alluded to a mixture of different types of sparse attention, used mainly on the image (while the text presumably had full causal attention). I have done my best to replicate these types of sparse attention, on the scant details released. Primarily, it seems as though they are doing causal axial row / column attention, combined with a causal convolution-like attention.
By default DALLE will use full attention for all layers, but you can specify the attention type per layer as follows.
full full attention
axial_row axial attention, along the rows of the image feature map
axial_col axial attention, along the columns of the image feature map
conv_like convolution-like attention, for the image feature map
The sparse attention only applies to the image. Text will always receive full attention, as said in the blogpost.
dalle = DALLE(
dim = 1024,
vae = vae,
num_text_tokens = 10000,
text_seq_len = 256,
depth = 64,
heads = 16,
reversible = True,
attn_types = ('full', 'axial_row', 'axial_col', 'conv_like') # cycles between these four types of attention
)
You can also train with Microsoft Deepspeed's <a href="https://www.deepspe
$ claude mcp add DALLE-pytorch \
-- python -m otcore.mcp_server <graph>