These days, when people talk about AI generated art they mean one of two distinct things: Neural Style Transfer or Generative Adversarial Networks. Neural Style Transfer is a name for a family of algorithms that are used to apply the style of one or more existing images to an input image.
Intelligent algorithms are used to create paintings, write poems, and compose music. … According to Christie’s auction advertisement, the portrait was created by artificial intelligence (AI). The media often described this as the first work of art not created by a human but rather autonomously by a machine.
is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.
DALL·E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. We plan to provide more details about the architecture and training procedure in an upcoming paper.
Text-to-image synthesis has been an active area of research since the pioneering work of Reed et. al,1 whose approach uses a GAN conditioned on text embeddings. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN3 and StackGAN++4 use multi-scale GANs to scale up the image resolution and improve visual fidelity. AttnGAN5 incorporates attention between the text and image features, and proposes a contrastive text-image feature matching loss as an auxiliary objective. This is interesting to compare to our reranking with CLIP, which is done offline. Other work267 incorporates additional sources of supervision during training to improve image quality. Finally, work by Nguyen et. al8 and Cho et. al9 explores sampling-based strategies for image generation that leverage pretrained multimodal discriminative models.
Similar to the rejection sampling used in VQVAE-2, we use CLIP to rerank the top 32 of 512 samples for each caption in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search16, and can have a dramatic impact on sample quality.