WORKING OF DALL-E-2 (PART.2)

Harris Rashid
4 min read · Sep 7, 2022


Part 1: https://medium.com/@harrisrashid839/a-birds-eye-of-the-procedure-behind-text-to-picture-conversion-of-dall-e-2-fea10a1fa015?source=friends_link&sk=09e0cfa94b690e732ece9e87f6cbe9e5

A photorealistic image of an astronaut riding a horse
(Image source: https://cdn.openai.com/dall-e-2/demos/text2im/astronaut/horse/photo/9.jpg)

It’s time to delve deeper into the sophisticated and complex mechanism behind DALL-E 2’s image generation, since the first article only covered the fundamentals of how it operates. Let’s begin by examining how DALL-E 2 develops connections between related textual and visual concepts. To make the procedure easier to grasp, it has been divided into phases.

1) Linking Textual and Visual Semantics

Another OpenAI model, CLIP (Contrastive Language-Image Pre-training), is used to learn the relationship between textual semantics and their visual representations in DALL-E 2. CLIP is trained on hundreds of millions of images and their accompanying captions, learning how well a given piece of text corresponds to an image.

In other words, rather than attempting to predict a caption given an image, CLIP simply learns how closely related any given caption is to a particular image. This contrastive, rather than predictive, objective is what lets CLIP understand the relationship between textual and visual representations of the same abstract object. Since its capacity to acquire semantics from natural language is the foundation of the entire model, let’s examine how CLIP fits into the inner workings of DALL-E 2. Instead of a thorough explanation, we’ll offer a high-level overview of how things work. The basic tenets of CLIP training are relatively straightforward:

a) All images and captions are first run through their respective encoders, which map them into an m-dimensional representation space.

b) After that, each (image, text) pair is given a cosine similarity score.

c) The goal of training is to maximise the cosine similarity between the N correctly paired image/caption encodings while simultaneously minimising the cosine similarity between the N² − N incorrect pairings.

CLIP is vital to text-conditional image generation because it is ultimately what determines how semantically related a natural language snippet is to a visual concept.
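To make the contrastive objective above more concrete, here is a minimal PyTorch-style sketch of a CLIP-like training loss. The encoders themselves are assumed to exist elsewhere; this is an illustrative sketch, not OpenAI’s actual implementation.

```python
# Minimal sketch of a CLIP-style contrastive loss (hypothetical encoders
# produce the embeddings; this is not OpenAI's actual code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # L2-normalize so the dot product equals cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # N x N matrix of cosine similarities for every (image, caption) pairing.
    logits = image_embeds @ text_embeds.t() / temperature

    # The N matching pairs lie on the diagonal; the N^2 - N off-diagonal
    # entries are the incorrect pairings whose similarity is pushed down.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # images -> captions
    loss_texts = F.cross_entropy(logits.t(), targets)   # captions -> images
    return (loss_images + loss_texts) / 2
```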

2) Creating Images from Visual Semantics

Upon completion of training, the CLIP model is frozen while DALL-E 2 moves on to its next task: learning to reverse the image encoding mapping that CLIP has just learned. CLIP learns a representation space in which the relationship between textual and visual encodings is easy to determine, but since our focus is image generation, we need to learn how to exploit that representation space to produce pictures.

Specifically, OpenAI performs this image generation with a modified version of GLIDE, one of its earlier models. The GLIDE model learns to invert the image encoding process, stochastically decoding CLIP image embeddings. It should be highlighted that the objective is not to build an autoencoder that exactly reconstructs an image from its embedding, but rather to produce an image that preserves the salient features of the original image given its embedding. GLIDE performs this image generation with a Diffusion Model, which learns to generate data by reversing a gradual noising process. GLIDE extends the core idea of Diffusion Models by adding textual information to the training process, producing text-conditional images as a result. GLIDE is crucial to DALL-E 2 because, by conditioning on image encodings in the representation space rather than on text, the authors could easily transfer GLIDE’s text-conditional photorealistic image generation capabilities to DALL-E 2.
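The following is a toy sketch of the forward (noising) process that a diffusion model learns to reverse. It is generic DDPM-style code, not GLIDE itself, and the `model`, `x0`, and `clip_image_embed` names in the commented training step are hypothetical placeholders.

```python
# Toy sketch of the forward noising process a diffusion model learns to
# reverse; generic DDPM-style code, not GLIDE's actual implementation.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

# Hypothetical training step (model, x0, clip_image_embed are placeholders):
# t = torch.randint(0, T, (x0.size(0),))
# x_t, noise = add_noise(x0, t)
# pred = model(x_t, t, clip_image_embed)   # conditioned on the CLIP embedding
# loss = torch.nn.functional.mse_loss(pred, noise)
```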

3) Mapping from Textual Semantics to Corresponding Visual Semantics

Remember that, in addition to an image encoder, CLIP also trains a text encoder. To map from the text encodings of image captions to the image encodings of the corresponding images, DALL-E 2 uses another model, which the authors refer to as the prior. The DALL-E 2 authors experiment with both autoregressive and diffusion models for the prior but find that they yield comparable performance. The Diffusion Model is chosen as the prior for DALL-E 2 because it is significantly more computationally efficient. The diffusion prior in DALL-E 2 is a decoder-only Transformer. It operates, with a causal attention mask, on the following sequence (a sketch of how such a sequence might be assembled follows the list):

a) The tokenized text/caption.

b) The CLIP text encodings of these tokens.

c) An encoding for the diffusion timestep.

d) The noised image passed through the CLIP image encoder.

e) A final encoding whose output from the Transformer is used to predict the un-noised CLIP image encoding.
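Below is a conceptual sketch of assembling the prior’s input sequence in the order listed above. Every embedding module and argument name here is a hypothetical placeholder used purely for illustration.

```python
# Conceptual sketch of building the diffusion prior's input sequence
# (all arguments are hypothetical pre-computed embeddings).
import torch

def build_prior_sequence(caption_token_embeds, clip_text_embed,
                         timestep_embed, noised_image_embed,
                         final_query_embed):
    # caption_token_embeds: (batch, n_tokens, d); the rest: (batch, d).
    pieces = [
        caption_token_embeds,                 # a) tokenized caption
        clip_text_embed.unsqueeze(1),         # b) CLIP text encoding
        timestep_embed.unsqueeze(1),          # c) diffusion timestep encoding
        noised_image_embed.unsqueeze(1),      # d) noised CLIP image encoding
        final_query_embed.unsqueeze(1),       # e) final token; its Transformer
                                              #    output predicts the un-noised
                                              #    CLIP image encoding
    ]
    # The decoder-only Transformer processes this with a causal attention mask.
    return torch.cat(pieces, dim=1)
```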

4) Putting it all together

We already have all of DALL-E 2’s working parts and simply need to connect them in a chain to create text-conditional images:

a) The image caption is first mapped into the representation space by the CLIP text encoder.

b) After that, the diffusion prior maps from this CLIP text encoding to a corresponding CLIP image encoding.

c) Finally, the modified-GLIDE generation model maps from the representation space back into image space via reverse diffusion, producing one of many possible images that convey the semantic information of the input caption.

The generated image is then made available to the user.
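The chain above can be summarized in a few lines of pseudocode. The encoder, prior, and decoder arguments are hypothetical callables standing in for the three components, not OpenAI’s actual API.

```python
# High-level sketch of chaining DALL-E 2's components; the three callables
# passed in are hypothetical stand-ins for the real models.
def generate_image(caption, clip_text_encoder, diffusion_prior, glide_decoder):
    text_embed = clip_text_encoder(caption)      # a) caption -> representation space
    image_embed = diffusion_prior(text_embed)    # b) text encoding -> image encoding
    return glide_decoder(image_embed)            # c) image encoding -> pixels
```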
