How does DALL-E 2 work?
The DALL-E 2 generator combines natural language processing and deep learning to convert a text prompt into a multitude of images. During training it learns which connections between language and visual concepts it has to make in order to generate the final product. For this learning process, it builds on the existing CLIP (Contrastive Language-Image Pre-training) model, which was trained on text-image pairs from the Internet and can find the text description that best matches a given image. DALL-E 2 consists of the following two stages:
In the first stage, CLIP is used to encode text-image pairs into a shared embedding space, producing a so-called latent code.
In the second stage, the text is converted into a new image: the latent code of the text is passed through a so-called prior, which predicts a matching image embedding.
The decoder is then used to create variations of the image that match the text. A new image variation is generated in the following steps:
1. First, the text is entered into the text encoder, which was trained as part of the CLIP model to encode text-image pairs.
2. The prior maps the CLIP text embedding to a CLIP image embedding that reflects the information from the text.
3. Finally, the decoder generates new image variations that visually represent the entered text. In this way, a variety of different images can be created from different text inputs.
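The three steps above can be sketched as a pipeline. All function bodies here are hypothetical stand-ins chosen only to show the data flow (text encoder → prior → decoder), not the real DALL-E 2 networks; in the actual system each stage is a large trained model, and the decoder's "seed" corresponds to the randomness that yields different variations of the same prompt.

```python
def text_encoder(prompt):
    # Stand-in for CLIP's text encoder: maps a prompt to a text embedding.
    # (Toy mapping over the first characters, not real CLIP.)
    return [float(ord(c) % 7) for c in prompt[:4]]

def prior(text_embedding):
    # The prior translates the CLIP text embedding into a CLIP image embedding.
    return [x * 0.5 for x in text_embedding]

def decoder(image_embedding, seed):
    # The decoder produces an image conditioned on the image embedding;
    # different seeds give different variations. Here an "image" is just a list.
    return [x + seed * 0.01 for x in image_embedding]

def generate_variations(prompt, n=3):
    # Full pipeline: text -> text embedding -> image embedding -> n image variations.
    image_embedding = prior(text_encoder(prompt))
    return [decoder(image_embedding, seed) for seed in range(n)]

variations = generate_variations("an astronaut riding a horse")
print(len(variations))  # one prompt, several distinct variations
```

The design point this sketch captures is that the prompt is embedded once, while the decoder is sampled repeatedly, which is why DALL-E 2 can return several different images for a single text input.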