# Exploring Diffusion Models: How AI Creates Images from Text

In recent discussions, I have frequently mentioned my enthusiasm for new AI tools that create images from text prompts, a hallmark of the current AI renaissance (I subscribe to Midjourney monthly).

My impression of AI research has often been that it's a competitive scramble to optimize complex models to achieve benchmarks, sometimes without a thorough understanding of the underlying mechanics. This has led to a slight aversion on my part to delve deeply into these models, particularly given the rapid pace of innovation. There’s always a concern: what if I invest significant effort into understanding a cutting-edge model that becomes outdated tomorrow?

However, the recent breakthroughs in image generation models—where users can input a description and receive a high-quality image in return (with Midjourney, Dall-E, and open-source Stable Diffusion among the key players)—have compelled me to engage more seriously with this field. And all it took was a bit of attention.

As I explored further, I was intrigued to discover that the theoretical framework behind these diffusion models is quite profound, drawing on concepts from statistical thermodynamics, and intriguingly, it involves elements of time travel. That said, the "artistic hacking" I previously mentioned still exists.

In this article, I will summarize my findings from an intense few weeks of reviewing key papers and will provide a high-level overview of the concepts involved.

All images featured here, unless specified otherwise, are my own.

## I) A Basic Text-to-Image Model

Let’s imagine starting from the ground up. Our goal is to construct a model that can take a text prompt and produce an image. Initially, we'll focus on creating something functional without being overly concerned about quality or performance.

Our basic tool is the neural network, which can map vectors from one space to another and can learn a wide range of mappings if provided with sufficient training data. However, we don't start with vectors; we begin with text prompts and want images in return.

But think about it—images can be represented as vectors. Each image comprises three color channels: red, green, and blue. Each channel can be visualized as a two-dimensional grid of integers. By flattening these integers into a one-dimensional array, we effectively create vectors. For example, a 256x256 image yields a vector with 256x256x3 = 196,608 elements.
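As a concrete sketch, flattening an RGB image into a vector is a one-liner with NumPy (the image here is random, just to show the shapes):

```python
import numpy as np

# A toy 256x256 RGB image: three channels of 8-bit integers.
image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)

# Flattening the 3-D grid of pixel values gives a single vector.
vector = image.reshape(-1)
print(vector.shape)  # (196608,) = 256 * 256 * 3
```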

Next, we need to transform our text prompts into vectors. This isn’t as straightforward as converting images. A foundational approach, such as word2vec, can assign a vector to each word in a large text corpus. Subsequently, "sentence2vec" models were developed to extend this concept to sentences.

Now, we can convert both our text prompts (inputs) and images (outputs) into vectors. We can then connect these input vectors to the output vectors through a neural network with several hidden layers in between.

We will utilize training data consisting of numerous pairs of text prompts and corresponding images. By converting these inputs and outputs into vectors as described, we can "train" our neural network: we determine the model's weights so that, for each input, the predicted output vector is as close as possible to the actual output vector.

When a new text prompt is introduced, we convert it into a vector, feed it through our trained neural network, and obtain an output vector. This vector is then transformed back into an image for the user to view.
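A minimal sketch of this pipeline, under purely hypothetical assumptions: a 64-dimensional sentence embedding mapped through one hidden layer to an 8x8 grayscale image, with random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 64-dim sentence embedding in,
# an 8x8 grayscale image (64 pixels) out.
text_dim, hidden_dim, image_dim = 64, 128, 64

# Randomly initialized weights stand in for trained ones.
W1 = rng.normal(scale=0.1, size=(text_dim, hidden_dim))
W2 = rng.normal(scale=0.1, size=(hidden_dim, image_dim))

def text_to_image(prompt_vec):
    """One hidden layer with a ReLU; the output is reshaped into pixels."""
    hidden = np.maximum(0, prompt_vec @ W1)  # hidden layer activations
    return (hidden @ W2).reshape(8, 8)       # back to a 2-D pixel grid

prompt_vec = rng.normal(size=text_dim)       # stand-in for a sentence2vec output
print(text_to_image(prompt_vec).shape)       # (8, 8)
```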

What might be the limitations of such a model? While I haven't implemented this myself, it's likely that the generated images would mostly resemble white noise or appear unnatural. This relates to what I refer to as the “fluffy cloud problem”.

Furthermore, the model lacks the ability to discern which features of an image are significant to human perception and which are not. Identifying which aspects to focus on and which to ignore is a key concern noted by the authors of the Stable Diffusion paper, and one that generative AI models must tackle.

## II) The Fluffy Cloud Problem and Generative AI

It's important to recognize that recent advancements in AI, such as chatbots like GPT and image generators, are instances of generative AI, as opposed to discriminative models that perform more specific tasks, like data classification. Generative models learn a probabilistic process to generate the desired outputs, which, in the case of our discussion, are images derived from text.

The initial task for a generative AI model is to learn a probabilistic process capable of producing random instances of the kinds of images we wish to see.

A common trait among AI models is their embedding of complex, interesting objects—like images, text, and audio—into vector spaces, enabling us to leverage various tools effective with vectors (such as linear algebra and neural networks). However, this approach leads to what I call the "fluffy cloud problem." Much of the resulting vector space corresponds to instances that lack meaningfulness to human experience. Only a small, significant portion—the "fluffy cloud"—contains instances (such as images or text) that hold value for humans. If we randomly sample from the entire vector space, we are likely to obtain images that are largely unrecognizable.

Thus, the first step we take is to sample not from the entire vector space but rather from the "fluffy cloud" representing meaningful objects to humans.

Once this is successfully achieved, we can build upon it by conditioning on inputs like text prompts or other images. This formula for generative AI has yielded remarkable results recently, with text-to-image models like Stable Diffusion, Midjourney, Dall-E, and Bing image generator in computer vision, alongside AI chatbots like ChatGPT and Bard in natural language processing.

## III) Diffusion: Statistical Thermodynamics and Markov Chains

The groundbreaking paper that utilized ideas from statistical physics to create a generative model for image generation dates back to 2015, titled “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”. This paper serves as a motivational starting point, illustrating the thermodynamic process of diffusion.

## A) Diffusion

Diffusion, the physical process that involves the mixing of fluids, can be illustrated with a simple example: if you drop ink into a glass of water, the ink initially retains its shape. However, as water molecules collide with the ink molecules, diffusion occurs, and the ink disperses throughout the liquid.

This process can be effectively approximated using a Markov chain, specifically a random walk. Each ink molecule is nudged in a random direction upon colliding with a water molecule, and this nudging is independent of the path it took to arrive at that point—an essential characteristic of Markov chains. This thought process echoes the reasoning Einstein used to postulate the existence of atoms.
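A quick simulation makes the random-walk picture concrete (the particle count, step count, and step size below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# 1,000 "ink molecules" all start at the same point in a 2-D glass of water.
positions = np.zeros((1000, 2))

# Each collision nudges every molecule in a random direction,
# independent of its history -- the Markov property.
n_steps, step_scale = 500, 0.1
for _ in range(n_steps):
    positions += rng.normal(scale=step_scale, size=positions.shape)

# The blob spreads out: its standard deviation grows like sqrt(n_steps),
# here roughly 0.1 * sqrt(500), i.e. around 2.2.
print(positions.std())
```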

Now, let's apply this concept to images. As discussed earlier, images consist of three-dimensional grids of integers, which can be flattened into arrays. While most images from this high-dimensional space may appear as white noise, the images we encounter in reality form a much smaller "cloud" within that space.

Much like ink diffusing in water, imagine this cloud diffusing into the larger vector space. The details in the images from the cloud are lost during this process, resulting in white noise where all pixels are drawn from a multivariate normal distribution.

This represents the "forward diffusion process," where the information in the image is systematically obliterated over several steps, which is straightforward since it only involves the introduction of noise.

However, our goal is to create new images rather than destroy existing ones. Although generating an image that is merely Gaussian noise is simple, we aim to execute the reverse of the forward diffusion process. This entails starting from an image of pure noise and progressively modifying it to yield something that resonates with human perception. This generative process is considerably more complex, as it involves creating order from chaos, running against the grain of the second law of thermodynamics, which asserts that entropy tends to increase.

To achieve this, we require a robust neural network model. This contrasts with the forward process, which solely involves noise injection without the need for a sophisticated neural network.

This serves as a high-level motivation for generative diffusion models. Next, let’s delve into some relevant papers and explore further details.

## IV) The Papers

In this section, we will discuss three pivotal papers that have significantly contributed to the development of contemporary text-to-image AI models. This overview will highlight their key contributions, and interested readers can refer to the original papers for more in-depth information.

## A) Foundational Thermodynamics Paper (2015)

As previously mentioned, the motivation for this paper lies in statistical thermodynamics, focusing on image generation. Titled “Deep Unsupervised Learning using Nonequilibrium Thermodynamics,” it introduced the concept of generating images starting from Gaussian noise. Here, we establish a basic generative model for images. By feeding a set of images into the model, one can derive parameters for a neural network that generates new images resembling the original set. This is why it’s termed "unsupervised learning"—the images are untagged and simply provided as input.

Train the model on numerous cat images, and it will generate entirely new cat images that you have never seen before. The paper also showcases the in-painting of images with damaged areas using Bayesian composition, demonstrating the effectiveness of a generative model without the need for explicitly tagged data.

The ability to condition these generated images based on user input or other images was not yet explored in this paper.

## B) Denoising Diffusion Probabilistic Models (2020)

This paper builds upon the foundational work from 2015 and presents a more refined training objective function for the diffusion process. Subsequent works in this area have widely adopted the framework established in this paper.

**Forward Process**

The core diffusion process, as explained earlier, involves taking a meaningful image and gradually adding noise to it over discrete steps. At the start, we have our "good image" (likely from the training set). We can flatten this image into a vector, denoted as *x_0*. By applying a small amount of noise, we obtain *x_1*. Adding more noise yields *x_2*, and this continues until we reach *x_T*, which is pure white noise. This signifies the forward diffusion process (moving forward in time).

This process transforms a coherent image into a noisy, nonsensical version over several discrete steps (*1, 2, 3, ..., T*). The paper controls the amount of noise added at each step with a sequence of scalar parameters *β_1, β_2, ..., β_T*.

For *t = 1, 2, 3, ..., T*, we introduce noise at each step. The *β_t* values lie between *0* and *1* and form an increasing sequence. The paper sets *β_1* to 10^-4 and *β_T* to 0.02 on a linear schedule.

Given the original image vector *x_0*, we can characterize *x_1* as a random variable, since the diffusion process perturbs that vector within its vector space. This can be modeled as a probability distribution. The distribution of *x_1* given *x_0*, and similarly of each *x_t* given *x_{t-1}*, is Gaussian:

*q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) · x_{t-1}, β_t · I)*

The mean of this distribution is progressively drawn toward *0* (each step scales it by *√(1 − β_t)*), and the covariance matrix approaches *I*. The movements in the mean should be proportional to the variance, as the mean represents the first moment and the variance the second moment.

The *β_t · I* term is the identity matrix scaled by *β_t*, i.e. a diagonal matrix with *β_t* along its diagonal. It's worth noting that *I* is substantial in size: if the image is 256x256, the vector *x_t* has 196,608 entries, making *I* a 196,608 × 196,608 matrix, but we only need to store the diagonal.
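Here is a sketch of the forward process in code, one noising step at a time. The schedule endpoints follow the DDPM paper; the starting vector is a random stand-in for a flattened image, chosen smaller than 196,608 entries for speed:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# Linear beta schedule, as in the DDPM paper: tiny noise early, more later.
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """One forward-diffusion step:
    x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.normal(size=x_prev.shape)
    return np.sqrt(1.0 - betas[t]) * x_prev + np.sqrt(betas[t]) * noise

# Stand-in for a flattened image (a real 256x256 RGB one has 196,608 entries).
x = rng.normal(size=10_000)
for t in range(T):
    x = forward_step(x, t)

# After T steps the original signal is gone; each entry is close to N(0, 1).
print(x.std())
```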

In her blog, Lilian Weng extends the focus from *q(x_t | x_{t-1})* to *q(x_t | x_0)*. This is valuable since *x_0* represents the distribution of original training images that humans find meaningful. To achieve this, we first define:

*α_t = 1 − β_t* and *ᾱ_t = α_1 · α_2 · ... · α_t*

Through algebraic manipulation, we arrive at an expression for the forward diffusion process spanning all the way back to the original image:

*q(x_t | x_0) = N(x_t; √(ᾱ_t) · x_0, (1 − ᾱ_t) · I)*
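This closed form is easy to sketch in code: precompute the cumulative products of *(1 − β_t)* and jump from *x_0* to *x_t* in one shot, instead of iterating step by step (sizes and seeds below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear schedule, as before
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # alpha_bar_t = product of alpha_s up to t

def q_sample(x0, t):
    """Sample x_t directly from x_0:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = rng.normal(size=10_000)           # stand-in for a flattened training image
x_mid = q_sample(x0, 499)              # halfway: some of x_0 survives
x_end = q_sample(x0, T - 1)            # at the end: almost pure noise

# The surviving fraction of x_0 at step t is sqrt(alpha_bar_t).
print(np.sqrt(alpha_bars[499]), np.sqrt(alpha_bars[-1]))
```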

**Reverse Process**

Reversing this process is more complex as it aims to create order from chaos. We begin with *x_T*, the white noise image, and work back to *x_0*, an image that is meaningful to humans. The initial step is to transition from *x_T* (a random Gaussian) to *x_{T-1}*, then from *x_{T-1}* to *x_{T-2}*, and so forth until we achieve *x_0*.

This process unfolds visibly in tools like Midjourney, where the transition from a grey, noisy image to structured, interesting images illustrates the reverse diffusion process.

The earlier equation formulated the distribution for *x_t* given *x_{t-1}*. For the reverse diffusion, we want the distribution of *x_{t-1}* given *x_t*. The original paper asserts that if the steps taken are sufficiently small, the reverse distribution closely resembles an isotropic Gaussian, similar to the forward one. Since we possess the training images, *x_0*, we can condition on that as well.

Using Bayes' rule, we can determine the mean and variance of the reverse-process distribution *q(x_{t-1} | x_t, x_0)*, which we denote by new variables *μ̃_t* and *β̃_t*.

The paper also suggests that we should not attempt to learn the variance term but instead let the *β_t* parameters dictate it.

At this point, the integration of neural networks becomes crucial. We approximate the mean *μ̃_t* with a neural network, denoted *μ_θ*. Unlike the true *μ̃_t*, this one does not depend on *x_0*, since we want the information in the original training data to be encoded in the network's parameters *θ*.

To train the neural network, we seek parameters *θ* that minimize the squared differences between the true *μ̃_t* and our predicted *μ_θ*. This leads to a training algorithm that lets us fit the model and then sample images from it effectively.

In the first line, we randomly sample *x_0* from its distribution; that is, we pick an image from our training set. Next, we sample a timestep *t* uniformly, since the neural network must learn to denoise at every noise level. The gradient steps on the network's parameters then establish connections between samples from the random Gaussian and the training image set.
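The training loop can be sketched as follows. It uses DDPM's simplified objective of regressing the predicted noise onto the true noise (an equivalent reparameterization of matching the means), and the "network" is a single hypothetical scalar weight standing in for the U-Net:

```python
import numpy as np

rng = np.random.default_rng(3)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t, w):
    """Hypothetical stand-in for the U-Net: a single scalar weight."""
    return w * x_t

def training_step(x0, w, lr=1e-3):
    """One training step: pick a random timestep, noise the image to x_t,
    and regress the predicted noise onto the true noise (squared error)."""
    t = rng.integers(T)                        # uniform random noise level
    eps = rng.normal(size=x0.shape)            # the true noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    pred = eps_theta(x_t, t, w)
    loss = np.mean((eps - pred) ** 2)
    grad = np.mean(2.0 * (pred - eps) * x_t)   # d(loss)/dw for this toy model
    return w - lr * grad, loss

w, x0 = 0.0, rng.normal(size=64)               # one "training image"
for _ in range(200):
    w, loss = training_step(x0, w)
print(loss)
```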

Once the parameters are trained, we can utilize the following algorithm to transition from a Gaussian white noise image to an image sampled from the model using the reverse diffusion process.
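That sampling loop can be sketched like this, with a do-nothing placeholder where the trained network's mean prediction *μ_θ* would go; only the loop structure follows the paper, the placeholder is purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas

def mu_theta(x_t, t):
    """Stand-in for the trained U-Net's mean prediction.
    A real model maps (x_t, t) to the mean of x_{t-1}; this hypothetical
    placeholder just shrinks x_t so the loop is self-contained."""
    return np.sqrt(alphas[t]) * x_t

def sample(dim):
    """DDPM-style sampling: start from pure noise and walk back to t = 0."""
    x = rng.normal(size=dim)                        # x_T ~ N(0, I)
    for t in reversed(range(T)):
        z = rng.normal(size=dim) if t > 0 else 0.0  # no noise on the last step
        x = mu_theta(x, t) + np.sqrt(betas[t]) * z  # sigma_t^2 = beta_t
    return x

print(sample(16).shape)
```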

The paper opts for a U-Net architecture for the network *μ_θ*.

## C) High-Resolution Image Synthesis with Latent Diffusion Models (2022)

In hindsight, it’s logical that users would want to generate images conditioned on various inputs, such as text or other images. However, it took seven years from the initial paper for this concept to be fully realized in practice. The authors of this paper note that the combination of generative models with diverse conditioning variables remains an under-explored research area.

One straightforward method to achieve conditional generation is to train multiple instances of the model—one for cat images, another for dog images, and so on. However, we aim to create a more flexible approach that allows any user-generated text prompt to serve as a condition. This requires converting the prompt into a vector and integrating it into the training process.

The U-Net architecture previously discussed begins with white noise images and transforms them into generated images. By vectorizing the user's text prompt and incorporating it into the initial white noise, we can generate images based on the user's prompt.
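As an illustration only, here is one naive way to combine a prompt embedding with the starting noise; note that Latent Diffusion actually injects the conditioning through cross-attention layers inside the U-Net rather than by concatenation:

```python
import numpy as np

rng = np.random.default_rng(4)

def conditioned_input(noise, prompt_vec):
    """Naive conditioning: concatenate the prompt embedding onto the
    noise vector so the denoising network sees both. (Latent Diffusion
    instead feeds the prompt into cross-attention layers of the U-Net.)"""
    return np.concatenate([noise, prompt_vec])

noise = rng.normal(size=64)        # the starting white-noise latent
prompt_vec = rng.normal(size=16)   # stand-in for a text encoder's output
x = conditioned_input(noise, prompt_vec)
print(x.shape)                     # (80,)
```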

This framework underlies the open-source "stable diffusion" model, which I have previously utilized at no cost (though improved paid variants are now available). The free version produced images that were not quite on par with offerings from Midjourney, Dall-E, or Bing image generator.

## V) What Does “Back in Time” Mean?

Earlier I hinted that diffusion models involve an element of time travel; what does moving "back in time" mean here? Our physical laws are time-symmetric, functioning equally well forwards and backwards. Yet the universe unfolds in a singular temporal direction. The second law of thermodynamics, which posits that entropy must continually increase, is the sole law exhibiting a temporal direction. In the context of this discussion, white noise signifies maximal entropy. Therefore, the forward diffusion process corresponds to increasing entropy, or moving forward in time, while the reverse diffusion process effectively retraces this path, moving images backward in time.

## VI) Conclusion

This article explored probabilistic diffusion models capable of generating images based on a collection of inputs. We also examined how to condition these models on text and other inputs, culminating in the successful text-to-image models we see today.

## VII) References

[1] Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015): https://arxiv.org/pdf/1503.03585.pdf

[2] Denoising Diffusion Probabilistic Models (2020): https://arxiv.org/pdf/2006.11239.pdf

[3] High-Resolution Image Synthesis with Latent Diffusion Models (2022): https://arxiv.org/pdf/2112.10752.pdf

[4] Diffusion Models Beat GANs on Image Synthesis: https://arxiv.org/pdf/2105.05233.pdf

[5] Lilian Weng's blog on diffusion models: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

[6] Physics diffusion process: https://scholar.harvard.edu/files/schwartz/files/2-diffusion.pdf

[7] Original word2vec paper: https://arxiv.org/pdf/1301.3781.pdf

[8] Efficient comparison of sentence embeddings: https://arxiv.org/abs/2204.00820