Imagine we take an image and add a bit of Gaussian noise to it, then do this again. If we repeat this enough times, eventually we'll have an unrecognizable picture of static: a sample of pure noise. Now, what if we could figure out how to undo this process, that is, start from a noise image, gradually remove the noise, and end up with a coherent image? This is the basic idea behind diffusion models, an approach gaining traction in generative modeling. They've had success particularly in the domain of image generation, and they are starting to rival, and in some cases surpass, other kinds of generative models you may be familiar with on certain tasks. For example, recent diffusion models have outperformed generative adversarial networks (GANs) on perceptual quality metrics, and they've also shown impressive performance in various conditional settings, such as converting text descriptions to images, inpainting, and image manipulation. In this video we'll try to understand the basic mechanism behind diffusion models and how they can be adapted to different generative settings.

We'll start with a sample from some target data distribution, like an image from a training set. Let's call this x_0.
Now let's define a forward diffusion process that gradually adds noise to the image over T time steps. Our model will be tasked with starting at x_T and undoing this noise through what we'll call the reverse process. The forward process, which we'll denote q, takes the form of a Markov chain, where the distribution at a particular time step depends only on the sample from the immediately previous step. So we can write the distribution of corrupted samples conditioned on the initial data point x_0 as a product of successive single-step conditionals. In the case of continuous data, each transition is parameterized as a diagonal Gaussian whose mean is √(1 - β_t) times the previous sample and whose variance is β_t; β_t here is the variance at a particular time step t. Typically these variances are treated as hyperparameters and follow a fixed schedule for a particular training run. β_t generally increases with time and is restricted to lie between 0 and 1, meaning that the coefficient √(1 - β_t) will likewise be nonzero but less than one, pulling the mean of each new Gaussian closer to zero. In the limit as t approaches infinity, q approaches a Gaussian centered at zero with identity covariance, losing all information about the original sample.
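To make the forward process concrete, here's a minimal sketch in PyTorch of a variance schedule and a single forward noising step. The linear schedule endpoints (1e-4 to 0.02 over 1000 steps) are a common choice but are an assumption here, not something fixed by the discussion above.

```python
import torch

T = 1000  # total number of forward steps (assumed; "on the order of a thousand")
# Linear variance schedule: small betas that grow with t (endpoints are an assumption).
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t]
    mean = torch.sqrt(1.0 - beta_t) * x_prev
    noise = torch.randn_like(x_prev)
    return mean + torch.sqrt(beta_t) * noise

# Corrupt an image step by step until it is (nearly) pure noise.
x = torch.rand(3, 64, 64)  # stand-in for an image x_0 with values in [0, 1]
for t in range(T):
    x = forward_step(x, t)
```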
In practice, the total number of steps T is on the order of a thousand. Using a large, albeit finite, number of steps allows us to set the individual variances β_t to be very small while still approximately maintaining the same limiting distribution. But why do we want to use a small step size? What's the benefit? It means that learning to undo the steps of the forward process won't be too difficult.

Let's consider a simple case in one dimension. Suppose we were given the distribution of a forward-process sample at time t-1, and it resembled a mixture of Gaussians with two modes. We then observe x_t and want to infer the posterior distribution over x_{t-1}; that is, we'd like to determine where the chain likely came from in order to arrive at x_t. What was the previous step of the chain? If the noise step, that is q(x_t | x_{t-1}), is allowed to be large, then we will be quite uncertain about the location of x_{t-1}: who knows where we jumped from? But if the forward noise step is restricted to be small, there is much less ambiguity about x_{t-1}. We would then be justified in modeling the posterior of the forward step, that is q(x_{t-1} | x_t), with a unimodal Gaussian, eliminating the contribution from the mode to the right. In fact, it can be shown theoretically that in the limit of infinitesimal step sizes, the true reverse process has the same functional form as the forward process. Diffusion models leverage this observation, parameterizing each learned reverse step to also be a unimodal diagonal Gaussian.
Aside from the sample at time t, the model also takes t itself as input in order to account for the forward-process variance schedule: different time steps are associated with different noise levels, and the model can learn to undo them individually. Like the forward process, the reverse process is set up as a Markov chain, and we can write the joint probability of a sequence of samples as a product of conditionals times the marginal probability of x_T. So what is p(x_T) here, exactly? It's the same as q(x_T): the pure-noise distribution. At inference time, in order to actually generate a sample, we start from a Gaussian and repeatedly sample from the learned individual steps of the reverse process, p_θ(x_{t-1} | x_t), until we produce an x_0.
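Here's a minimal sketch of that inference-time sampling loop, assuming a trained network model(x_t, t) that outputs the mean of the reverse Gaussian p_θ(x_{t-1} | x_t) and a fixed per-step standard deviation sigmas[t]; both names are placeholders rather than a specific library API.

```python
import torch

@torch.no_grad()
def sample(model, shape, sigmas, T=1000):
    """Run the learned reverse chain: start from pure noise and denoise step by step."""
    x = torch.randn(shape)  # x_T ~ N(0, I), the pure-noise prior
    for t in reversed(range(T)):
        mean = model(x, t)            # predicted mean of p(x_{t-1} | x_t)
        if t > 0:
            x = mean + sigmas[t] * torch.randn_like(x)
        else:
            x = mean                  # no noise is added at the final step
    return x  # an approximate sample x_0 from the data distribution
```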
Okay, great. So we've defined these forward and reverse diffusion processes. The forward process is designed to essentially push a sample off the data manifold, turning it into noise, and the reverse process is trained to produce a trajectory back to the data manifold, resulting in a reasonable sample. But what objective will we actually be optimizing? Is it some maximum-likelihood objective where we directly maximize the density assigned to x_0 by the model? Not exactly. If we try to calculate p(x_0), we see that we have to marginalize over all possible trajectories, all the ways we could have arrived at x_0 when starting from a noise sample. This, unfortunately, is intractable. But it turns out we can maximize a lower bound. To do this, let's view x_1 through x_T as latent variables and x_0 as an observed variable, allowing us to interpret a diffusion model as a kind of latent-variable generative model.

If we think back to another latent-variable model you may be familiar with, variational autoencoders (VAEs), we might get a hint about our training objective. As a quick reminder, in a VAE we have an encoder that produces a distribution over latents z given a data input x, and a decoder that reconstructs the data by producing a distribution over data x given a latent input z. So we can think of the forward process in diffusion models as analogous to the encoder, producing latents from data, and the reverse process as analogous to the decoder, producing data from latents.
Now, unlike a VAE encoder, the forward process here is typically fixed; it's the reverse process that we focus solely on learning. This means that only a single network needs to be trained, unlike in a VAE where two networks are trained jointly. So we can borrow the basic training objective used by VAEs and a number of other latent-variable models. When we have a model with observations x and latent variable z, we can derive what's called the variational lower bound, also known as the evidence lower bound (ELBO): a lower bound on the marginal log-likelihood log p_θ(x). We won't walk through the full derivation here, but the end result is a likelihood term, also known as a reconstruction term, minus a KL divergence term. The likelihood term encourages the model to maximize the expected density assigned to the data, while the KL divergence encourages the approximate posterior q(z | x) to be similar to the prior on the latent variable, p(z). As we saw earlier, x_0 will serve as the observation in the diffusion-model framework, while x_1 through x_T will take the place of the latent variable z.
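Written out, the generic bound being described is the standard form of the ELBO:

$$
\log p_\theta(x) \;\ge\; \mathbb{E}_{q(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)
$$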
Let's substitute these in. Now let's simplify a bit: we can expand the KL divergence to combine the two terms into a single expectation, and finally we can factor the chain probabilities into their individual steps. Now, there's a nice property of the forward process q that we didn't touch on earlier: any arbitrary step of the forward process can be sampled directly in closed form, simply because a sum of independent Gaussian steps is still Gaussian. So at training time, any term of this objective can be obtained without having to simulate an entire chain.
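Concretely, with ᾱ_t defined as the cumulative product of (1 - β_s) up to step t, we can jump straight from x_0 to any x_t. A small sketch, reusing the betas schedule assumed earlier:

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)       # same assumed schedule as before
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0: torch.Tensor, t: int, noise: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
    in one shot, without simulating the chain step by step."""
    ab = alpha_bars[t]
    return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise
```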
In principle, we could optimize this objective by randomly sampling pairs (x_{t-1}, x_t) and maximizing the conditional density the reverse step assigns to x_{t-1}. However, because different trajectories may visit different samples at time t-1 on the way to hitting the same x_t, this setup can have high variance, limiting training efficiency. To help with this, we can rearrange the objective as follows. Let's examine each component. p(x_T) is fixed: it's just the start of the reverse process, the pure-noise distribution. And, as we saw earlier, the whole forward process q is also treated as fixed, so we only have to worry about the two remaining kinds of terms. On the right we have a sum of KL divergences, each between a reverse step and a forward-process posterior conditioned on x_0. One can show with Bayes' rule that when we treat the original sample x_0 as known, as it is during training, these q terms are actually just Gaussians. Since each reverse step is already parameterized as a Gaussian, each KL divergence is now simply comparing two Gaussians and can be evaluated in closed form. This helps reduce variance in training: instead of aiming to reconstruct Monte Carlo samples, the targets for the reverse step become the true posteriors of the forward process given x_0.
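For reference, that forward-process posterior has a standard closed form (this is the usual DDPM expression, with α_t = 1 - β_t and ᾱ_t their cumulative product):

$$
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t I\right),\qquad
\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t,\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t
$$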
There are a couple of different ways we could imagine implementing the reverse step p_θ. In the paper "Denoising Diffusion Probabilistic Models" (DDPM for short), the authors elect to set the reverse-process variances to time-specific constants, as they found that learning them led to unstable training and lower-quality samples; the reverse-step network is therefore solely tasked with learning the means. They then suggest a reparameterization that aims to have the network predict the noise that was added, rather than the Gaussian mean. First, we can rewrite sampling from an arbitrary forward step using an auxiliary noise variable ε. ε has a constant distribution, independent of the forward time step t, and the reverse-step model can be designed to simply predict this ε. The authors also found that a simpler version of the variational bound, which discards the term weights that appear in the original bound, led to better sample quality. Compared to the original variational lower bound, their objective downweights steps that have very small noise at early time steps of the forward process, allowing training to focus on the more challenging, noisier steps.
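A minimal sketch of that simplified, noise-prediction objective, assuming a network eps_model(x_t, t) trained to predict ε. The name and signature are placeholders, and the alpha_bars schedule is the one sketched above.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0: torch.Tensor, T: int = 1000) -> torch.Tensor:
    """Simplified DDPM objective: sample a random step t, noise x_0 to x_t in closed
    form, and regress the network's output onto the noise that was actually added."""
    t = torch.randint(0, T, (x0.shape[0],))          # one random timestep per example
    noise = torch.randn_like(x0)                     # epsilon ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1, 1)             # broadcast over image dimensions
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * noise
    predicted_noise = eps_model(x_t, t)
    return F.mse_loss(predicted_noise, noise)        # unweighted MSE, i.e. the simple loss
```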
Like other generative frameworks, diffusion models can be made to sample conditionally, given some variable of interest like a class label or a sentence description. One way to do this is to simply feed the conditioning variable y as an additional input during training; in theory, the model should learn to use y as a helpful hint about what it should be reconstructing. In practice, some work has shown that further guiding the diffusion process with a separate classifier can help. In this setup, we take a trained classifier and push the reverse diffusion process in the direction of the gradient of the target-label probability with respect to the current noisy image, and we can do this not just with single-word labels but with higher-dimensional text descriptions as well. Of course, one drawback of this technique is the reliance on a second network.
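Here's a rough sketch of a classifier-guided reverse step in that spirit: the reverse-step mean is nudged by the gradient of the classifier's log-probability for the target label. The model, classifier, sigmas, and guidance_scale names are placeholders, the classifier is assumed to be noise-aware (it also takes t), and scaling the shift by the step variance follows a common formulation rather than any single fixed recipe.

```python
import torch

def guided_reverse_step(model, classifier, x_t, t, y, sigmas, guidance_scale=1.0):
    """One reverse step, with the mean shifted toward higher classifier
    log-probability for the target label y."""
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
    selected = log_probs[torch.arange(len(y)), y].sum()   # log p(y | x_t) per sample
    grad = torch.autograd.grad(selected, x_in)[0]          # d log p(y | x_t) / d x_t

    mean = model(x_t, t)                                   # unguided reverse-step mean
    mean = mean + guidance_scale * (sigmas[t] ** 2) * grad # push toward the target label
    if t > 0:
        return mean + sigmas[t] * torch.randn_like(x_t)
    return mean
```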
An alternative approach eliminates this reliance, instead using special training of the diffusion model itself to guide the sampling. In the paper "Classifier-Free Diffusion Guidance," the conditioning label y is set to a null label with some probability during training. Then, at inference time, the reconstructed samples are artificially pushed further toward the y-conditional direction and away from the null label. Even though no new information is being given to the model, the authors found this to produce higher-quality samples under human evaluation than classifier guidance.
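A sketch of the classifier-free guidance idea at inference time, assuming the network was trained with label dropout and predicts noise; eps_model, null_label, and guidance_scale are placeholder names, and the extrapolation below is the commonly used form.

```python
import torch

def cfg_noise_prediction(eps_model, x_t, t, y, null_label, guidance_scale=3.0):
    """Classifier-free guidance: run the model with and without the label, then
    push the prediction further toward the conditional direction."""
    eps_cond = eps_model(x_t, t, y)              # conditioned on the label y
    eps_uncond = eps_model(x_t, t, null_label)   # conditioned on the null label
    # Extrapolate away from the unconditional prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```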
Inpainting is another conditional generation problem where diffusion models have had success. The naive way to perform inpainting with a diffusion model is to take a model trained in the standard way and, at inference time, replace the known regions of the image with a sample from the forward process after each reverse step. This works okay, but it can lead to edge artifacts: the model is never made aware of the full surrounding context, only a hazy version of it. Better results come instead from fine-tuning a model specifically for this task: we can randomly remove sections of training images and have the model attempt to fill them in, conditioned on the full, clear context.
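A sketch of that naive inpainting loop, assuming the standard unconditional sampler from before: after each reverse step, the known pixels (where mask == 1) are overwritten with an appropriately noised version of the original image, reusing the q_sample helper sketched earlier. This mirrors the replacement idea described above rather than any particular paper's exact algorithm.

```python
import torch

@torch.no_grad()
def naive_inpaint(model, x_known, mask, sigmas, T=1000):
    """x_known: the original image; mask: 1 for known pixels, 0 for the hole."""
    x = torch.randn_like(x_known)                  # start the reverse chain from noise
    for t in reversed(range(T)):
        mean = model(x, t)
        if t > 0:
            x = mean + sigmas[t] * torch.randn_like(x)
            # Re-impose the known region at the matching noise level t-1.
            noised_known = q_sample(x_known, t - 1, torch.randn_like(x_known))
            x = mask * noised_known + (1.0 - mask) * x
        else:
            x = mask * x_known + (1.0 - mask) * mean
    return x
```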
We can compare diffusion models to some other prominent deep generative models. For sampling tasks, diffusion models are somewhat limited by their slow Markov chain; this contrasts, for example, with GANs, which can generate images in a single forward pass, and ongoing work aims to speed up sampling in diffusion models. As we saw earlier, diffusion models allow us to calculate a variational lower bound on the log-likelihood, similar to VAEs. In practice this lower bound can be quite good, and even competitive on density-estimation benchmarks, which have long been dominated by autoregressive models. Going beyond lower bounds, a continuous-time formulation of diffusion models can give rise to what's called a probability flow ODE, which enables approximating the log-likelihood via numerical integration.

There's a close connection between denoising diffusion models and what are called score-matching models, and these are now often grouped together into a single class of models. "Score" here refers to the gradient of the log of the target probability density with respect to the data. A score network is trained to estimate this value, and then a Markov chain is set up to actually produce samples from the learned distribution, guided by this gradient. Well, it turns out the score can actually be shown to be equivalent to the noise that's predicted in the denoising diffusion objective, up to a scaling factor.
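In symbols, using the notation from earlier (ε_θ the predicted noise, ᾱ_t the cumulative product of 1 - β_s), the relationship is that the score of the noised marginal is the negative predicted noise divided by that scaling factor:

$$
\nabla_{x_t} \log q(x_t) \;\approx\; -\,\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}
$$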
So we can think of undoing the noise in a diffusion model, approximately, as trying to follow the gradient of the data log-density. Diffusion models are really gaining momentum, and it's been exciting to see their progress. Check out the links in the description to learn more. Thanks for watching.