I'm really, really excited about this lecture, because as Alexander introduced yesterday, right now we're in this tremendous age of generative AI, and today we're going to learn the foundations of deep generative modeling, where we're going to talk about building systems that can not only look for patterns in data but can actually go a step beyond this to generate brand new data instances based on those learned patterns. This is an incredibly complex and powerful idea, and as I mentioned, it's a particular subset of deep learning that has really exploded in the past couple of years, and this year in particular. So to start, and to demonstrate how powerful these algorithms are, let me show you these three different faces. I want you to take a minute and think about which face you think is real. Raise your hand if you think it's face A. Okay, you see a couple of people. Face B? Many more people. Face C? About second place. Well, the truth is that all of you are wrong: all three of these faces are fake. These people do not exist. These images were synthesized by deep generative models trained on
data of human faces and asked to produce new instances now I think that this demonstration kind of demonstrates the power of these ideas and the power of this notion of generative modeling so let's get a little more concrete about how we can formalize this so far in this course we've been looking at what we call problems of supervised learning meaning that we're given data and associated with that data is a set of labels our goal is to learn a function that maps that data to the labels now we're in a course on deep learning so
we've been concerned with functional mappings that are defined by Deep neural networks but really that function could be anything neural networks are powerful but we could use other techniques as well in contrast there's another class of problems in machine learning that we refer to as unsupervised learning where we take data but now we're given only data no labels and our goal is to try to build some method that can understand the hidden underlying structure of that data what this allows us to do is it gives us new insights into the foundational representation of the data
and, as we'll see later, actually enables us to generate new data instances. Now, this class of problems, this definition of unsupervised learning, captures the types of models that we're going to talk about today, with the focus on generative modeling, which is an example of unsupervised learning. It is united by this goal: we're given only samples from a training set, and we want to learn a model that represents the distribution of the data that the model is seeing. Generative modeling takes two general forms: first, density estimation, and second, sample generation.
In density estimation, the task is: given some data examples, our goal is to train a model that learns an underlying probability distribution that describes where the data came from. With sample generation, the idea is similar, but the focus is more on actually generating new instances: our goal is to again learn this model of the underlying probability distribution, but then use that model to sample from it and generate new instances that are similar to the data that we've seen, ideally falling approximately along that same real data distribution.
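To make these two tasks concrete, here is a minimal, purely illustrative sketch (not from the lecture) in which the "model" is just a one-dimensional Gaussian fit by maximum likelihood; deep generative models replace this hand-picked distribution with a neural network over high-dimensional data, but the two tasks, estimating a density and sampling new instances from it, are the same:

```python
import numpy as np

# Toy illustration: treat the training set as samples from an unknown 1-D
# distribution and fit a Gaussian to it by maximum likelihood.
rng = np.random.default_rng(0)
x_train = rng.normal(loc=2.0, scale=0.5, size=1000)   # stand-in for "real" data

# Density estimation: learn the parameters of a model distribution p_model(x).
mu_hat = x_train.mean()
sigma_hat = x_train.std()

def p_model(x):
    """Estimated density under the fitted Gaussian."""
    return np.exp(-0.5 * ((x - mu_hat) / sigma_hat) ** 2) / (sigma_hat * np.sqrt(2 * np.pi))

# Sample generation: draw brand-new instances from the learned distribution.
x_new = rng.normal(loc=mu_hat, scale=sigma_hat, size=5)
print(p_model(x_new), x_new)
```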
Now, in both these cases, density estimation and sample generation, the underlying question is the same: our learning task is to try to build a model that learns a probability distribution that is as close as possible to the true data distribution. Okay, so with this definition and this concept of generative modeling, what are some ways that we can actually deploy generative modeling in the real world for high-impact applications? Well, part of the reason that generative models are so powerful is that they have this ability to uncover the underlying features in a data set and encode them in an efficient
way so for example if we're considering the problem of facial detection and we're given a data set with many many different faces starting out without inspecting this data we may not know what the distribution of Faces in this data set is with respect to Features we may be caring about for example the pose of the head clothing glasses skin tone Hair Etc and it can be the case that our training data may be very very biased towards particular features without us even realizing this using generative models we can actually identify the distributions of these underlying
features in a completely automatic way without any labeling in order to understand what features may be overrepresented in the data what features may be underrepresented in the data and this is the focus of today and tomorrow's software Labs which are going to be part of the software lab competition developing generative models that can do this task and using it to uncover and diagnose biases that can exist within facial detection models another really powerful example is in the case of outlier detection identifying rare events so let's consider the example of self-driving autonomous cars with an autonomous
car. Let's say it's driving out in the real world: we really, really want to make sure that that car is able to handle all the possible scenarios and all the possible cases it may encounter, including edge cases like a deer coming in front of the car, or some unexpected rare events, not just, you know, the typical straight freeway driving that it may see the majority of the time. With generative models, we can use this idea of density estimation to identify rare and anomalous events within the training data, and as they're occurring
as the model sees them for the first time so hopefully this paints this picture of what generative modeling the underlying concept is and a couple of different ways in which we can actually deploy these ideas for powerful and impactful real world applications yeah in today's lecture we're going to focus on a broad class of generative models that we call latent variable models and specifically distilled down into two subtypes of latent variable models first things first I've introduced this term latent variable but I haven't told you or described to you what that actually is I think
a great example and one of my favorite examples throughout this entire course that gets at this idea of the latent variable is this little story from Plato's Republic which is known as the myth of the cave in this myth there is a group of prisoners and as part of their punishment they're constrained to face a wall now the only things the prisoners can observe are shadows of objects that are passing in front of a fire that's behind them and they're observing the casting of the Shadows on the wall of this cave to the prisoners those
shadows are the only things they see, their observations. They can measure them, they can give them names, because to them that's their reality, but they're unable to directly see the underlying objects, the true factors themselves, that are casting those shadows. Those objects here are like latent variables in machine learning: they're not directly observable, but they're the true underlying features or explanatory factors that create the observed differences and variables that we can see and observe. And this gets at the goal of generative modeling, which is to find ways that we can actually learn these hidden features,
these underlying latent variables, even when we're only given observations of the observed data. So let's start by discussing a very simple generative model that tries to do this through the idea of encoding the data input. The models we're going to talk about are called autoencoders, and to take a look at how an autoencoder works, we'll go through it step by step, starting with the first step of taking some raw input data and passing it through a series of neural network layers. Now, the output of this first step is what we refer to as
a low-dimensional latent space. It's an encoded representation of those underlying features, and that's our goal in trying to train this model and predict those features. The reason a model like this is called an encoder, an autoencoder, is that it's mapping the data x into this vector of latent variables z. Now let's ask ourselves a question, let's pause for a moment: why might we care about having this latent variable vector z be in a low-dimensional space? Anyone have any ideas? All right, maybe there are some ideas. Yes, the suggestion was that it's more efficient. Yes, that gets at the heart of the question: the idea of having that low-dimensional latent space is that it's a very efficient, compact encoding of the rich, high-dimensional data that we may start with. As you pointed out, what this means is that we're able to compress data into a small feature representation, a vector that captures this compactness and richness, without requiring so much memory or so much storage. So how do we actually train the network to learn this latent variable vector? Since we don't have training data, we
can't explicitly observe these latent variables z, we need to do something more clever. What the autoencoder does is it builds a way to decode this latent variable vector back up to the original data space, trying to reconstruct the original image from that compressed, efficient latent encoding. Once again we can use a series of neural network layers, such as convolutional layers and fully connected layers, but now to map back from that lower-dimensional space back upwards to the input space. This generates a reconstructed output, which we can denote as x-hat, since it's an imperfect
reconstruction of our original input data to train this network all we have to do is compare the outputted Reconstruction and the original input data and say how do we make these as similar as possible we can minimize the distance between that input and our reconstructed output so for example for an image we can compare the pixel wise difference between the input data and the reconstructed output just subtracting the images from one another and squaring that difference to capture the pixel wise Divergence between the input and the reconstruction what I hope you'll notice and appreciate is
that this definition of the loss doesn't require any labels: the only components of that loss are the original input data x and the reconstructed output x-hat. I've now simplified this diagram by abstracting away those individual neural network layers in both the encoder and decoder components. And again, this idea of not requiring any labels gets back to the idea of unsupervised learning, since what we've done is we've been able to learn an encoded quantity, our latent variables, that we cannot observe, without any explicit labels; all we started from was the raw data itself.
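As a rough sketch of what this looks like in code (a minimal, hypothetical PyTorch version, not the lab's implementation), the encoder and decoder are just two stacks of layers, and the loss compares the input to its reconstruction with no labels involved:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    """Minimal fully-connected autoencoder for flattened 28x28 images."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress x down to the low-dimensional latent vector z.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: map z back up to a reconstruction x_hat in the input space.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat

# Reconstruction loss: pixel-wise squared difference between input and output.
model = AutoEncoder()
x = torch.rand(16, 784)               # a dummy batch standing in for real images
x_hat = model(x)
loss = F.mse_loss(x_hat, x)           # no labels anywhere in this loss
loss.backward()
```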
It turns out that, as the question and answer got at, the dimensionality of the latent space has a huge impact on the quality of the generated reconstructions and on how compressed that information bottleneck is. Autoencoding is a form of compression, and so the lower the dimensionality of the latent space, the worse our reconstructions are going to be; but the higher the dimensionality, the less efficient that encoding is going to be. So to summarize this first part: the idea of an autoencoder is using this bottlenecked, compressed hidden latent layer
to try to bring the network down to learn a compact, efficient representation of the data. We don't require any labels; this is completely unsupervised, and so in this way we're able to automatically encode information within the data itself to learn this latent space: auto-encoding information, auto-encoding data. Now, this is a pretty simple model, and it turns out that in practice this idea of self-encoding, or auto-encoding, has a bit of a twist on it to allow us to actually generate new examples that are not only reconstructions of the input data itself, and
this leads us to the concept of variational autoencoders or vaes with the traditional autoencoder that we just saw if we pay closer attention to the latent layer right which is shown in that orange salmon color that latent layer is just a normal layer in the neural network it's completely deterministic what that means is once we've trained the network once the weights are set anytime we pass a given input in and go back through the latent layer decode back out we're going to get the same exact reconstruction the weights aren't changing it's deterministic in contrast variational
autoencoders, or VAEs, introduce an element of randomness, a probabilistic twist on this idea of auto-encoding. What this will allow us to do is actually generate new images, or new data instances, that are similar to the input data but not forced to be strict reconstructions. In practice, with the variational autoencoder we've replaced that single deterministic layer with a random sampling operation. Now, instead of learning just the latent variables directly themselves, for each latent variable we define a mean and a standard deviation that capture a probability distribution over that latent variable. What we've
done is we've gone from a single Vector of latent variable Z to a vector of means mu and a vector of standard deviations Sigma that parametrize the probability distributions around those latent variables what this will allow us to do is Now sample using this element of Randomness this element of probability to then obtain a probabilistic representation of the latent space itself as you hopefully can tell right this is very very very similar to the autoencoder itself but we've just added this probabilistic twist where we can sample in that intermediate space to get these samples of
latent variables. Now, to get a little more into the depth of how this is actually learned, how this is actually trained: in defining the VAE, we've eliminated this deterministic nature to now have encoders and decoders that are probabilistic. The encoder is computing a probability distribution of the latent variables z given input data x, while the decoder is doing the inverse, trying to learn a probability distribution back in the input data space given the latent variables z, and we define separate sets of weights, phi and theta, to define the network weights for the encoder
and decoder components of the vae all right so when we get now to how we actually optimize and learn the network weights in the vae first step is to define a loss function right that's the core element to training a neural network our loss is going to be a function of the data and a function of the neural network weights just like before but we have these two components these two terms that Define our vae loss first we see the Reconstruction loss just like before where the goal is to capture the difference between our input
data and the reconstructed output. Now, for the VAE, we've introduced a second term to the loss, what we call the regularization term; often you'll even see this referred to as the VAE loss. We'll go into describing what this regularization term means and what it's doing. To do that, and to understand it, remember and keep in mind that in all neural network operations our goal is to try to optimize the network weights with respect to the data, with respect to minimizing this objective loss, and so here we're concerned with
the network weights phi and theta that define the weights of the encoder and the decoder. We consider these two terms: first, the reconstruction loss. Again, the reconstruction loss is very similar, the same as before; you can think of it as the error or the likelihood that effectively captures the difference between your input and your output, and again we can train this in an unsupervised way, not requiring any labels, to force the latent space and the network to learn how to effectively reconstruct the input data. The second term, the regularization term, is now where things get
a bit more interesting, so let's go into this in a little bit more detail. Because we have this probability distribution and we're trying to compute this encoding and then decode back up, as part of regularizing we want to take that inference over the latent distribution and constrain it to behave nicely, if you will. The way we do that is we place what we call a prior on the latent distribution, and what this is is some initial hypothesis or guess about what that latent variable space may look like. This helps us, and helps
the network, to enforce a latent space that roughly tries to follow this prior distribution, and this prior is denoted as p(z). That term D is effectively the regularization term: it's capturing a distance between our encoding of the latent variables and our prior hypothesis about what the structure of that latent space should look like. So over the course of training, we're trying to enforce that each of those latent variables adopts a probability distribution that's similar to that prior. A common choice when training VAEs and developing these models is to enforce
the latent variables to be roughly standard normal Gaussian distributions, meaning that they are centered around mean zero and they have a standard deviation of one. What this allows us to do is encourage the encoder to put the latent variables roughly around a centered space, distributing the encoding smoothly, so that we don't get too much divergence away from that smooth space, which can occur if the network tries to cheat and simply memorize the data. By placing the Gaussian standard normal prior on the latent space, we can define a concrete mathematical term that captures
the distance, the divergence, between our encoded latent variables and this prior, and this is called the KL divergence. When our prior is a standard normal, the KL divergence takes the form of the equation that I'm showing up on the screen, but what I want you to really come away with is that this concept of trying to smooth things out, and to capture this divergence, this difference between the prior and the latent encoding, is all this KL term is trying to capture. It's a bit of math, and I acknowledge that.
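For reference, here is a small, hypothetical sketch of that equation in code: for a standard normal prior, the KL term between the encoder's N(mu, sigma^2) and N(0, 1) has the closed form shown in the comment, and the full VAE loss just adds it to the reconstruction term:

```python
import torch

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction term plus KL regularization against a standard normal prior.

    mu, log_var: outputs of the probabilistic encoder, shape (batch, latent_dim).
    """
    # Reconstruction: pixel-wise squared error between input and reconstruction.
    recon = torch.sum((x - x_hat) ** 2, dim=1)
    # Closed-form KL divergence D_KL( N(mu, sigma^2) || N(0, 1) ):
    #   -1/2 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    return torch.mean(recon + kl)
```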
But what I want to go into next is really the intuition behind this regularization operation: why do we do this, and why does the normal prior in particular work effectively for VAEs? So let's consider what properties we want our latent space to adopt, and this regularization to achieve. The first is this goal of continuity. What we mean by continuity is that if there are points in the latent space that are close together, ideally, after decoding, we should recover two reconstructions that are similar in content, that make sense given that
they're close together. The second key property is this idea of completeness: we don't want there to be gaps in the latent space; we want to be able to decode and sample from the latent space in a way that is smooth and a way that is connected. To get more concrete, let's ask what could be the consequences of not regularizing our latent space at all. Well, if we don't regularize, we can end up with instances where there are points that are close in the latent space but don't end up with similar decodings or similar reconstructions;
similarly, we could have points that don't lead to meaningful reconstructions at all; they're somehow encoded, but we can't decode them effectively. Regularization allows us to realize points that end up close in the latent space and are also similarly and meaningfully reconstructed. Okay, so continuing with this example, the one I showed there without getting into details, of shapes of different colors that we're trying to encode in some lower-dimensional space: with regularization, we are able to achieve this by trying to minimize that regularization term. It's not sufficient
to just employ the reconstruction loss alone to achieve this continuity and this completeness, because without regularization, just encoding and reconstructing does not guarantee those properties. Without regularization we could end up with pointed distributions, discontinuities, and disparate means scattered across the latent space; we overcome these issues by regularizing the mean and the variance of the encoded latent distributions according to that normal prior. What this allows is for the learned distributions of those latent variables to effectively overlap in the
latent space, because everything is regularized according to this prior of mean zero and standard deviation one, and that centers the means and regularizes the variances of each of those independent latent variable distributions together. The net effect of this regularization is that we can achieve continuity and completeness in the latent space: points that are close should correspond to similar reconstructions that we get out. So hopefully this gets at some of the intuition behind the idea of the VAE, behind the idea of the regularization, and of trying to enforce the structured normal prior on
the latent space. With this in hand, with the two components of our loss function, reconstructing the inputs and regularizing the learning to try to achieve continuity and completeness, we can now think about how we define a forward pass through the network, going from an input example and being able to decode and sample from the latent variables to look at new examples. Our last critical step is how the actual backpropagation training algorithm is defined and how we achieve this. The key issue, as I introduced with VAEs, is this notion of randomness, of sampling, that we have introduced by
defining these probability distributions over each of the latent variables. The problem this gives us is that we cannot backpropagate directly through anything that has an element of sampling, anything that has an element of randomness; backpropagation requires completely deterministic nodes, deterministic layers, to be able to successfully apply gradient descent and the backpropagation algorithm. The breakthrough idea that enabled VAEs to be trained completely end to end was this idea of reparametrization within that sampling layer, and I'll give you the key idea about how this operation works. It's actually really quite clever. As
I said, when we have a notion of randomness, of probability, we can't backpropagate directly through that sampling layer. Instead, with reparametrization, what we do is redefine how a latent variable vector is sampled: as a sum of a fixed, deterministic mean mu and a fixed vector of standard deviations sigma, where the trick is that we divert all the randomness, all the sampling, to a random constant epsilon that's drawn from a normal distribution. So the mean itself is fixed, the standard deviation is fixed, and all the randomness and the sampling occurs according to that epsilon constant. We can then scale
the standard deviation sigma by that random constant epsilon and shift by the mean mu to re-achieve the sampling operation within the latent variables themselves. What this actually looks like, and an illustration that breaks down this concept of reparametrization, is as follows. Looking here, what I've shown are the completely deterministic steps in blue and the sampling, random steps in orange. Originally, if our latent variables are what effectively capture the randomness, the sampling, themselves, we have this problem in that we can't backpropagate, we can't train directly through anything that has stochasticity, that has randomness. What reparametrization
allows us to do is shift this diagram, where now we've completely diverted that sampling operation off to the side to this constant epsilon, which is drawn from a normal prior, and now, when we look back at our latent variable, it is deterministic with respect to that sampling operation. What this means is that we can backpropagate to update our network weights completely end to end without having to worry about direct randomness, direct stochasticity, within those latent variables z. This trick is really powerful, because it enabled the ability to train these VAEs completely end to end through the backpropagation algorithm.
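In code, the reparametrized sampling step is just a few lines; here is a minimal, hypothetical sketch, assuming the encoder outputs a mean and a log-variance:

```python
import torch

def sample_latent(mu, log_var):
    """Reparametrization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, I).

    All of the randomness lives in epsilon, which is treated as a fixed input,
    so gradients can flow back through mu and log_var during training.
    """
    sigma = torch.exp(0.5 * log_var)
    epsilon = torch.randn_like(sigma)   # the diverted source of randomness
    return mu + sigma * epsilon
```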
All right, so at this point we've gone through the core architecture of VAEs, we've introduced these two terms of the loss, and we've seen how we can train it end to end. Now let's consider what these latent variables are actually capturing and what they represent. When we impose this distributional prior, what it allows us to do is sample effectively from the latent space and actually slowly perturb the value of a single latent variable, keeping the other ones fixed, and what you can see
here is that by doing that perturbation, that tuning of the value of the latent variables, we can run the decoder of the VAE and reconstruct the output every time we do that tuning. What you'll hopefully see with this example with the face is that an individual latent variable is capturing something semantically informative, something meaningful, and we see that by this perturbation, this tuning: in this example, the face, as you hopefully can appreciate, is shifting, the pose is shifting, and all this is driven by is the perturbation of a single latent variable, the tuning of
the value of that latent variable, and seeing how that affects the decoded reconstruction. The network is actually able to learn these different encoded features, these different latent variables, such that by perturbing their values individually we can interpret and make sense of what those latent variables mean and what they represent. To make this more concrete, we can even consider multiple latent variables simultaneously and compare one against the other, and ideally we want those latent features to be as independent as possible, in order to get at the richest and most compact encoding.
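A latent traversal like the one shown on the slide can be sketched in a few lines; this is a hypothetical helper, assuming you already have a trained VAE decoder:

```python
import torch

@torch.no_grad()
def latent_traversal(decoder, z, dim, values):
    """Perturb one latent dimension while keeping the others fixed, and decode.

    decoder: any trained VAE decoder mapping z -> x_hat (assumed, not from the lecture).
    z: a single latent vector of shape (latent_dim,).
    dim: index of the latent variable to sweep.
    values: the values to assign to that dimension, e.g. torch.linspace(-3, 3, 9).
    """
    outputs = []
    for v in values:
        z_perturbed = z.clone()
        z_perturbed[dim] = v
        outputs.append(decoder(z_perturbed.unsqueeze(0)))  # decode each perturbed code
    return torch.cat(outputs, dim=0)
```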
So here again, in this example of faces, we're walking along two axes: head pose on the x-axis and what appears to be kind of a notion of a smile on the y-axis, and you can see with these reconstructions that we can perturb these features to change the end effect in the reconstructed space. Ultimately, with the VAE, our goal is to try to enforce as much information as possible to be captured in that encoding; we want these latent features to be independent and, ideally, disentangled.
It turns out that there is a very clever and simple way to try to encourage this independence and this disentanglement. While this may look a little complicated with the math, and a bit scary, I will break down how a very simple concept enforces this independent latent encoding and this disentanglement. All this term is showing is those two components of the loss, the reconstruction term and the regularization term; that's what I want you to focus on. The idea of latent space disentanglement really arose with this concept of beta-VAEs. What beta-VAEs do
is they introduce this parameter beta, a weighting constant that controls how powerful that regularization term is in the overall loss of the VAE, and it turns out that by increasing the value of beta you can try to encourage greater disentanglement, a more efficient encoding, to enforce these latent variables to be uncorrelated with each other. Now, if you're interested in mathematically why beta-VAEs enforce this disentanglement, there are many papers in the literature with proofs and discussions as to why this occurs, and we can point you in those directions.
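In code, the change relative to the standard VAE loss is just a re-weighting of the regularizer; a minimal sketch with hypothetical placeholder values standing in for the terms computed in the earlier loss example:

```python
import torch

# Hypothetical per-batch loss terms, computed as in the earlier VAE loss sketch.
recon_loss = torch.tensor(0.25)
kl_loss = torch.tensor(0.10)

beta = 4.0                                  # beta > 1 pushes toward disentanglement
total_loss = recon_loss + beta * kl_loss    # beta = 1 recovers the standard VAE
```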
But to get a sense of what this actually affects downstream, let's look at face reconstruction as a task of interest. With the standard VAE, no beta term, or rather a beta of one, you can hopefully appreciate that the rotation of the head, the pose, actually ends up being correlated with the smile, the mouth expression and mouth position, in that as the head pose is changing, the apparent smile or the position of the mouth is also changing. But with beta-VAEs, empirically
we can observe that by imposing beta values much greater than one, we can try to enforce greater disentanglement, where now, if we consider only a single latent variable, head pose, the smile, the position of the mouth, in these images is more constant compared to the standard VAE. All right, so this is really all the core math, the core operations, and the core architecture of VAEs that we're going to cover in today's lecture and in this class in general. To close this section, and as a final note, I want to remind you
back to the motivating example that I introduced at the beginning of this lecture facial detection where now hopefully you've understood this concept of latent variable learning and encoding and how this may be useful for a task like facial detection where we may want to learn those distributions of the underlying features in the data and indeed you're going to get Hands-On practice in the software labs to build variational autoencoders that can automatically uncover features underlying facial detection data sets and use this to actually understand underlying and hidden biases that may exist with those data and with
those models and it doesn't just stop there tomorrow we'll have a very very exciting guest lecture on robust and trustworthy deep learning which will take this concept A step further to realize how we can use this idea of generative models and latent variable learning to not only uncover and diagnose biases but actually solve and mitigate some of those harmful effects of those biases in neural networks for facial detection and other applications all right so to summarize quickly the key points of vais we've gone through how they're able to compress data into this compact encoded representation
From this representation we can generate reconstructions of the input in a completely unsupervised fashion; we can train them end to end using the reparametrization trick; we can understand the semantic interpretation of individual latent variables by perturbing their values; and finally, we can sample from the latent space to generate new examples by passing back up through the decoder. So VAEs are looking at this idea of latent variable encoding and density estimation as their core problem. What if we now focus only on the quality of the generated samples, and that's the task that we care
more about? For that, we're going to transition to a new type of generative model called a generative adversarial network, or GAN. With GANs, our goal is really that we care more about how well we generate new instances that are similar to the existing data, meaning that we want to try to sample from a potentially very complex distribution that the model is trying to approximate. It can be extremely difficult to learn that distribution directly, because it's complex, it's high-dimensional, and we want to be able to get around that complexity. What GANs do is
they say, okay, what if we start from something super simple, as simple as it can get: completely random noise. Could we build a neural network architecture that can learn to generate synthetic examples from complete random noise? This is the underlying concept of GANs, where the goal is to train a generator network that learns a transformation from noise to the training data distribution, with the goal of making the generated examples as close to the real deal as possible. With GANs, the breakthrough idea was to interface two neural networks together, one being a
generator and one being a discriminator, and these two components, the generator and discriminator, are at war, in competition with each other. Specifically, the goal of the generator network is to look at random noise and try to produce an imitation of the data that's as close to real as possible. The discriminator then takes the output of the generator, as well as some real data examples, and tries to learn a classification decision distinguishing real from fake. Effectively, in the GAN, these two components are going back and forth, competing with each other.
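To ground this, here is a deliberately tiny, hypothetical sketch of the two networks for flattened image data; real GANs for faces use deep convolutional architectures, but the division of labor is the same:

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 100, 784   # illustrative sizes, not from the lecture

# Generator: maps random noise z to a synthetic data sample.
generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

# Discriminator: maps a data sample to the probability that it is real.
discriminator = nn.Sequential(
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)       # batch of random noise
fake = generator(z)                   # imitation data
p_real = discriminator(fake)          # discriminator's guess: real or fake?
```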
They are trying to force the discriminator to better learn this distinction between real and fake, while the generator is trying to fool and outperform the ability of the discriminator to make that classification. So that's the overarching concept, but what I'm really excited about is the next example, which is one of my absolute favorite illustrations and walkthroughs in this class, and it gets at the intuition behind GANs, how they work, and the underlying concept. Okay, we're going to look at a 1D example: points on a line, that's the data that we're working with. Again, the generator starts from random noise and
produces some fake data; they're going to fall somewhere on this one-dimensional line. Now, the next step is that the discriminator sees these points, and it also sees some real data. The goal of the discriminator is to be trained to output a probability that an instance it sees is real or fake. Initially, in the beginning, before training, it's not trained, so its predictions may not be very good, but over the course of training you're going to train it, and hopefully it will start increasing the probability for those examples that are real and decreasing the
probability for those examples that are fake overall goal is to predict what is real until eventually the discriminator reaches this point where it has a perfect separation perfect classification of real versus fake so at this point the discriminator thinks okay I've done my job now we go back to the generator and it sees the examples of where the real data lie and it can be forced to start moving its generated fake data closer and closer increasingly closer to the real data we can then go back to the discriminator which receives these newly synthesized examples from
the generator and repeats that same process of estimating the probability that any given point is real, learning to increase the probability of the true, real examples and decrease the probability of the fake points, adjusting over the course of its training. Finally, we can go back to the generator again one last time: the generator starts moving those fake points closer and closer to the real data, such that the fake data is almost following the distribution of the real data. At this point, it becomes very hard for the discriminator to distinguish
between what is real and what is fake, while the generator will continue to try to create fake data points to fool the discriminator. This is really the key concept, the underlying intuition behind how the components of the GAN are essentially competing with each other, going back and forth between the generator and the discriminator. In fact, this intuitive concept is how the GAN is trained in practice: the generator first tries to synthesize new examples, synthetic examples, to fool the discriminator, and the goal of the discriminator is to take both the fake
examples and the real data and try to identify the synthesized instances. In training, what this means is that the objectives, the losses, for the generator and discriminator have to be at odds with each other: they're adversarial, and that is what gives rise to the adversarial component in generative adversarial networks. These adversarial objectives are then put together to define what it means to arrive at a stable global optimum, where the generator is capable of producing the true data distribution that would completely fool the discriminator. Concretely, this can be defined mathematically in terms of a loss
objective, and again, though I'm showing math, we can distill this down and go through what each of these terms reflects in terms of that core intuitive and conceptual idea that hopefully the 1D example conveyed. We'll first consider the perspective of the discriminator D: its goal is to maximize the probability that its decisions are correct, that real data are classified as real and fake data are classified as fake. Here, the first term, G(z), is the generator's output, and D(G(z)) is the discriminator's estimate of that generated output
as being fake. In D(x), x is the real data, and so D(x) is the discriminator's estimate of the probability that a real instance is fake, and 1 - D(x) is its estimate that that real instance is real. So in both these cases, the discriminator is producing a decision about fake data and real data, and together it wants to try to maximize the probability that its answers are correct. Now, with the generator, we have those same exact terms, but keep in mind that the generator is never able to affect anything the discriminator's
decision is doing besides generating new data examples. So for the generator, its objective is simply to minimize the probability that the generated data is identified as fake. We then want to put this together to define what it means for the generator to synthesize fake images that hopefully fool the discriminator. All in all, besides the math, besides the particularities of this definition, what I want you to take away from this section on GANs is that we have this dual competing objective, where the generator is trying to synthesize synthetic examples that ideally fool the best discriminator possible.
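As a rough sketch of how this adversarial objective is optimized in practice (hypothetical names; this uses the common convention that the discriminator outputs the probability that its input is real, and the popular non-saturating generator loss rather than the exact minimax term on the slide):

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, latent_dim=100):
    """One adversarial training step; all networks and optimizers are passed in."""
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator update: classify real data as real and generated data as fake.
    z = torch.randn(batch, latent_dim)
    fake = generator(z).detach()
    d_loss = F.binary_cross_entropy(discriminator(real), ones) + \
             F.binary_cross_entropy(discriminator(fake), zeros)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: produce samples the discriminator classifies as real.
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```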
In doing so, the goal is to build up a network, via this adversarial training, this adversarial competition, to use the generator to create new data that best mimics the true data distribution and is completely synthetic new instances. What this amounts to in practice is that after the training process, you can look exclusively at the generator component and use it to create new data instances. All this is done by starting from random noise and trying to learn a model that goes from random noise to the
real data distribution and effectively what Gans are doing is learning a function that transforms that distribution of random noise to some Target what this mapping does is it allows us to take a particular observation of noise in that noise space and map it to some output a particular output in our Target data space and in turn if we consider some other random sample of noise if we feed it through the generator again it's going to produce a completely new instance falling somewhere else on that true data distribution manifold and indeed what we can actually do
is interpolate and traverse between trajectories in the noise space that then map to traversals and interpolations in the target data space. This is really cool, because now you can think about an initial point and a target point and all the steps that are going to take you to synthesize and go between those images in that target data distribution. So hopefully this gives a sense of this concept of generative modeling for the purpose of creating new data instances.
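Here is a minimal, hypothetical sketch of that kind of interpolation, assuming a trained generator like the one sketched earlier:

```python
import torch

@torch.no_grad()
def interpolate(generator, z_start, z_end, steps=8):
    """Walk a straight line in noise space and decode each point with the generator.

    Nearby noise vectors should map to a smooth trajectory of generated samples.
    """
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z_path = (1 - alphas) * z_start + alphas * z_end   # linear interpolation in z
    return generator(z_path)

# Example usage with the generator sketched earlier (hypothetical):
# z_a, z_b = torch.randn(1, 100), torch.randn(1, 100)
# samples = interpolate(generator, z_a, z_b)
```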
That notion of interpolation and data transformation leads very nicely to some of the recent advances and applications of GANs, where one particularly commonly employed idea is to iteratively grow the GAN to get more and more detailed image generations, progressively adding layers over the course of training to refine the examples generated by the generator. This is the approach that was used to generate those images of synthetic faces that I showed at the beginning of this lecture: this idea of using a GAN that is refined iteratively to produce higher-resolution images. Another way we can extend this concept is to extend the GAN architecture
to consider particular tasks and impose further structure on the network itself. One particular idea is to say, okay, what if we have a particular label or some factor that we want to condition the generation on; we call this c, and it's supplied to both the generator and the discriminator. What this will allow us to achieve is paired translation between different types of data. So, for example, we can have images of a street view and images of the segmentation of that street view, and we can build a GAN that can directly translate between the street view and the segmentation.
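A minimal, hypothetical way to express that conditioning in code is to concatenate c onto the inputs of both networks; the paired-translation models on the slides are convolutional and condition on entire images, but the idea is the same:

```python
import torch
import torch.nn as nn

latent_dim, cond_dim, data_dim = 100, 10, 784   # illustrative sizes

# Both networks receive the conditioning information c alongside their usual input.
cond_generator = nn.Sequential(
    nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)
cond_discriminator = nn.Sequential(
    nn.Linear(data_dim + cond_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, latent_dim)
c = torch.randn(16, cond_dim)                     # label / conditioning factor
fake = cond_generator(torch.cat([z, c], dim=1))   # generation conditioned on c
p = cond_discriminator(torch.cat([fake, c], dim=1))
```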
Let's make this more concrete by considering some particular examples. What I just described was going from a segmentation label to a street scene; we can also translate from an aerial satellite image to the road map equivalent of that image, or from an annotation or labels of a building to the actual visual realization, the facade, of that building. We can translate between different lighting conditions, day to night, black and white to color, outlines to a colored photo, and so on across all these cases.
I think the most interesting and impactful to me in particular is this translation between street view and aerial view, and this is used to consider, for example, if you have data from Google Maps, how you can go from a street view of the map to the aerial image of that location. Finally, extending this same concept of translation between one domain and another, another idea is that of completely unpaired translation, and this uses a particular GAN architecture called CycleGAN. In this video that I'm showing here, the model takes as input a
bunch of images in one domain and it doesn't necessarily have to have a corresponding image in another Target domain but it is trained to try to generate examples in that Target domain that roughly correspond to the source domain transferring the style of the source onto the Target and vice versa so this example is showing the translation of images in horse domain to zebra domain the concept here is this cyclic dependency right you have two Gans that are connected together via this cyclic loss transforming between one domain and another and really like all the examples that
we've seen so far in this lecture, the intuition is this idea of distribution transformation. Normally, with a GAN, you're going from noise to some target; with the CycleGAN, you're trying to go from some source distribution, some data manifold X, to a target distribution, another data manifold Y. This is not only really cool but also powerful in thinking about how we can translate across these different distributions flexibly, and in fact this allows us to do transformations not only to images but to speech and audio as well.
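The piece of the CycleGAN objective that enforces this unpaired, two-way mapping is the cycle-consistency term; here is a hypothetical sketch of just that term (a full CycleGAN also trains adversarial losses in both domains):

```python
import torch
import torch.nn.functional as F

# Cycle-consistency idea: two generators, G: X -> Y and F_gen: Y -> X.
# Translating to the other domain and back should recover the original sample.
def cycle_consistency_loss(G, F_gen, x, y, lam=10.0):
    x_recovered = F_gen(G(x))   # X -> Y -> X
    y_recovered = G(F_gen(y))   # Y -> X -> Y
    return lam * (F.l1_loss(x_recovered, x) + F.l1_loss(y_recovered, y))
```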
In the case of speech and audio, it turns out that you can take sound waves, represent them compactly in a spectrogram image, and use a CycleGAN to translate and transform speech from one person's voice in one domain to another person's voice in another domain; these are two independent data distributions that we define. Maybe you're getting a sense of what I'm hinting at, maybe not, but in fact this was exactly how we developed the model to synthesize the audio behind Obama's voice that we saw in yesterday's introductory lecture: what we did was train a CycleGAN to
take data in Alexander's voice and transform it into data in the manifold of Obama's voice. We can visualize what that spectrogram waveform looks like for Alexander's voice versus Obama's voice, which was completely synthesized using this CycleGAN approach. "Hi everybody, and welcome to MIT 6.S191, the official introductory course here at MIT." "Hi everybody..." Okay, I replayed it, but basically what we did was: Alexander spoke that exact phrase that was played yesterday, we had the trained CycleGAN model, and we can deploy it on that exact audio to transform it from the domain of
Alexander's voice to Obama's voice, generating the synthetic audio that was played for that video clip. All right, okay, before I accidentally play it again, I'll jump now to the summary slide. Today in this lecture we've learned about deep generative models, specifically talking mostly about latent variable models: autoencoders and variational autoencoders, where our goal is to learn this low-dimensional latent encoding of the data, as well as generative adversarial networks, where we have these competing generator and discriminator components that are trying to synthesize synthetic examples. We've talked about these core foundational generative methods, but it turns
out, as I alluded to at the beginning of the lecture, that in this past year in particular we've seen truly tremendous advances in generative modeling, many of which have not come from those two foundational methods that we described, but rather from a new approach called diffusion modeling. Diffusion models are the driving tools behind the tremendous advances in generative AI that we've seen in this past year in particular. VAEs and GANs are learning these transformations, these encodings, but they're largely restricted to generating examples that lie close to the data space that they've
seen before. Diffusion models have this ability to hallucinate and envision and imagine completely new objects and instances which we as humans may not have seen or even thought about, parts of the design space that are not covered by the training data. An example here is this AI-generated art, "art" if you will, which was created by a diffusion model, and I think this gets not only at some of the limits and capabilities of these powerful models, but also at questions about what it means to create new instances, what are
the limits and bounds of these models, and how can we think about their advances with respect to human capabilities and human intelligence. So I'm really excited that on Thursday, in lecture seven on New Frontiers in deep learning, we're going to take a really deep dive into diffusion models, talk about their fundamentals, and talk about not only applications to images but other fields as well, in which we're seeing these models really start to make transformative advances, because they are indeed at the very cutting edge and very much the new frontier of
generative AI today. All right, so with that teaser, hopefully having set the stage for lecture seven on Thursday, I'll conclude and remind you all that we now have about an hour of open office hour time for you to work on your software labs. Come to us with any questions you may have, as well as to the TAs, who will be here as well. Thank you so much. [Applause]