So we have now arrived at one of my favorite lectures of the course. I think they're all great, but this one is particularly one of my favorites, and I think today it's also particularly salient, because we're going to be talking about this notion of generative models, or generative AI. We're in a moment where the term generative AI has been propagating a lot and creating a lot of buzz, but today we're actually going to dive in and learn the foundations of what this concept of generative modeling even means. The core idea is that we can build systems that not only look for patterns in data and make decisions about that data, but actually generate brand new instances of data based on those learned patterns. It's an incredibly powerful idea, and I think it gets right at the heart of what we mean by intelligence and what we want these types of systems to be capable of doing. So let's start with a little quiz of sorts. I want you to take a look at these three faces and think about
which one of these faces you believe is real. Is it face A? All right, maybe 5% of people. Face B? Slightly more, maybe 7 to 10%. Face C? Also okay; I'm measuring these percentages relative to the total population, not relative to each other. None? So yes, everyone is getting more intelligent and I can't fool people as easily as in the past, but the truth is they are all fake. None of these faces is real; none of these people actually exist. In fact, they were generated by a generative model called a StyleGAN, and we're going to get to this type of architecture and how it works later in the lecture. But to start from a basic machine learning point of view: when we talk about the types of machine learning or deep learning models we can build, we can make some classifications about what these models are actually doing in terms of the task, the objective, that they're trying to learn. Very broadly, we can think of two classes of models. The first class of models that
we've been talking about so far are what we call supervised models, and they perform supervised learning, meaning that we have data in the form of instances, like images or sequences, and each of those instances is associated with a label. We're trying to learn a mapping between that data and that label, and the goal with supervised deep learning models is to learn a neural network that approximates that mapping. This can be used for classification, regression, segmentation, and so on. Now there's another whole class of problems out there
in machine learning called unsupervised learning problems, and that's going to be our focus in this lecture today. The concept is that with unsupervised learning, our goal is not necessarily to learn a mapping from input x to output y, but to look at x on its own and try to learn patterns and features that define that data distribution, learning the underlying and hidden structure of the data. This is not exclusive to deep learning and neural networks; there are other techniques in statistics and machine learning that can perform this type of unsupervised analysis. The idea is that with this concept of learning the underlying structure of data, what we're ultimately trying to do is build a neural network that captures something about the distribution of that data, learning a model that represents that distribution. How that is realized is in two principal ways. The first is that we can try to build models that perform what we call density estimation, meaning these models are going to see a bunch of different samples of data and learn to fit a function that models the probability distribution of how those data fall in some space. The other thing we can do is take this learned distribution, this learned probability density, and then say: given this underlying probability distribution, what if I draw new samples from it? Can I use this sampling operation to actually generate new data instances? This is the notion of sample generation, and the common point to both of these use cases of generative modeling is the question of how we can learn
a very good model of a probability distribution that is similar to the true probability distribution of our real data. It's a very powerful idea and a framework that's useful in a lot of different settings beyond just generating images of people's faces. In fact, we can use generative models to make the learning of the model itself more intelligent: to use information about which aspects of the probability distribution are more represented relative to others to train models that are more fair and less biased with respect to how they perform on particular facets of the underlying data distribution. The concept here is that a model is traditionally only going to be as good as the data it sees, but if we can use information about which aspects of the data are more or less represented in the overall population, we can think about how to create more fair and representative datasets based on these learned features. This is something that you're going to get hands-on experience with in Lab 2 of the class. Another use case for these probability distributions
and learned generative models is in the context of outlier detection. Oftentimes we have settings where we want to be able to preemptively find failure modes. In the self-driving car example that Alexander previously introduced, there are going to be some instances that could occur on the road that are very rare but really critical, say a person walking in front of the car or an animal crossing the road, and you want the model to be robust enough to handle those instances. By modeling this probability distribution, we can automatically detect whether a new observation is an outlier or whether it falls squarely in the middle of the probability distribution, and then use this information adaptively to improve the model itself. So this is a flavor of the types of applications these generative models can be useful for. In today's lecture we're going to dive deeply into two classes of models; they're both what we call latent variable models, and we'll look at the architectures, give you a sense of how they work, and ultimately try to convey the
fundamentals and some of the math behind these types of modeling frameworks. Before we get to that, I think it's really useful to have a core understanding of what we mean by a latent variable, and the example that I love to use to illustrate this is a story from Plato's Republic, dating back to ancient Greece, known as the myth of the cave. In this myth there is a group of prisoners who are forced to stay in a cave and to face a particular wall, and they can only observe things that are projected in front of them on the wall of the cave. What they're seeing are shadows of objects that actually exist behind them; they only see the shadow projection that appears in front of them on the cave wall. To the prisoners, these shadows are their reality, the only things that they're observing, but we know that in truth there's actually something behind them that is casting
the shadow that the prisoners observe. This is the notion of an observed variable, the shadows, versus a latent variable, something that's hidden, underlying, and actually driving the structure of the problem. Of course this is an analogy, but it gets at the intuition that latent variables are the underlying elements of structure in data, the factors that are driving what it is we're measuring and observing. Our goal with generative models and latent variable models is to learn, in an automated way, something about a distribution that hopefully captures these underlying drivers or factors that result in the observed data that we see. To get a sense of how we can do this with a deep learning, neural network architecture framework, we're going to start by talking about a really simple generative model that we call an autoencoder. The notion here is that we can take data and try to learn a lower dimensional encoding, some set of features that hopefully represents that data faithfully. What I'm showing here is an example
with an image of the digit 2, and the autoencoder is taking examples of this data and trying to learn a lower dimensional feature space that represents it, in a completely unlabeled way. Let's first ask ourselves a question here: if we're trying to learn an encoding, what I'm denoting here as the variable z, why would we want this set of variables z to have a fairly low dimensionality? Yes, to eliminate redundancies, exactly. Other ideas as well? Efficiency, exactly. So the concept here is that we want to eliminate redundancy while being efficient, to basically compress the data to a lower dimensional encoding that hopefully captures a rich representation of that data. Now the question is how we can actually train a model to learn that lower dimensional encoding. The concept of the autoencoder is: well, if we're going to start from high-dimensional data like an image and move down to low dimensional data, maybe we can use reconstruction of the data as a task, an unsupervised task, to teach the model to learn this encoding.
The concept is that the autoencoder takes, let's say, an image, compresses it down to this lower dimensional encoding space, and then performs a decoding back up to the dimensionality of the original data space. Effectively you're learning a lossy reconstruction between your input data and this predicted reconstruction that we're calling x hat. Because this reconstruction is imperfect relative to the original data, we can train the network by comparing the input data x and the reconstruction x hat and asking it to learn to minimize the difference between the two; that's it. This is done using the same type of loss and objective that we saw back in lecture one, something like a mean squared error, which with an image just means comparing, pixel by pixel, the difference between the original data x and the reconstruction x hat. So this is a very straightforward approach, and importantly, notice that in this reconstruction operation our loss depends only on the input data x and the reconstruction x hat; there's no sense of y, no sense of labels here. The last step, which I'm just showing schematically, is taking those individual layers, like convolutional layers, abstracting them away in this diagram, and again showing this concept of autoencoding: the input going down to a lower dimensional latent space z and then decoding back out to the original data space via this reconstruction. This is a really powerful idea that I hope you appreciate for a moment, because what this lower dimensional latent space, the set of latent variables z, allows us to do is learn a feature set that we have not observed directly.
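To make that concrete, here is a minimal sketch of the idea in PyTorch; the layer sizes, latent dimension, and use of a fully connected network are illustrative assumptions, not the exact architecture from the slides:

```python
# A minimal (hypothetical) autoencoder sketch: flatten a 28x28 image,
# compress it to a small latent vector z, and decode back to the original size.
import torch
import torch.nn as nn

latent_dim = 16  # size of the bottleneck z (a design choice / assumption)

encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, latent_dim),             # z: the learned low-dimensional encoding
)
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Sigmoid(),  # back up to the original pixel space
)

x = torch.rand(64, 1, 28, 28)               # a stand-in batch of images
z = encoder(x)
x_hat = decoder(z).view(-1, 1, 28, 28)

# Reconstruction loss: compare input and reconstruction pixel by pixel.
# No labels y appear anywhere; the data supervises itself.
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()
```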
That feature set is entirely learned by the network, but our hypothesis is that it's an effective enough representation to reconstruct the data back out. So when it comes to the earlier question of the dimensionality of this space z, it turns out a really good way to think about this is in terms of orthogonality and also of memory bottleneck and compression. Depending on how big you choose that latent space, that encoding, to be, the quality of your reconstructions and the quality of the generations you can get out is going to be different. You can imagine that the smaller you go, the tighter the memory bottleneck you're forcing, which means more loss and poorer quality reconstructions. So there's a trade-off between how good the samples you get out are and how small you make that latent space. The core concept with this idea of autoencoding is that we're effectively forcing the network to learn a compressed latent representation, and we're doing this via the reconstruction loss, capturing as
much information about the data as possible by having the network learn to decode back out. This is where the name autoencoding comes from: you can think of it as self-encoding, or automatically encoding the data. Now the same concept of encoding down to this bottleneck latent layer is the basis by which we can start introducing a little more power in our ability to actually generate new samples, and to do that we're going to explore the concept of variation and variational autoencoders. With what we just saw, this take-x-and-reconstruct-it setup, it's a very deterministic operation. The latent encoding z is just like any other neural network layer, meaning that once the network weights are finalized, no matter how many times you put this input 2 in, if those weights hold still you're going to get the same reconstruction out. It's a deterministic operation, and it's not super useful if you now want to generate new samples, because all the network has learned is to reconstruct. In order to get more variability
and more sampling, what we need is to introduce some notion of randomness, some notion of an actual probability distribution, and the way that is done is by introducing stochasticity, sampling, into the network itself. This is the difference between a variational autoencoder and a standard autoencoder. The concept is that now, in this bottleneck latent layer, instead of learning those latent variables directly, let's say we have means and standard deviations for each of the latent variables, which let us define a probability distribution for each of those latent variables. What that means is that we've gone from a vector of latent variables z to a vector of means mu and a vector of standard deviations sigma. This is effectively putting a probabilistic twist on the idea of autoencoding, meaning that we can now learn these latent variables defined by a mean and a standard deviation, and hopefully use this to sample from those probability distributions to generate new data instances. Question? Great question: the question is, do we assume that the distributions are normal? The short answer is yes, we do, and
we're going to transition now into explaining why that is a reasonable assumption to make. To get a little further into this: instead of a purely deterministic setup, we now have two halves to this variational autoencoder (VAE) framework. First we have what we call an encoder, shown in green, and second we have a decoder, shown in purple. All the encoder is trying to do is take the data and compute a representation of a probability distribution over these latent variables given the data x; the decoder does the inverse, operating back out: given those latent variables, can we learn the distribution of the data x? These two networks, the encoder and decoder modules, are defined by their own sets of weights: phi represents the weights of the encoder component, and theta represents the weights of the decoder component. The next step is to define an actual loss function and objective that will allow us to learn this network and optimize the VAE, and to break that down we're going to look at two components to that loss.
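As a point of reference, here is a hedged sketch of what those two modules might look like in PyTorch. The layer sizes are illustrative assumptions; the only essential point is that the encoder (weights phi) outputs a mean and a log-variance per latent variable, while the decoder (weights theta) maps latent vectors back to data space:

```python
# A sketch of the two halves of a VAE: an encoder q_phi(z|x) that outputs the
# parameters (mu, log sigma^2) of the latent distribution, and a decoder
# p_theta(x|z) that maps latent samples back to data space.
import torch
import torch.nn as nn

class Encoder(nn.Module):                       # weights phi
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)        # mean of each latent variable
        self.log_var = nn.Linear(256, latent_dim)   # log variance (keeps sigma positive)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):                       # weights theta
    def __init__(self, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid())

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)
```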
The first is the reconstruction loss, which is going to be very similar, if not exactly equal, to the autoencoder loss that we introduced previously; but now we're going to introduce a second term that gets at the probability aspect a little more. First, let's remember that we're trying to optimize our loss with respect to the weights phi and the weights theta. The first component is that reconstruction: with an image, we can take our input data x and the reconstruction x hat, directly compare them pixel-wise, and compute the reconstruction error. Now we have to introduce another term that gets a little more interesting, what we call the regularization term. The reason for it is that we need to make some assumption, as the question asked, about what this probability distribution actually looks like. Effectively, with regularization we place a prior, a guess, a hypothesis, on what we expect this distribution of latent variables z, defined by mu and sigma, to be like. The regularization term is then saying: we're going to take that prior and compare how well the representation learned by the encoder
matches that prior: how closely does it match our assumption about, let's say, a normal distribution? The concept is that it's capturing the distance between the learned, inferred latent distribution q and some prior, some guess we have about what those latent variables should look like. Now the question comes: how do we choose that prior? Question: yeah, how do we choose that prior? The most common choice, and what's done very frequently in practice, is to assume and place the prior of a normal distribution, a standard normal Gaussian, meaning that we assume each latent variable is centered with mean zero and has a standard deviation and variance of one. The rationale is that this will hopefully encourage the model to distribute the encodings of these latent variables roughly evenly around the center of the latent space, and we can further penalize the network when it tries to cheat and diverge too far from this prior. I'm going to go into a more intuitive explanation of why this normal distribution prior makes sense, but first, very briefly, to touch on what the form of this regularization term actually looks like mathematically: there's a metric that we call the KL divergence, which is effectively a distance function that says, I have one distribution and I have a second distribution, how closely do they match? That's all the KL divergence is trying to compute. When we place the prior of a normal Gaussian, the KL divergence takes a pretty clean form; you don't need to concern yourself too much with the details.
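For those who want the details, the standard closed form for this term, with a diagonal Gaussian encoder distribution and a standard normal prior (presumably the expression shown on the slide), is:

```latex
D_{KL}\!\left(\mathcal{N}(\mu, \sigma^{2})\,\|\,\mathcal{N}(0, 1)\right)
  = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right)
```

where the sum runs over the d latent dimensions; this is the quantity added to the reconstruction error in the VAE loss.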
You should just know conceptually that we're looking at this distance and regularizing the network to minimize the distance between the prior and the learned latent distribution. So now, the question that's been raised a couple of times: how do we choose this normal prior, why do we choose it, and what does it effectively do? To break that down, let's consider what properties we want this regularization to actually achieve. The first is that we want our latent space to be continuous, meaning that points that are close together in the latent space z should be similar after you decode back out, giving related content, related images, let's say multiple different types of twos, for example. Secondly, when we go to use our VAE to actually sample, we want completeness, meaning that whenever we draw a sample from the latent space we hopefully get something reasonable back out; we don't want samples to result in something nonsensical. With these two criteria in mind, the concept is that
without regularization we end up sacrificing both of these properties. Things that are close in the latent space may not be similarly decoded, meaning there's a lack of continuity. For example, if we have this green point and this orange point in the latent space, one may end up decoding to a square and the other to a triangle, even though they're close in the latent space; so objects that are actually dissimilar are close in the latent space, and we don't want that. Rather, we want things that are similar to be close in the latent space. If we impose some form of regularization, what this amounts to is that points that end up close in the latent space are similar when they're decoded, and they're decoded meaningfully: we're going to get roughly triangle shapes out of both of these two points that are located very close to each other in the latent space. That gives us a sense of why regularization can be useful. Why does the normal prior help? Because without this regularization, without imposing the normal prior, there is not much structure to what these distributions of the latent variables actually look like, meaning you could get very small variances that end up as pointed distributions in the latent space with different means, and when you try to sample you're going to get discontinuities. Furthermore, on the inverse, if you have very large variances, that destroys any sense of difference, when you have things that are just covering the latent space
very broadly. So again, illustrating this visually: if we get very pointed distributions from not regularizing, with small variances and different means, this can lead to discontinuities that we don't want. The high-level intuition with the normal prior is that it imposes regularization toward being roughly centered, mean zero, standard deviation of one, to encourage the network to satisfy these desired properties of continuity and completeness. Okay, any questions on that? I know that's kind of a hard intuition to wrap your minds around. Yes: is there a fundamental relationship between the weights in the decoder and what's learned in the encoder? So the question is whether there's a fundamental relationship between the weights in the decoder and the weights in the encoder. The answer is yes, and that is because the network is learned completely end to end, meaning we don't train the encoder separately and the decoder separately; we learn them all at once, so they're fundamentally linked, and that pattern of linkage is what is actually learned by the model in training. Another question: are the latent variables completely abstract, or do they correspond to anything physical; do we even look at the hidden variables?
Absolutely. So the question is how we can interpret the latent variables. The answer is that we can interpret them, and they're not necessarily always abstract; in fact, that's kind of the whole point of this encoding and representation learning type of approach: you want to extract some notion of structure in those latent variables, but do it in a data-driven way. I'll show a technique for how we can do that. One way you can start to interpret the latent variables is to hold all but one fixed, and then perturb, change, the value of that one latent variable while holding all the others constant, decode and reconstruct the input based on that perturbation, and then examine the instances you get out, how they're changing, and use that to back-interpret what that latent variable was capturing. I'll show an example of this, and it's also a hands-on example in the software lab. I know there are a couple of questions, but I'm going to keep progressing.
Our next point and section of this lecture gets at this notion of how we actually train this model end to end to learn these sets of weights. We have this loss function comprised of the reconstruction term and the regularization term, and our goal is to learn these sets of weights completely end to end using this loss function. There's a problem, though, and a little bit of a nuance here, in that by introducing this notion of sampling and the mean and standard deviation terms, we don't have a direct way to backpropagate gradients when we want to capture something probabilistic. So with VAEs we employ a really clever trick that effectively diverts the sampling operation by reparameterizing the latent variable equation slightly, so that we can train the model end to end. Rather than saying that our latent variable z is directly a sample of a normal distribution defined by mu and sigma squared, we offset all the randomness, all that sampling, to another variable epsilon, and say our latent variables are now the sum of a fixed vector of means mu and a fixed vector of standard deviations sigma, where sigma is scaled by a random constant epsilon drawn from a standard normal distribution that captures the stochasticity we want out of our sampling operation. Fixed mean, fixed standard deviation vector; as a result, this diverts the sampling operation away from the latent variable z, which we could not backpropagate through before, and instead we learn a fixed set of means and a fixed set of standard deviations and divert the stochasticity to this variable epsilon. This allows us to backpropagate gradients through and directly learn the latent variables via this mean and standard deviation vector. This is called the reparameterization trick of a VAE, and it's ultimately the solution that allows the network to be trained end to end.
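Putting the pieces together, here is a minimal sketch of a single VAE training step, assuming the hypothetical Encoder and Decoder modules sketched earlier; the optimizer settings and the equal weighting of the two loss terms are illustrative choices:

```python
# Reparameterization trick plus the two-term VAE loss, as one training step.
import torch

def reparameterize(mu, log_var):
    sigma = torch.exp(0.5 * log_var)       # standard deviation from log variance
    eps = torch.randn_like(sigma)          # all randomness lives in epsilon ~ N(0, 1)
    return mu + sigma * eps                # z is a differentiable function of mu and sigma

encoder, decoder = Encoder(), Decoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 1, 28, 28)              # stand-in batch of images
mu, log_var = encoder(x)
z = reparameterize(mu, log_var)
x_hat = decoder(z)

recon = torch.nn.functional.mse_loss(x_hat, x)                                   # reconstruction term
kl = -0.5 * torch.mean(torch.sum(1 + log_var - mu**2 - log_var.exp(), dim=1))    # regularization term
loss = recon + kl                          # gradients flow through mu and sigma, not through eps

opt.zero_grad()
loss.backward()
opt.step()
```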
Now, when it comes to having a trained VAE, a trained model, and trying to understand what the latent variables actually represent, what we can do is keep all variables except one fixed, slowly perturb the value of that single latent variable incrementally, and use the decoder component to decode back out to our original data space. In this example with the face, you're seeing that come to life: the face is being reconstructed following a perturbation of a single latent variable, which you can hopefully see is capturing the head pose, the tilt of the face in the image. What this gets at is that the different dimensions, the different latent variables, are trying to encode different features that are hopefully interpretable.
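In code, that perturbation analysis might look something like the following sketch, reusing the hypothetical encoder, decoder, and reparameterize function from above; the choice of which dimension to vary and the sweep range are arbitrary:

```python
# Latent traversal: hold every latent variable fixed except one, sweep that
# dimension across a range of values, and decode each point to see what
# feature (e.g., head pose) that variable has captured.
import torch

dim_to_vary = 3                                  # which latent variable to perturb (arbitrary choice)
with torch.no_grad():
    mu, log_var = encoder(x[:1])                 # encode a single reference image
    z = reparameterize(mu, log_var)

    sweeps = []
    for value in torch.linspace(-3.0, 3.0, steps=9):
        z_perturbed = z.clone()
        z_perturbed[0, dim_to_vary] = value      # change only this one coordinate
        sweeps.append(decoder(z_perturbed))      # decode back to image space
# Plotting the decoded images side by side visualizes what this dimension encodes.
```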
Ideally, with this idea of orthogonality and feature learning, you want those latent variables to be as orthogonal, as uncorrelated, to each other as possible, because then you're maximizing the amount of information your model is learning across this small set of latent variables. This is the notion of disentanglement, and it's a very common question in training these VAE-type models: how can you disentangle distinct latent variables, something like head pose versus smile in the example shown here? It turns out that in practice there's a pretty straightforward way to encourage disentanglement when you learn a VAE, and the solution is basically to weight the reconstruction term relative to the regularization term, tuning the relative strength of these two components of the loss. There's a slight variant of a VAE called a beta-VAE that just says: if we put a weight on the regularization term that's greater than one, it constrains the latent bottleneck so that the model is encouraged to disentangle these distinct variables, and you can get more interpretable, better disentangled features.
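Relative to the loss sketched earlier, the beta-VAE change is a single hyperparameter; the value of beta below is just an illustrative assumption:

```python
# Beta-VAE: scale the regularization (KL) term by beta > 1 to encourage
# disentanglement; beta = 1 recovers the standard VAE loss.
beta = 4.0          # illustrative value; in practice beta is tuned
loss = recon + beta * kl
```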
Greater weight on the regularization term encourages stronger disentanglement, and there are more details in the paper that go into the math of why this works, but the concept is that by imposing stricter regularization you can try to disentangle these variables more strongly. Okay, so to summarize this part of the talk and to wrap up this concept of latent variable models: you're going to get some hands-on experience in the software lab playing with these types of models in the context of computer vision and facial detection systems, and see how we can interpret these types of features and
use those features to actually create better models that are going to be more fair and more unbiased. Finally, to close, the summary of the key points and concepts behind VAEs: fundamentally, this notion of latent variable models is about compressing data into a low dimensional encoding that we can use to both generate new samples and understand the features underlying that data. This framework allows for completely unsupervised reconstruction, so we don't need labels associated with the data; we can use this notion of reparameterization to train the model end to end; we can interpret the latent variables via this perturbation analysis; and we can use the learned latent variables to sample from that space and generate new examples. So with VAEs, the core idea is that this latent variable space effectively estimates a representation of that probability distribution I mentioned at the beginning. Sometimes, though, we want to prioritize the ability to generate very faithful samples while sacrificing the ability to learn those features in an interpretable or probabilistic way, and so there's another, complementary class of models that
we call generative adversarial networks, or GANs, that are more focused on the question of how we can generate samples that are very good, sacrificing the ability to decode and interpret a set of latent variables. The idea behind GANs is that rather than trying to explicitly model these probability densities, let's just learn a transformation from something very simple, say completely random noise, and learn a model that can transform that distribution of noise to the real data distribution, such that we can then use that learned model to generate samples that hopefully fall somewhere in the real data distribution. GANs initially introduced a lot of excitement about this idea of generation from completely random noise, which seems kind of wild when you think about it that way. The intuition behind GANs is pretty clever: the concept is that we're going to put two components of the network, what we call a generator and what we call a discriminator, in competition with each other; they're going to act as adversaries. We're going to first have a generator component that looks at
completely random noise and attempts to sample from this noise and, let's say, upscale it to the real data distribution. At first its generations are going to be pretty bad; they're not going to be very good. Then we're going to have another model, a discriminator, come in and look at those generated instances, compare them to real data, and say: is this real or is this fake? By linking these two together and training them jointly, the concept is that we can induce the generator to create better and better examples that will soon be good enough to fool the discriminator into not being able to tell what's fake and what's real; they're effectively warring, competing with each other, in this framework. To break down how this works, one of my absolute favorite illustrations, not just in this class but ever, is as follows, and I think it gives a strong intuition for how GANs work. We're going to consider points along this line. We have our generator here on the right; it's going to start from completely random noise and try
to create an imitation of data; let's say these orange points are those fake data samples drawn from noise. Now the discriminator is going to look at these points, and it's also going to see instances of real data, and we're going to train it to produce a probability that the data it sees is real or fake. In the beginning, it's not trained, so its predictions are not going to be very good, but then let's say we train it, we show it some more examples, and it starts increasing the probabilities of what is real and decreasing the probabilities of what is fake. We continue this, training a normal classifier model, a discriminator, until, using these real instances and these fake instances, the discriminator is perfect: it can perfectly tell us what's fake and what's real. Now the generator comes back, and it sees where those real data instances lie, and the objective we supply to the generator is: move your samples closer and closer to these real instances. So it can start doing that, generating new samples that are hopefully, if
our objective is good, closer and closer to the real data. Now the discriminator sees these new samples, receives these new points, and its decision is not going to be as strong, not as clear, but again it's estimating the probability that each of these points is real, learning to decrease the probability of the fake points and increase the probability of the real points. Once again we repeat this process. One last time, the generator starts moving those fake points closer and closer to the real data, such that the fake data are almost overlapping with the real data. If the discriminator were now to come back and look at these samples on the right, it's not going to be able to tell what is real and what is fake, and that would indicate that the generator has succeeded in learning an objective to generate samples that mimic the real data distribution. So this is the summary and the concept of the framework behind the setup of a GAN: you have a generator component, this network G, that is trying to synthesize fake data instances to fool the discriminator, while conversely the discriminator is trying to identify the synthesized fake instances from examples of both real and fake. In training, the objective, the loss function that's actually set up for a GAN network, composes these two competing objectives for the generator and the discriminator, and if we succeed in training the GAN overall, we would reach an optimum where the generator is able to perfectly reproduce the true data distribution and the discriminator has no power at all in classifying these instances.
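A hedged sketch of those two competing objectives, written as an alternating training step with simple fully connected networks; the architectures, optimizers, and the non-saturating binary cross-entropy form are illustrative assumptions, not the exact setup from the slides:

```python
# Adversarial training step: D learns to separate real from generated samples,
# while G is trained to make D label its samples as real.
import torch
import torch.nn as nn

noise_dim, data_dim = 32, 28 * 28
G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

x_real = torch.rand(64, data_dim)                 # stand-in batch of real data
z = torch.randn(64, noise_dim)                    # random noise input
x_fake = G(z)

# Discriminator step: push real toward 1, fake toward 0 (fake detached so only D updates).
d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake.detach()), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make D output "real" (1) for generated samples.
g_loss = bce(D(x_fake), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```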
It turns out that in practice this is pretty hard to do; GANs are notoriously difficult to train, and in the years since, there have been a lot of works in the field proposing better loss functions and other tricks to stabilize the training of GANs, but in practice it's quite a difficult class of models to train. So there are things we can do to make GANs better, and there are also new types of models we can introduce that are more robust, easier, and more stable to train than this framework with GANs. At the end, let's say we've trained our GAN model; now what do we do with it? To actually use the GAN to generate new instances, all we have to do is go back to the generator component of the model, start with points in that noise distribution, and pass them through the generator to create new data instances; here, sampling just means picking different points from the initial distribution of random noise. The concept is that the GAN is effectively learning a transformation of a distribution, moving from Gaussian noise, pure random
noise, to this learned target distribution. What is really cool is that we can think about how to traverse the space of this learned distribution, meaning that different samples from Gaussian noise are going to end up in different places in the data world, and the GAN is learning a mapping that allows us to do this transformation. It turns out that you can actually traverse and interpolate in the starting space of Gaussian noise and generate samples that then traverse across the target data distribution.
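Concretely, once training is done, generation and interpolation only involve the generator; here is a small sketch reusing the hypothetical G and noise_dim from the GAN snippet above:

```python
# Sampling from a trained GAN is just: draw noise, pass it through G.
# Interpolating between two noise vectors and decoding each step traverses
# the learned data distribution.
import torch

with torch.no_grad():
    z1, z2 = torch.randn(1, noise_dim), torch.randn(1, noise_dim)
    samples = []
    for alpha in torch.linspace(0.0, 1.0, steps=8):
        z = (1 - alpha) * z1 + alpha * z2    # walk a straight line in noise space
        samples.append(G(z))                 # each point decodes to a nearby-looking sample
```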
The results of this are pretty cool, because you can apply the same type of initial perturbation, starting from your noise sample, to then produce progressively similar images across the target distribution. So GANs are a really neat framework to think about, and they are indeed the architecture that was used to generate some of those face images that I showed on the first slide. There are different advances that have been proposed from a modeling perspective that allow better results to come from GANs. One neat thing that has been done, which I'll spend a little bit of time on, is thinking about how we can better control the generative process of a GAN; let's say we want to condition on some different type of information and generate samples accordingly. What we can do is supply not only the random noise as the starting point, but also other forms of conditioning factors that can then guide the generations from the GAN. One core idea here is the notion of paired translation, meaning that we can consider pairs of inputs, for example a scene and a corresponding segmentation map, and now train the discriminator to operate not over single images but over pairs, and say: what's a true pair of an image and a segmentation map versus a fake pair of an image and a segmentation map? In practice, what this allows us to do is these paired-translation conversions, where the generator can take in an input of, say, a segmentation map of a scene and generate the corresponding real-world street-view scene, or take a map view and generate a grid or aerial-style view.
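A hypothetical sketch of that pair-wise discriminator idea: the conditioning input and the candidate output are stacked along the channel dimension, so the discriminator scores the pair as a whole. The convolutional architecture and image sizes here are placeholder assumptions:

```python
# Paired-translation (conditional) discriminator: judge (input, output) pairs,
# not single images, by concatenating them along the channel dimension.
import torch
import torch.nn as nn

pair_D = nn.Sequential(
    nn.Conv2d(3 + 3, 64, kernel_size=4, stride=2, padding=1),  # 3 channels per image in the pair
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, kernel_size=4, stride=2, padding=1),
)

segmentation = torch.rand(1, 3, 128, 128)   # stand-in conditioning input
street_view = torch.rand(1, 3, 128, 128)    # stand-in target image
score = pair_D(torch.cat([segmentation, street_view], dim=1))  # real-pair vs. fake-pair score map
```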
You can see different instances of this paired translation type of generation: day to night, black and white to color, and so on and so forth. A very common way this is employed is with maps, translating between, say, a map view and an aerial view of an image and vice versa. A very related idea is this notion of: what if we don't have explicitly linked pairs, but we have two related data distributions, and we again want to learn some relationship, some transformation, that links those two domains? An architecture that was proposed to do this is what is called a CycleGAN, which says: let's say we don't have explicit pairs, but rather a bunch of instances in one domain and a bunch of instances in another domain, unpaired; how can we learn a relationship mapping these two distributions together? The concept of the CycleGAN is that we can impose a cyclic loss linking the discriminator and the generator in one domain to those of the other.
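The heart of that cyclic loss can be sketched in a few lines; the generators below are placeholder single-layer networks standing in for the real image-to-image architectures, so this only illustrates the structure of the objective:

```python
# Cycle-consistency: with generators G_xy (X -> Y) and G_yx (Y -> X), translating
# to the other domain and back should return the original image, even without pairs.
import torch
import torch.nn as nn

G_xy = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in horse -> zebra generator
G_yx = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # stand-in zebra -> horse generator

x = torch.rand(1, 3, 128, 128)                     # an unpaired image from domain X
y = torch.rand(1, 3, 128, 128)                     # an unpaired image from domain Y

cycle_loss = (torch.nn.functional.l1_loss(G_yx(G_xy(x)), x) +   # X -> Y -> X should recover x
              torch.nn.functional.l1_loss(G_xy(G_yx(y)), y))    # Y -> X -> Y should recover y
# In the full model this term is added to the adversarial losses in each domain.
```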
In the example that I'm showing here, what ends up being accomplished in practice is that you can do these transformations where, let's say, you have images of horses and images of zebras; the model shown here is a CycleGAN that has learned a mapping from the horse domain to the zebra domain and is now generating those instances accordingly. Notice in this example that there's not only the transformation of the skin of the horse to stripes, but the grass also looks different, less green; it's an entire transformation of the image distribution itself. Again, I think this concept of domain translation, domain transformation, gets at the notion that GANs are very effective ways to learn transformations of data distributions: we can go from learning a transformation from Gaussian noise to our target data distribution, or, with CycleGANs, from one data space X to a target data space Y and back. Earlier, in the prior lecture, there were some questions about how we actually generated that audio: we built a CycleGAN model that operates on images of spectrograms, which are the conversion of an audio waveform into a spectrogram image, and it effectively learned to transform speech from one domain to speech of another domain. Following this conversion of the audio waveforms to spectrogram images, the CycleGAN was trained to perform this conversion; specifically, what we did was take recordings of Alexander's voice, Alexander speaking, convert them to spectrogram images, build our CycleGAN model, take recordings of Obama's voice, do the same thing, and then learn the CycleGAN model to perform this domain transformation between audio of Alexander's voice and audio
of Obama's voice. So today we've talked about these two classes of deep generative models, specifically focusing on latent variable models that can learn lower dimensional representations and encodings of the data, and then this concept of putting adversarial networks together to be able to generate new data instances. VAEs and GANs are really the two deep learning frameworks that brought generative modeling and generative AI to life a few years back, and since then the field has really expanded and taken off. There is a particular class of models that has led to quite tremendous advances in generative capabilities, not just in images but in many other domains, and this is an image instance from one such model: a class of models called diffusion models, which are very closely related to VAEs in concept and in how they're built, but which have the ability to generate new instances with very high fidelity and also to be guided and conditioned on different forms of input, like text or other modalities, to impose more control over the generative process itself. Tomorrow we're going to talk about how these models work and about uses of them not only in images but in other domains as well. With that, I'm going to close out, and we'll move on to getting some hands-on experience with these latent variable models in the context of computer vision and facial detection in particular. Thank you.