Okay, hi everyone, and welcome back to day three of Intro to Deep Learning. Today is a really exciting day because we're going to learn how we can fuse two really important topics in this field: we're going to look at how we can marry reinforcement learning with deep learning, which is the main topic of this class. This marriage between the two topics is really exciting because it moves away completely from the paradigm we've seen so far in this class. Instead of the deep learning models being shown fixed data sets that don't change over time, and supervising them, teaching them based on those data sets, we're now going to look at problems where the deep learning model explores and interacts with its data in a dynamic way, so it can learn to improve based on the scenarios and environments it is placed in, which themselves evolve over time. And the goal, of course, is to do this entirely without any human supervision. Previously our data sets were largely created through some human supervision, but now we're going to look at scenarios where these models interact with their environment in an unsupervised manner, and because they're not constrained by human supervision, they can actually learn to reach superhuman performance. This has huge implications and impact in many fields: to name a few, there's robotics and self-driving cars, but on the other side there are also strategic scenarios like game playing and problem-solving settings as well. So we're going to get a flavor of all of these different topics and how we can deploy these new types of algorithms in these different kinds of environments. To start, I want to take a step back and think about how this topic of reinforcement learning is different, at a high level, from all of the other topics we've seen so far in this class.
So in day one, we focused primarily on a class of learning problems called supervised learning. In supervised learning, which covered day one and also the first lecture of day two, you have input data X and you also have output labels Y. The goal of supervised learning, as we saw in those first lectures, is to learn a functional mapping that goes from X to Y. You're shown a bunch of examples of input data, you're shown a bunch of labels Y, and you want to build some function in the middle that can transform X into a reasonable Y. So for example, if I showed you this picture of an apple on the bottom, you would classify this picture as an apple. That's supervised learning: you show a bunch of these pictures, you train the model to say these are apples, these are oranges, these are bananas, and when you see a new picture of an apple you have to
correctly classify it in the last lecture of day two we discussed a completely new paradigm of problems called unsupervised learning problems and this is where you only have access to the data only the X's you don't have any y's you don't have any labels and you want to try now to learn a model not to transform X's to y's X's to labels but now you're just trying to learn a model that can capture the underlying structure of X of your original data so for example going back to the Apple example on the bottom using unsupervised
learning, we can now show our model a lot of data of these different types of images, and the model should learn to essentially cluster them into different categories. It should put images that look similar to each other close together in some feature space or latent space, as we saw yesterday, and things that are dissimilar should end up far away from each other. The model here doesn't know that this thing is called an apple, but it recognizes that it shares similar features with this other thing, so they should be close to each other. Now, in today's lecture on reinforcement learning, we're going to talk about yet another completely new paradigm of learning problems for deep learning. Here you have data not in the form of inputs and labels; instead you're shown data in the form of states and actions, and these come as paired pieces of data. States are the observations of your agent (we'll get back to what those terms mean in a second), and the actions are the decisions that the agent takes when it sees itself in a certain state. The goal of reinforcement learning, said very simply, is to create an agent, a model, that can learn how to maximize the future rewards it obtains over many time steps into the future. This is again a completely new paradigm of learning problem: there are no ground truth labels in reinforcement learning, you only have these state-action pairs, and the objective is to maximize some reward function. So
now in this apple example one more time we might see this picture of an apple and the agent would have to learn you know to eat this thing because it has learned that it gives it some nutrition and it's good for it if it eats this thing again it knows nothing that this thing is an apple it knows nothing about kind of like the structure of the world but it just has learned some actions to do with these objects by interacting with the world in a certain way so in today's lecture that's going to be
the focus of what we talk about, and we'll explore it in greater detail. But before we go any further, I think it's really important for us to dissect and really understand all of the terminology, because since this is a new type of learning problem, there are a lot of new pieces of terminology associated with reinforcement learning that we don't really have in supervised or unsupervised learning. So I want to start by building up some vocabulary with everyone, since it will be necessary for the rest of this lecture and we'll keep referring back to all of these pieces. I'll start with the agent. An agent is something that can take actions. For example, in autonomous delivery of packages, a drone would be an agent; in Super Mario games, Super Mario would be the agent. The agent, or the algorithm itself, is the thing that takes the actions, so in life, for example, we are all agents. The next piece of vocabulary is the environment: you should think of this as simply the world in which the agent lives and can take actions. The agent can send commands to its environment in the form of these actions, and just for formality, let's call A the set of all possible actions that this agent could possibly take. So in a very simplified world, say a two-dimensional world where the agent can move forward, backward, left, or right, the set of all possible actions is those four actions. Now, coming back in the opposite direction, observations are simply how the environment interacts back with the agent. Observations are essentially the states that the environment sends and shows to the agent, how the agent observes the world. A single state is just a concrete and immediate situation in which the agent finds itself, the immediate observation the agent is presented with. And finally, here's another piece that is specific to reinforcement learning: the reward. In addition to providing states from the environment to the agent, the environment will
also provide a reward. A reward is simply the feedback which the environment provides to measure the successes, failures, or penalties of the agent at that time step. So what are some examples? In a video game, when Mario touches a gold coin, the points go up, so he gets a positive reward. What are some other examples? If he were to jump off a cliff and fall to the bottom, he gets a very negative reward, a penalty, and the game is over. Rewards can be immediate, in the sense of the gold coin: touch the gold coin and you get an immediate reward. But they can also be delayed, and that's a very important concept. You may take some actions that result in a reward much later on; they were critical actions you took at this time step, but the reward was delayed and you don't receive it until much later. It's still a reward.
So we can now look at not only the reward at a single time step, which is little r of t, but also at the total reward, which you can think of as the sum of all rewards the agent collects from that time onward. We'll call this capital R of t: the sum of all of the rewards from time t into the future. If we expand it, it looks like this: it's the reward at time t, plus the reward at time t plus one, and so on. Now, it's often useful to consider not only this sum of rewards, but also what's called the discounted reward over time. So what does that mean? The discount factor, denoted here by this gamma term, is typically a fixed term: you have one discounting factor in your environment, and you can think of it as a factor that dampens the effect of a reward over time. Why would you want to do this? Essentially, a discounting factor is designed to make future rewards worth much less than immediate rewards. What's an example of this kind of preference for near-term reward? If I were to offer you a reward of $5 today, or a reward of $5 in one year, it's still a reward of $5, but you have an implicit discounting factor which makes you prioritize the $5 today over the $5 in one year's time. That's what the discounting factor does. The way you apply it in this framework is that you multiply it against the future rewards discovered by the agent, in order to dampen the rewards that arrive further in the future. Again, this is just meant to make future rewards worth less than immediate rewards.
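As a minimal sketch of this idea (illustrative Python, not code from the lecture), the discounted return from a given time step onward could be computed like this, assuming the rewards are just a Python list and gamma is a fixed discount factor:

def discounted_return(rewards, gamma=0.99):
    # R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# A reward received now versus the same reward received two steps later:
print(discounted_return([5.0, 0.0, 0.0]))   # 5.0
print(discounted_return([0.0, 0.0, 5.0]))   # about 4.9, worth slightly less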
Now, finally, there's one more really important concept that's critical to understand as part of all of this, and that's called the Q function. I think it's also important that we look at the Q function in the context of all of the terminology we've covered so far, so let's see how it's defined in relation to the variables we just explored. Remember that the total reward, also called the return of your agent, is capital R of t: the discounted sum of rewards from that time point on. Now, the Q function is a function that takes as input the current state that you're in and a possible action that you take from that state, and it returns the expected total future reward, the return, that the agent can receive from that point onward. Let's think about that a little more, just to digest it. Given a state that you're in and an action that you take from that state, the Q function tells you the expected amount of reward you will get from that point on if you take that action. If you change the action a of t in the current state, your Q function may tell you that this other action will actually return even more reward, or maybe less reward, than the previous one. So this Q function is a critical function that allows you to do a lot in reinforcement learning.
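Written out in standard notation (a summary consistent with the definitions above, not a formula shown verbatim here):

R_t = \sum_{i=t}^{\infty} \gamma^{\,i-t} r_i = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots

Q(s_t, a_t) = \mathbb{E}\left[\, R_t \mid s_t = s,\; a_t = a \,\right]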
So now the question is: assume for now that we're given this magical Q function. I give it to you, and you can query it, you can call it whenever you like. How would you, as the agent, take actions to solve an environment? What actions would you take in order to properly solve it? Any ideas? Yes: just keep maximizing the Q function. Right, so what actions would you take? You would take the ones that maximize the Q function. At every time step, what you have to recognize from the agent's point of view is that you ultimately want some policy. So what we're thinking of here is a policy function, which is slightly different from the Q function. The policy function, which we'll denote with a pi, takes as input only the state s, and it says what the optimal action is to take in that state. We can actually evaluate the policy function using the Q function, and as was just stated, this can be done by choosing the action which maximizes the future return of the agent. So the optimal policy function pi, how to act in the current state s, is simply defined by taking the action which gives you the highest Q value: you evaluate all possible actions with your Q function and pick the one that gives you the highest expected future reward.
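As a minimal sketch (illustrative Python, with a toy Q function standing in for the "magical" one we assumed was given), acting greedily with a Q function looks like this:

def greedy_policy(q_function, state, actions):
    # Evaluate Q(s, a) for every possible action and pick the best one.
    return max(actions, key=lambda a: q_function(state, a))

# Toy example: a hypothetical Q function over two actions in some state.
toy_q = lambda state, action: {"left": 1.0, "right": 2.5}[action]
print(greedy_policy(toy_q, state="some state", actions=["left", "right"]))  # "right"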
Now, in this lecture, what we're going to focus on at a high level is two different ways we can think about solving this reinforcement learning problem, and about this Q function in particular. There are two broad categories for learning this Q function, because so far I've only said "if I give you the Q function"; I haven't told you how we actually create it. These two categories can be split up such that on one side we have value learning algorithms, and on the other side we have policy learning algorithms. The second class, the policy learning algorithms, directly tries to learn not the Q function but the final policy function itself, which, if you think about it, is a much more direct way of accomplishing the goal: ultimately what the agent wants is to see a state and know which action to take in that state, and that is exactly what the policy function gives you. But first let's focus on the value learning side of things; we'll cover that, and
then we'll build up our foundations to go to policy learning. In order to cover value learning, we first have to dig a bit deeper into this Q function; it's a really critical function, especially on the value learning side of the problem. First, I'll introduce the Atari Breakout game. Probably many of you have seen it before, but in case not, I'll quickly describe it. The concept of the game is that the agent is this paddle, the horizontal line on the bottom. It can take roughly two actions: it can move left or it can move right (it can also stay still and not move at all, but for simplicity let's just say two actions, left or right). The environment is this two-dimensional world, and there's a ball coming toward the paddle as a projectile. The objective of the paddle is to move left and right to reflect and bounce the ball back up, because you want to hit as many of the blocks at the top as possible.
Every time you hit one of those colored blocks on the top, you break it off and you get a point. The objective of the game is to remove all of the colored blocks on the top, and you do that by continually moving around, hitting the ball, and breaking them off one at a time. So the Q function here is going to tell us the expected total return of our agent: given a certain state and a certain action that the paddle could take in that state, which action would be the optimal one to take, the one that returns the most reward in expectation by breaking off as many of those blocks on the top as possible? Let me show a very quick example. We have two states and actions here, and for each state you can also see the action taken by the paddle. Here is state A, where the ball is coming basically straight down onto the paddle, and the paddle chooses the action of not moving at all, just reflecting the ball back up. And we have state B, where the ball is coming in at an angle, almost missing the paddle, and the paddle is moving towards it; it might miss it, it might get it, but it's trying to catch up and reach the ball. So we have these two state-action pairs, and each of them can be fed into the Q function.
If we were to evaluate the Q value of each of these state-action pairs, a question for all of you: which state-action pair do you think will return a higher expected reward for this agent? Which one is a better state-action pair to be in, A or B? Let's see, raise your hands. Okay, someone who said A, can you explain the answer? Right, it seems pretty safe: you will not lose the ball, it just bounces straight back, and there will obviously be some bricks broken, so you do get points. Exactly, the answer is that it's a very safe, very conservative action; you're definitely going to get some reward from this state-action pair. Now what about B, anyone answering for B? If you choose B, there's a possibility that the ball could ricochet and you can hit more blocks; with A it just bounces up and down and you get a minimal amount of points, but with B you can probably exploit it and get the ball around the side and up above the blocks. Yeah, exactly, the answer is that B is a bit more erratic of a state-action pair, and there's a possibility that some crazy things could happen if you have this extreme, fast-moving bounce that comes off to the side. So this is just a great example of why determining the Q function, learning a Q function, is not always so obvious: the intuitive answer would be A, but in reality we can actually look at how a policy behaves in each of these two cases.
First, let's look at a policy that was trained to favor situations like A, and we'll play this video forward a little bit. This is a very conservative policy: it's usually just hitting the ball straight back up to the top and learning to solve the game that way. It does break off the colored blocks on the top, but it's doing it in a very conservative manner, so it takes some time. Now let's switch over and look at B. B has this more erratic behavior, and it's actually moving away from the ball just so it can come back toward it and hit it at that extreme angle, so it can break off blocks on the side, get the ball stuck on the top, and then rack up a lot of free points. So it has learned a kind of hack to the system, in some ways. It's a good example of how our human intuition about Q values is not always aligned with what the AI algorithms can discover during reinforcement learning; they can find some very interesting solutions that are not always exactly what we expect. So now, if we know that Q function, then we can directly use it to determine the best action the agent should take in a given scenario; we saw how that was possible before.
The way we would do that is to take all of our possible actions, feed them through the Q function one at a time, evaluate the Q value for each of those actions, and then determine which action results in the greatest Q value. So now, how can we train a network to determine the Q function? There are two different ways we could think about structuring and architecting such a network. We could have the state and the action both be inputs to the network, and then output a single Q value for that state-action pair; that's basically what you see on the left-hand side, a single number for a state and an action input to the network. Or you could imagine something like the right-hand side, which is probably a bit more efficient, because you only feed in a state and the model has to learn the Q value for each of the possible actions. This works if you have a fixed and small set of actions: if you're only going left or right, your output would just be two numbers, a Q value for left and a Q value for right.
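As a minimal sketch of that second architecture (illustrative PyTorch; the state size, layer widths, and three actions are assumptions for the example, and a real Atari agent would use a convolutional network over pixels):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # State in, one Q value out per action (left, stay, right).
    def __init__(self, state_dim=4, num_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32),
            nn.ReLU(),
            nn.Linear(32, num_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.rand(1, 4)              # a dummy state
q_values = q_net(state)               # shape (1, 3): one Q value per action
best_action = q_values.argmax(dim=1)  # the greedy action for this state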
And often it's much more convenient to do it like this, where you output the Q value for all of your actions at once, just so you don't have to run your network multiple times with multiple different action inputs. So how can we actually train this, what's called a deep Q network, a network that learns the Q values? We should first think about the best-case scenario for our agent: how would our agent perform in the ideal scenario? What would happen if the agent were to take
all of the very best actions at each step? This would mean that the target return, the Q value, would be maximized, because it's taking all of the optimal actions, and this can essentially serve as the ground truth for training our agent. So our agent is going to be trained using a target Q value, which is assumed to be the optimal one, the one that maximizes our agent's return. The only thing left is to use that target to train our predicted Q value. To do this, we formulate our target, our expected return if we take all of the best actions: that's the initial reward we start with, plus, after selecting the action that maximizes the expected Q value at the next step, the discounted sum of returns from that point on. We use that as our target, and now all we have to ask is what the network would predict in this case, so that we can learn to optimize it. For our network's prediction this is easy, because at least in the second architecture we formulated the network to directly tell us the predicted Q value for each of the actions. So we have a target and we have a prediction, as desired, and now all we have to do is formulate a mean squared error between the target and the prediction: subtract the two, square the difference, compute the norm, and that's the quantity we want to minimize. We want to minimize the deviation between the predicted Q value coming from our model and the target Q value obtained on the left-hand side. This is known as the Q-loss, and this is exactly how deep Q networks are trained.
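Here is a minimal sketch of that Q-loss for a single transition (illustrative PyTorch, with dummy data; a real deep Q network would also use pieces not discussed here, such as a replay buffer and a separate target network):

import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

# One transition: state s, action a, reward r, next state s'.
state = torch.rand(1, 4)
action = torch.tensor([[0]])          # e.g. "move left"
reward = torch.tensor([1.0])
next_state = torch.rand(1, 4)

# Target: r + gamma * max_a' Q(s', a'), treated as a fixed target (no gradient).
with torch.no_grad():
    target = reward + gamma * q_net(next_state).max(dim=1).values

# Prediction: the Q value of the action that was actually taken.
predicted = q_net(state).gather(1, action).squeeze(1)

loss = nn.functional.mse_loss(predicted, target)  # the Q-loss
optimizer.zero_grad()
loss.backward()
optimizer.step()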
So let's take a second now and summarize that whole process end to end, because that was a lot. Our deep neural network is going to see only states as inputs, and it's going to output the Q value for each of the possible actions. Here we're showing three actions: going left, staying stationary, and going right, so our network will have three outputs, the Q value for each of those actions taken from this input state. Now, the policy, how an agent acts in this state: when you see a new state and have to figure out what action to take, you take all of your outputs, all of the Q values, and you pick the maximum one, because that's the one with the highest expected return, at least as predicted by your network, and you pick the action that corresponds to it. For example, here we have a Q value of 20 for moving left, a Q value of 3 for staying stationary, and a Q value of 0 for moving right, so here we would pick the action of moving left. Finally, we send this action back to the environment, we actually take that left action, the state updates through the game engine, we see a new state, and then this process repeats again: the network receives the new state, computes the Q values for the possible actions, and picks the best one again.
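As a minimal sketch of that loop (illustrative Python; the environment interface here, with reset() and step() returning the next state, a reward, and a done flag, is an assumption in the style of common RL toolkits):

def run_episode(env, q_net, num_actions):
    # Repeatedly: observe a state, pick the action with the highest Q value,
    # send it to the environment, and receive the next state and reward.
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        q_values = q_net(state)                         # one Q value per action
        action = max(range(num_actions), key=lambda a: q_values[a])
        state, reward, done = env.step(action)          # the environment updates
        total_reward += reward
    return total_reward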
Yes? A question about what to do when the action is not one specific value but a range: we'll get to that in a second. Yes? So do we compute, say, Q of s and a one step at a time, or do we have to play the game until it ends? Great question. You actually have to compute it at every step, because you can't play the game until you know what to do at the next step: you see a state, you need to do something, and the thing that you do requires running this Q network so that you can compute the action. In the beginning these networks don't know anything; they're going to output gibberish as their Q values, and that's why you have to train them using the Q-loss. Over time those values will improve, and over time you're going to predict better and better actions. Great, yeah? How do you know when to update to the next state? For example, if you go to the left, the ball might just bounce around for a while, and at some point you need to read the state, but you might end up reading it prematurely and doing something suboptimal because the ball is still up in the air. So the question is about when to update the state, as I understand it. The state update is actually handled by the environment, so you don't have control over that: you send an action, the game updates the screen, that's your new state, and you have to make a new action. You don't have a choice; even doing nothing is an action, being stationary counts. So that is not up to the agent: life goes on even if we don't take actions, and the states are constantly changing regardless of what we choose to do. Okay, so with this framework in mind, we can now see a few cool examples of how this can be applied. A few years ago, DeepMind showed the first really scalable example of training these types of deep Q networks, and how they could be applied to solve not just one type of game but a whole variety of different Atari games.
They did this by providing the state as an input on the left-hand side, and on the right-hand side computing the Q value output for every possible action. You can see here it's no longer three possible actions; this is from a game controller, so there are a bunch of different actions, though still not a huge number. We'll come back to how you can build these networks to handle a really large action space, but for now this is already a really impressive result, because what they showed is that if you test this network on many different games, in fact all of the Atari games they tested on, then for over 50% of the games these deep Q networks were able to surpass human performance. Just with the very basic algorithm we talked about today, using a basic CNN that takes as input only the pixels on the screen of an Atari game, with no human supervision and no ground truth labels, the agents learn how to play these games: they play a game, they update, they reinforce their learning, and they evolve over time. Using just this deep Q learning algorithm, in over 50% of the games they surpassed human performance. The other games were more challenging, and you can see those on the right-hand side, but given how simple and clean this algorithm is, it's still remarkably impressive, to me at least, that this thing works at all and beats humans on even 50% of Atari games.
Okay, so now let's talk briefly about the downsides of Q learning. A couple were already mentioned organically by some of you, but let's talk about them more formally. Number one is complexity: the scenarios where Q learning works in the framework we've just described are scenarios where the action space is small, and, even more importantly, the action space has to be discrete. It has to be a fixed number of action categories: the actions have to be something like left versus right, not how fast to move to the left or how fast to move to the right. A speed is a continuous number, not a discrete category. The other downside is flexibility. In the Q learning case, the Q values are needed to compute your policy: your policy, remember, is the function that takes as input a state and computes an action, and that function requires your Q function, because it's computed by maximizing the Q function over all of the possible actions. That means that you inherently cannot learn a stochastic policy with the framework we've discussed so far; the policies have to be deterministic because you're always picking that maximum. So to address these challenges, let's now move on to the next part of the lecture, which is the second class of reinforcement learning algorithms: policy learning, or policy gradient, algorithms.
In this class of algorithms, we're not going to try to learn the Q function and then use it to compute a policy; we are directly going to try to find the policy function. So pi of s is a function that takes as input the state, and you compute, or rather sample, an action by sampling from the distribution given by the policy. To very briefly reiterate the deep Q networks: there, the state comes in, the Q values for all of the actions come out, and you pick the action which maximizes the Q value. Now, instead of outputting Q values, what we're going to do is directly optimize a policy function. Our policy function is also going to take as input the state, so the input and the architecture so far are the same as before, and pi of s, our output, is going to be the policy distribution. Think of it as a probability distribution which governs the likelihood that we should take each action given this state. So here the outputs give us, for each action, the likelihood that this action is the best one to take given the state we are in. This is actually really nice, because the outputs are now probabilities, which have a few nice properties: the probability that an action is the one we should take tells us very important things about our state. So if we were to predict these probabilities for a given state, let's imagine we see something like this: 90% that the left action is the best possible action we could take, 10% that doing nothing is the best thing, and 0% that going right is the best thing. To compute our action from this, we would not take the maximum anymore; what we're going to do now is sample from this probability distribution.
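As a minimal sketch of such a policy network (illustrative PyTorch; the state size and layer widths are assumptions), note that the action is sampled from the output distribution rather than taken as an argmax:

import torch
import torch.nn as nn

policy_net = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Linear(32, 3),   # one logit per action: left, stay, right
)

state = torch.rand(1, 4)                          # a dummy state
probs = torch.softmax(policy_net(state), dim=1)   # e.g. roughly [0.9, 0.1, 0.0]
dist = torch.distributions.Categorical(probs=probs)
action = dist.sample()                            # usually "left", sometimes not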
So even with this same probability distribution, if you ran it a hundred times, then in expectation about 10% of the samples should not be the maximum: in about 10% of them the agent should stay still. Note again that this is a probability distribution, so all of those outputs have to sum to one. I want to spend a moment here, and maybe this is a question for all of you: what are some advantages, especially on the flexibility side, of formulating the architecture in this kind of way? Does anyone see any concrete advantages? Yes: we could potentially learn some amazing new technique that maybe we couldn't think of ourselves. Right, the idea is that it can learn things we couldn't think of ourselves. This is true, and the reason is that the sampling of these actions is now stochastic: you're not always picking what the network thinks is the best action, so you're going to get more exploration of the environment, because you're constantly sampling. Even if your answer is 90%, 10%, 0%, you'll still pick the non-maximum answer about 10% of the time, and that allows you to explore the environment a lot more. Yes? A question about non-zero-sum games and policy gradients: you could actually use both types of algorithms in non-zero-sum games, but the differences we're seeing so far are much more focused on the actions and the sampling of those actions than on the environment itself, at least so far. Yes?
There's a fundamental question here: when you say Q of s and a is an expectation of the return given the state today, how can you have an expectation unless you assume some distribution, some form of the future? How can you have a well-defined Q value or policy without specifying what you mean by an expectation over the future, when there are so many possible futures? Right, so this is where Q learning and policy learning differ greatly. On the Q learning side you do have an expectation, because you actually roll out the whole game until the end of the game. On the policy learning side there is no such expectation, except in the sense that you now have a distribution over the different actions you can take, and that gives you an expectation as well, but that's separate from the learning side. Okay, so getting back to the advantages of policy learning over Q learning, and digging into this a bit deeper: one advantage is in the context of discrete versus continuous action spaces, which was mentioned earlier in the lecture.
So what does a discrete action space mean? It means we have a finite set of possible actions that can be taken at any point in time. For example, we've been looking at this case of Breakout, and the action space is indeed discrete: you can go left, go right, or stay stationary. But let's assume I reformulate the actions of this game a little bit and make them continuous: now, instead of left, right, or nothing, the action is a speed at which I should move along the horizontal axis. If we were to plot the probability of any possible speed being the best speed, the one that maximizes the reward of this agent, it might look something like this: a continuous probability distribution, as opposed to the discrete categorical distribution we saw previously. So let's dig into that key idea and think about how we could define these types of continuous policy learning networks as well: how can you have your model output a distribution?
We can do this with the policy gradient method: instead of predicting a probability for every possible action you could take, let's instead learn the parameters of a distribution which defines that probability function. Let's step through that a bit more. In yesterday's lecture we saw how latent spaces could be predicted by a neural network; those were continuous latent spaces over Gaussian random variables, and we were not learning the PDF over every possible value of the random variable, we were just learning the parameters, the mus and the sigmas, that define those Gaussians. So now let's do something similar: instead of outputting the probability of every possible action, which in a continuous action space is an infinite number of possible actions, let's learn a mu and a sigma that define the probability distribution for us. For this image on the left-hand side, for example, we can see that the paddle needs to move to the left, so if we plot the distribution predicted by this neural network, we see that the density lies on the left-hand side of the number line. It's telling us not only that the paddle should move left, but also something about how quickly, with what speed or urgency, it needs to move to the left. We can not only inspect this distribution but also sample from it to get a concrete action for the paddle to execute in this state. If we sample from this Gaussian with mean minus one and standard deviation 0.5, we might get a value like minus 0.8. It's just a random sample; every time you sample from this distribution you'll get something different. And even though this is a continuous extension of the same ideas we saw previously, it is still a proper probability distribution: if we take the integral over this probability output of the neural network, it integrates to one, so this is a valid probability distribution.
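As a minimal sketch of a continuous policy head (illustrative PyTorch; the state size and the single action dimension are assumptions), the network outputs a mean and a standard deviation, and the action is a sample from that Gaussian:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU())
        self.mu_head = nn.Linear(32, 1)       # mean of the action distribution
        self.log_std_head = nn.Linear(32, 1)  # log std, so the std stays positive

    def forward(self, state):
        h = self.body(state)
        mu = self.mu_head(h)
        std = torch.exp(self.log_std_head(h))
        return torch.distributions.Normal(mu, std)

policy = GaussianPolicy()
dist = policy(torch.rand(1, 4))   # a dummy state
action = dist.sample()            # e.g. about -0.8 if mu is near -1 and std near 0.5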
Okay, so let's now turn to how policy gradients could be applied in a concrete example. Let's revisit the RL terminology loop we looked at earlier in the lecture and think about how we could train this use case: let's imagine we want to train a self-driving car using the policy gradient algorithm. The agent here is the car, the vehicle itself. The environment is the world in which the vehicle drives. The states are all of the sensory information the vehicle sees, coming from the cameras, the lidars, the radars, and so on. The actions are the steering wheel angles; we'll think of it simply in terms of steering wheel angle and not worry about speed for now. And for the reward, let's use a very simple one: drive as far as you can without crashing. We're not optimizing for obvious things like comfort or safety; let's just say drive as far as you can without crashing. Implicitly you will care about safety, because you can't drive far without being safe to some degree. Okay, so how would we train a policy gradient model in this context of self-driving cars? Let's start by initializing the agent on the road: we'll put it in the middle of the road and start the system off. We'll now run the policy forward; the policy here is a neural network. This is called a rollout: we'll execute a rollout of this agent through its environment.
We can see the trajectory of the vehicle over the course of this rollout. This policy has not been trained yet, so it didn't do very well: it veered off and crashed into the side of the road. What we're going to do now is take that rollout and record all of the state-action pairs: at every state, we record what that state was and what action our policy took at that time step. Then the optimization process in policy gradients does something very simple: it assumes that I should decrease the probability of everything I did in the second half of my rollout, because I came close to that crash, and increase the probability of everything I did in the first half. Is this really a good thing to do? It's probably not optimal, because there could definitely be some very bad actions you took at the beginning of your rollout that caused you to get into a bad state and caused you to crash, so you shouldn't necessarily increase those early actions. But in reality, remember, you don't have any ground truth labels for any of this, so this is a reasonable heuristic, and in expectation it actually works out pretty well. Then you repeat this process again: you decrease the probability of all of the actions leading up to the crash, you increase the probability of everything that came at the beginning, and now the policy, the agent, is able to go a bit farther. You do this again and keep repeating the process over and over, and you'll eventually see that the agent performs better and better actions that allow it to accumulate more and more reward over time, until eventually it starts to follow the lanes without crashing. That's exactly the policy gradient algorithm. I've left out the exact loss function, but we'll talk about that in one second; the remaining question is how you can actually update the policy based on the rollouts you are observing.
Based on the state-action pairs observed by the model over the course of the rollout, we're going to increase the probability of everything that came at the beginning and decrease the probability of everything that came at the end. So the last critical question is how to do this in practice: what is the loss function for that optimization, decreasing the probability of everything close to the crash and increasing the probability of everything that came at the beginning? The loss consists of two terms. The first is a log-likelihood term: the log probability of the action that was chosen in this particular state. The second is that we multiply it by the total discounted return R of t. Now, assume we get a lot of reward for an action that also had a very high log-likelihood: it had a very high probability and we got a lot of return. That's great, because when you multiply those two numbers together, you're going to optimize that action to happen even more often the next time. But now assume you took an action that resulted in a very negative return: when those terms are multiplied against each other, you will instead decrease the probability of that action happening again in the future. When you plug this loss into the gradient descent algorithm that we've been using for training all of the neural networks in this course, we get the policy gradient: it's the gradient of that policy term, highlighted in blue, and that's exactly how this algorithm got its name.
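Here is a minimal sketch of that policy gradient update for one rollout of a discrete policy (illustrative PyTorch with dummy rollout data; the network sizes are assumptions):

import torch
import torch.nn as nn

gamma = 0.99
policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

states = torch.rand(5, 4)                  # five steps of a dummy rollout
actions = torch.randint(0, 3, (5,))        # the actions that were actually taken
rewards = [1.0, 1.0, 1.0, 0.0, -10.0]      # e.g. a crash at the very end

# Discounted return R_t from each time step onward.
returns, running = [], 0.0
for r in reversed(rewards):
    running = r + gamma * running
    returns.append(running)
returns = torch.tensor(list(reversed(returns)))

# Loss: -sum_t log pi(a_t | s_t) * R_t, so high-return actions become more likely
# and actions followed by negative returns become less likely.
log_probs = torch.distributions.Categorical(logits=policy_net(states)).log_prob(actions)
loss = -(log_probs * returns).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()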
Now I want to talk a little bit about how we can extend this approach to perform reinforcement learning in real life, specifically in the context of the self-driving car use case we just went through. If we wanted to deploy this exact algorithm as we've described it in the real world, which of the steps in this training algorithm on the left-hand side do you think would break down in real life? Yes, go ahead: recording all the states. So recording the states is actually okay; why do you think that's a problem? Because in the car example the path keeps continuing, so the size, the magnitude, of the state that you have to record can keep increasing. Right, so the point is that if the car gets really good over time, then recording the states gets harder and harder because you have to store them all in memory. That's true, but it would also be a problem in simulation; you'd have the same problem there, and there are tricks to get around it, for example not storing all of the states and only optimizing over a subset of them. Yes: could we re-run a simulation from the original states and continue driving from there? Yes, reproducibility is a problem if you're in simulation, and if you're in the real world it's even harder to reproduce the rollouts, that's true. Exactly, so number two is the real problem in the real world: doing step number two involves crashing the car, by definition, and that's only for one step of the algorithm.
That's a lot of bad things; you don't want to train your car to learn how to drive by collecting a lot of data of crashes. So even though in theory all of this sounds really great, in practice it doesn't work so well for real-world autonomy situations. One really cool result that we created here at MIT is a simulator which is hyper-photorealistic: these are simulated scenes of an autonomous vehicle driving through environments built from real data of the real world, in which it is safe to crash. You may even recognize Mass Ave in some of these, right here at MIT. The beautiful thing is that if you have a simulation environment in which it is safe to do step number two, to exhibit this crashing behavior, then this type of problem is really well suited to it. The only problem then becomes the photorealism: you need your simulation to be faithful to reality. But in these types of hyper-photorealistic simulators, like this one we created here at MIT, the agent can indeed be placed inside the simulator and trained using the exact same model and the exact same loss function that we have just seen in this class. And then, when you take those policies trained in simulation, you can put them on a full-size car in the real world. So now this is running in the real world, and the car can drive through roads it has never seen before. It's a very powerful idea, going straight from sim to real, and this was actually the first time that an end-to-end neural network was trained in simulation and could drive in the real world on a brand new road, and that came right here from MIT. Okay, so now we've covered the fundamentals behind value learning and policy gradient optimization for reinforcement learning. What are some more exciting applications and advances? For that, I want to turn very quickly to the game of Go, which has gotten a lot of interest over the past few years.
The game of Go is a setting where RL agents learn to execute actions in a strategic board game and are tested against human champions. What was achieved several years ago was a really exciting result, because the game of Go has a massive number of possible states. Let me start with the objective: it's a two-player game with white and black pieces, and the objective is to occupy more territory on the board than your opponent. Even though the environment looks a lot simpler than our real-world environment, it's just a two-dimensional grid of 19 by 19 positions, Go is extremely complex despite all of that, and that's because of the huge number of possible board positions. In fact, on a full-size board there are more legal board positions than there are atoms in the universe. So the objective here is to see if you can train an AI that can master the game of Go, put it up against human players, and see if it can beat even the gold standard of humans. So how can we do this? A couple of years ago an approach was presented, and the way it works is that you develop a reinforcement learning policy that, at its core, is not at all that different from exactly the techniques you've learned about today. First, you start by training a neural network that watches games of humans playing Go. This network is supervised, not a reinforcement learning model: you record a bunch of humans playing the game of Go, you record the states they were in and the actions they took, and you train a supervised model to mimic those actions based on those states. So far there is no reinforcement learning; you're just building an imitation system, essentially.
This will never surpass humans, but you can use it at the beginning just to learn some of the human strategies and techniques, just to start off training. The next step is to take those pre-trained models that were learned by watching humans and now play them against each other in a reinforcement learning fashion. This is the idea of self-play: you take two models that have some very basic idea of how to play Go, you put them against each other, and they play the game of Go against each other. The one that gets the reward is the one that wins the game: you increase the probability of all of the actions taken by the agent that won, and you decrease the probability of all of the actions taken by the agent that lost, regardless of where in the game those actions occurred. All of the loser's actions are decreased, all of the winner's actions are increased. It's a very simple algorithm.
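As a minimal sketch of that self-play loop (illustrative Python; the game interface and the sample_action and update_policy helpers are hypothetical placeholders, not a real library):

def self_play_update(policy_net, game, sample_action, update_policy):
    # Two copies of the same policy play one game against each other;
    # the winner's actions are reinforced with +1, the loser's with -1.
    trajectories = {+1: [], -1: []}          # player id -> list of (state, action)
    state, player = game.reset(), +1
    while not game.is_over():
        action = sample_action(policy_net, state)
        trajectories[player].append((state, action))
        state = game.step(action)
        player = -player                      # players alternate turns
    winner = game.winner()                    # +1 or -1
    for player_id, pairs in trajectories.items():
        ret = 1.0 if player_id == winner else -1.0
        update_policy(policy_net, pairs, ret)  # policy gradient step with return ret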
In practice you run it for millions and billions of steps, and eventually you achieve superhuman performance. Along the way you can also get some intuition about the game by training the network to output not only state-in, action-out, but also state-in, value-out: how good of a board position you are in, based on the state you currently see. A more recent extension of this explored what would happen if you abandoned the imitation learning part at the very beginning, the part where you watched a bunch of human players to bootstrap the beginnings of the model from human experts. The question is: can you start from a completely random network, still keep all of the self-play and the reinforcement learning, but begin from a randomly initialized model? What was shown is that these scenarios are also capable of achieving superhuman performance. They obviously take a bit longer to train, but it's possible to train them entirely from scratch, with absolutely no knowledge of the game, such that they overtake human performance, and in fact they even overtake the original models that were bootstrapped from human play. So the human imitation at the start helps accelerate learning in the beginning, but it's actually quite limiting, because often these models can figure out new strategies for these complex games that we as humans have not discovered yet.
You can see that same idea now being deployed on a variety of different types of games, from Go to chess and many others as well; it's a very generalizable strategy. The remarkable thing is that the foundations, the way those models are trained, is nothing more than policy gradient optimization, exactly in the way we saw today: no labels, just increase the probability of everything that came with a win and decrease the probability of everything that came with a loss. So with that, I'll briefly summarize the learnings from today. We started by laying the foundations, defining all of the terminology, and thinking about reinforcement learning as a completely new paradigm compared to supervised and unsupervised learning. Then we covered two different ways we can learn these policies: we covered Q learning, where the model outputs Q values for each of the possible actions, we talked about its disadvantages, and we saw how policy learning can overcome some of the problems that come with Q learning and lets you handle continuous action spaces and stochastic policies, so a lot of exciting advances can come from policy gradients in that field as well. I'll pause there. The next lecture will be Ava, who is going to share new frontiers of deep learning and all of the new advances, the more recent couple of years of what has been going on in this field. As always, we'll take a couple-minute break, switch speakers, and thank you.