[students murmuring] - Okay, so good afternoon everyone, let's get started. So hi, so for those of you who I haven't met yet, my name is Serena Yeung and I'm the third and final instructor for this class, and I'm also a PhD student in Fei-Fei's group. Okay, so today we're going to talk about backpropagation and neural networks, and so now we're really starting to get to some of the core material in this class. Before we begin, let's see, oh. So a few administrative details, so assignment one is due Thursday, April 20th, so a reminder, we shifted
the date back by a little bit and it's going to be due 11:59 p.m. on Canvas. So you should start thinking about your projects, there are TA specialties listed on the Piazza website so if you have questions about a specific project topic you're thinking about, you can go and try and find the TAs that might be most relevant. And then also for Google Cloud, so all students are going to get $100 in credits to use for Google Cloud for their assignments and project, so you should be receiving an email for that this week, I think.
A lot of you may have already, and then for those of you who haven't, they're going to come, should be by the end of this week. Okay so where we are, so far we've talked about how to define a classifier using a function f, parameterized by weights W, and this function f is going to take data x as input, and output a vector of scores for each of the classes that you want to classify. And so from here we can also define a loss function, so for example, the SVM loss function that we've talked about
which basically quantifies how happy or unhappy we are with the scores that we've produced, right, and then we can use that to define a total loss term. So L here is a combination of this data term and a regularization term that expresses how simple our model is; we have a preference for simpler models, for better generalization. And so now we want to find the parameters W that correspond to our lowest loss, right? We want to minimize the loss function, and so to do that we want to find the gradient of L with
respect to W. So last lecture we talked about how we can do this using optimization, and we're going to iteratively take steps in the direction of steepest descent, which is the negative of the gradient, in order to walk down this loss landscape and get to the point of lowest loss, right? And we saw how this gradient descent can basically take this trajectory, looking like this image on the right, getting to the bottom of your loss landscape. Oh! Okay, and so we also talked about different ways for computing a gradient, right? We can compute this numerically
using finite difference approximation, which is slow and approximate, but at the same time it's really easy to write out; you can always get the gradient this way. We also talked about the analytic gradient, and computing this is fast and exact once you've gotten the expression for it, but at the same time you have to do all the math and the calculus to derive it, so it's also easy to make mistakes, right? So in practice what we want to do is derive the analytic gradient and use that, but at the same time check our implementation using the numerical gradient to make sure that we've gotten all of our math right.
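As a rough sketch of what that gradient check looks like in code, here's a minimal centered finite-difference version, assuming a NumPy array of parameters and a scalar-valued loss; the names loss, X, y, and analytic_grad in the usage comment are hypothetical stand-ins:

```python
import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # Centered finite differences: nudge each entry of x by +/- h and
    # measure how the scalar loss f(x) changes.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fxph = f(x)                        # f(x + h)
        x[ix] = old - h
        fxmh = f(x)                        # f(x - h)
        x[ix] = old                        # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# Usage sketch: compare against the analytic gradient you derived.
# num_grad = eval_numerical_gradient(lambda W: loss(X, y, W), W)
# print(np.max(np.abs(num_grad - analytic_grad)))
```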
So today we're going to talk about how to compute the analytic gradient for arbitrarily complex functions, using a framework that I'm going to call computational graphs. And basically what a computational graph is, is that we can use this kind of graph to represent any function, where the nodes of the graph are the steps of computation that we go through. So for example, for the linear classifier that we've talked about, the inputs here are x and W, right, and then this multiplication node represents the matrix multiplication of the parameters W with our data x, outputting our vector of scores. And then we have another computational node which represents our hinge loss, right, computing our data loss term, Li. And we also have this regularization term at the bottom right, so this node computes our regularization term, and then our total loss L here at the end is the sum of the regularization term and the data term. And the
advantage is that once we can express a function using a computational graph, then we can use a technique that we call backpropagation which is going to recursively use the chain rule in order to compute the gradient with respect to every variable in the computational graph, and so we're going to see how this is done. And this becomes very useful when we start working with really complex functions, so for example, convolutional neural networks that we're going to talk about later in this class. We have here the input image at the top, we have our loss at
the bottom, and the input has to go through many layers of transformations in order to get all the way down to the loss function. And this can get even crazier with things like a Neural Turing Machine, which is another kind of deep learning model, and in this case you can see that the computational graph for this is really insane, especially once we end up unrolling this over time. It's basically completely impractical to derive the gradients by hand for any of these intermediate variables. Okay, so how does backpropagation work?
So we're going to start off with a simple example. Again, our goal is that we have a function; in this case, f of x, y, z equals x plus y, the quantity times z, and we want to find the gradients of the output of the function with respect to any of the variables. So the first step, always, is we want to take our function f and represent it using a computational graph. So here our computational graph is on the right, and you can see that first we have the
plus node, so x plus y, and then we have this multiplication node, right, for the second computation that we're doing. And then, now we're going to do a forward pass of this network, so given the values of the variables that we have, so here, x equals negative two, y equals five and z equals negative four, I'm going to fill these all in in our computational graph, and then here we can compute an intermediate value, so x plus y gives three, and then finally we pass it through again, through the last node, the multiplication, to get
our final value of f equals negative 12. So here we want to give every intermediate variable a name. I've called the intermediate variable after the plus node q, so we have q equals x plus y, and then f equals q times z, using this intermediate node. And I've also written out here the gradients of q with respect to x and y, which are just one because of the addition, and then the gradients of f with respect to q and z, which are z and q respectively because of the multiplication rule. And so what
we want to find, is we want to find the gradients of f with respect to x, y and z. So what backprop is, it's a recursive application of the chain rule, so we're going to start at the back, the very end of the computational graph, and then we're going to work our way backwards and compute all the gradients along the way. So here if we start at the very end, right, we want to compute the gradient of the output with respect to the last variable, which is just f. And so this gradient is just one,
it's trivial. So now, moving backwards, we want the gradient with respect to z, right, and we know that df over dz is equal to q. So the value of q is just three, and so we have here, df over dz equals three. And so next if we want to do df over dq, what is the value of that? What is df over dq? So we have here, df over dq is equal to z, right, and the value of z is negative four. So here we have df over dq is equal to negative four. Okay, so
now continuing to move backwards through the graph, we want to find df over dy, right, but here in this case y is not connected directly to f, right? It's connected through the intermediate node q, and so the way we're going to do this is we can leverage the chain rule, which says that df over dy can be written as df over dq times dq over dy, and so the intuition of this is that in order to find the effect of y on f, this is actually equivalent
to taking the effect of q on f, which we already know, right? df over dq is equal to negative four, and we compound it with the effect of y on q, dq over dy. So what's dq over dy equal to in this case? - [Student] One. - One, right. Exactly. So dq over dy is equal to one, which means, you know, if we change y by a little bit, q is going to change by approximately the same amount, right, and so what this is doing is
saying, well if I change y by a little bit, the effect of y on q is going to be one, and then the effect of q on f is going to be approximately a factor of negative four, right? So then we multiply these together and we get that the effect of y on f is going to be negative four. Okay, so now if we want to do the same thing for the gradient with respect to x, right, we can do the, we can follow the same procedure, and so what is this going to be? [students
speaking away from microphone] - I heard "the same." Yeah, exactly. So in this case we again apply the chain rule, right? We know the effect of q on f is negative four, and here again, since we have the same addition node, dq over dx is equal to one; so we have negative four times one, and the gradient with respect to x is going to be negative four.
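And just to make that flow concrete, here's the whole example as a few lines of code, a minimal sketch using the values from the slide:

```python
# Forward pass for f(x, y, z) = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: start from df/df = 1 and apply the chain rule node by node.
df = 1.0
dz = q * df          # multiply gate: local gradient wrt z is q  ->  3
dq = z * df          # multiply gate: local gradient wrt q is z  -> -4
dx = 1.0 * dq        # add gate: local gradients are 1           -> -4
dy = 1.0 * dq        #                                           -> -4
```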
Okay, so what we're doing in backprop is, we have all of these nodes in our computational graph, but each node is only aware of its immediate surroundings, right? So at each node we have the local inputs that are connected to it, the values flowing into the node, and then we also have the output that is directly outputted from this node. So here our local inputs are x and y, and the output is z. And at this node we also know the local gradient, right? We can compute the gradient of z with respect to x, and the gradient of z with respect to y, and these are usually really simple operations, right? Each node is
going to be something like the addition or the multiplication that we had in that earlier example, which is something where we can just write down the gradient, and we don't have to go through very complex calculus in order to find it. - [Student] Can you go back and explain how what you did in the last slide is different from finding the gradient using just normal calculus? - Yeah, so basically if we go back, hold on, let me... So if we go back here, we could exactly write out and find all of these using
just calculus, so we could say, you know, we want df over dx, right, and we can probably expand out this expression and see that it's just going to be z, but we can do this for, in this case, because it's simple, but we'll see examples later on where once this becomes a really complicated expression, you don't want to have to use calculus to derive, right, the gradient for something, for a super-complicated expression, and instead, if you use this formalism and you break it down into these computational nodes, then you can only ever work with gradients
of very simple computations, right, at the level of, you know, additions, multiplications, exponentials, things as simple as you want them, and then you just use the chain rule to multiply all these together, and get your, the value of your gradient without having to ever derive the entire expression. Does that make sense? [student murmuring] Okay, so we'll see an example of this later. And so, was there another question, yeah? [student speaking away from microphone] - [Student] What's the negative four next to the z representing? - Negative, okay yeah, so the negative four, these were the, the
green values on top were all the values of the function as we passed it forward through the computational graph, right? So we said up here that x is equal to negative two, y is equal to five, and z equals negative four, so we filled in all of these values, and then we just wanted to compute the value of this function. Right, so we said this value of q is going to be x plus y, it's going to be negative two plus five, it is going to be three, and we have z is equal to negative
four so we fill that in here, and then we multiplied q and z together, negative four times three in order to get the final value of f, right? And then the red values underneath were as we were filling in the gradients as we were working backwards. Okay. Okay, so right, so we said that, you know, we have these local, these nodes, and each node basically gets its local inputs coming in and the output that it sees directly passing on to the next node, and we also have these local gradients that we computed, right, the gradient
of the immediate output of the node with respect to the inputs coming in. And so what happens during backprop is we start from the back of the graph, right, and we work our way from the end all the way back to the beginning, and when we reach each node, we have the upstream gradient coming back, right, with respect to the immediate output of the node. So by the time we reach this node in backprop, we've already computed the gradient of our final loss L with respect to z, right?
And so now what we want to find next is we want to find the gradients with respect to just before the node, to the values of x and y. And so as we saw earlier, we do this using the chain rule, right, we have from the chain rule, that the gradient of this loss function with respect to x is going to be the gradient with respect to z times, compounded by this gradient, local gradient of z with respect to x. Right, so in the chain rule we always take this upstream gradient coming down, and we
multiply it by the local gradient in order to get the gradient with respect to the input. - [Student] So, sorry, is it different because this would never give you a general symbolic formula for the gradient? It only works with instantaneous values, where you're passing a little constant value around instead of something symbolic? - So the question is whether this only works because we're working with the current values of the function, and so yes, it works given the current values of the function that we plug in, but
we can write an expression for this, still in terms of the variables, right? So we'll see that gradient of L with respect to z is going to be some expression, and gradient of z with respect to x is going to be another expression, right? But we plug in these, we plug in the values of these numbers at the time in order to get the value of the gradient with respect to x. So what you could do is you could recursively plug in all of these expressions, right? Gradient with respect, z with respect to x is
going to be a simple expression, right? So in this case, if we have a multiplication node, the gradient of z with respect to x is just going to be y, right, we know that. But the gradient of L with respect to z comes from what is probably a complex part of the graph in itself, right, so here's where we want to just use the numerical value. So as you said, this is basically going to be just a number coming down, a value, and then we just multiply it with the expression that we have
for the local gradient. And I think this will be more clear when we go through a more complicated example in a few slides. Okay, so now the gradient of L with respect to y, we have exactly the same idea, where again, we use the chain rule, we have gradient of L with respect to z, times the gradient of z with respect to y, right, we use the chain rule, multiply these together and get our gradient. And then once we have these, we'll pass these on to the node directly before, or connected to this node. And
so the main thing to take away from this is that at each node we just want to have our local gradient that we compute, just keep track of this, and then during backprop as we're receiving, you know, numerical values of gradients coming from upstream, we just take what that is, multiply it by the local gradient, and then this is what we then send back to the connected nodes, the next nodes going backwards, without having to care about anything else besides these immediate surroundings. So now we're going to go through another example, this time a little
bit more complex, so we can see more why backprop is so useful. So in this case, our function is f of w and x, which is equal to one over one plus e to the negative of w-zero times x-zero plus w-one x-one, plus w-two, right? So again, the first step always is we want to write this out as a computational graph. So in this case we can see that in this graph, right, first we multiply together the w and x terms that we have, w-zero with x-zero, w-one with x-one, and w-two, then we add all
of these together, right? Then we do, scale it by negative one, we take the exponential, we add one, and then finally we do one over this whole term. And then here I've also filled in values of these, so let's say given values that we have for the ws and xs, right, we can make a forward pass and basically compute what the value is at every stage of the computation. And here I've also written down here at the bottom the values, the expressions for some derivatives that are going to be helpful later on, so same as
we did before with the simple example. Okay, so now then we're going to do backprop through here, right, so again, we're going to start at the very end of the graph, and so here again the gradient of the output with respect to the last variable is just one, it's just trivial, and so now moving backwards one step, right? So what's the gradient with respect to the input just before one over x? Well, so in this case, we know that the upstream gradient that we have coming down, right, is this red one, right? This is the
upstream gradient that we have flowing down, and then now we need to find the local gradient, right, and the local gradient of this node, this node is one over x, right, so we have f of x equals one over x here in red, and the local gradient of this df over dx is equal to negative one over x-squared, right? So here we're going to take negative one over x-squared, and plug in the value of x that we had during this forward pass, 1.37, and so our final gradient with respect to this variable is going to
be negative one over 1.37 squared times one equals negative 0.53. So moving back to the next node, we're going to go through the exact same process, right? So here, the gradient flowing from upstream is going to be negative 0.53, right, and here the local gradient, the node here is a plus one, and so now looking at our reference of derivatives at the bottom, we have that for a constant plus x, the local gradient is just one, right? So what's the gradient with respect to this variable using the chain rule? So it's going to be the
upstream gradient of negative 0.53 times our local gradient of one, which is equal to negative 0.53. So let's keep moving backwards one more step. So here we have the exponential, right? So what's the upstream gradient coming down? [student speaking away from microphone] Right, so the upstream gradient is negative 0.53, what's the local gradient here? It's going to be the local gradient of e to the x, right? This is an exponential node, and so our chain rule is going to tell us that our gradient is going to be negative 0.53 times e to the power of
x, which in this case is negative one, from our forward pass, and this is going to give us our final gradient of negative 0.2. Okay, so now one more node here, the next node is, that we reach, is going to be a multiplication with negative one, right? So here, what's the upstream gradient coming down? - [Student] Negative 0.2? - [Serena] Negative 0.2, right, and what's going to be the local gradient, can look at the reference sheet. It's going to be, what was it? I think I heard it. - [Student] That's minus one? - It's going
to be minus one, exactly, yeah, because for a node f of x equals a times x, the local gradient df over dx is a, right, and the value of a that we scaled x by is negative one here. So we have here that the gradient is negative one times negative 0.2, and so our gradient is 0.2. Okay, so now we've reached an addition node, and in this case we have these two branches both connected to it, right? So what's the upstream gradient here? It's going to be 0.2, right, just as everywhere else, and here now the gradient with respect
to each of these branches, it's an addition, right, and we saw before in our simple example that when we have an addition node, the local gradient with respect to each of the inputs to the addition is just going to be one, right? So here, looking at our top branch, our local gradient is going to be one times the upstream gradient of 0.2, which gives a total gradient of 0.2, right? And for our bottom branch we'd do the same thing: our upstream gradient is 0.2, our local gradient is one
again, and the total gradient is 0.2. So is everything clear about this? Okay. So we have a few more gradients to fill out, so moving back now we've reached w-zero and x-zero, and so here we have a multiplication node, right, so we saw the multiplication node from before, it just, the gradient with respect to one of the inputs just is the value of the other input. And so in this case, what's the gradient with respect to w-zero? - [Student] Minus 0.2. - Minus, I'm hearing minus 0.2, exactly. Yeah, so with respect to w-zero, we have
our upstream gradient, 0.2, times the value of the other input, x-zero, which is negative one, and we get negative 0.2. And we can do the same thing for our gradient with respect to x-zero: it's going to be 0.2 times the value of w-zero, which is two, and we get 0.4. Okay, so here we've filled out most of these gradients, and there was the question earlier about why this is simpler than just deriving the analytic gradient, the expression with respect to any of these variables, right? And so you can see
here, all we ever dealt with were expressions for local gradients that we had to write out. Once we had these expressions for the local gradients, all we did was plug in the values for each of these that we have, and use the chain rule to numerically multiply this all the way backwards and get the gradients with respect to all of the variables. And we can also fill out the gradients with respect to w-one and x-one here in exactly the same way.
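To see the whole thing end to end, here's a sketch of this example in code, one line per node. The transcript confirms w-zero and x-zero; the remaining input values are reconstructed to match the slide's forward values, and the gradients in the comments are rounded:

```python
import math

# Forward pass, node by node.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0
a = w0 * x0                  # -2.0
b = w1 * x1                  #  6.0
c = a + b + w2               #  1.0
d = -1.0 * c                 # -1.0
e = math.exp(d)              #  0.37
g = 1.0 + e                  #  1.37
f = 1.0 / g                  #  0.73

# Backward pass: at every node, upstream gradient times local gradient.
df = 1.0
dg = (-1.0 / g ** 2) * df    # 1/x node    -> -0.53
de = 1.0 * dg                # +1 node     -> -0.53
dd = math.exp(d) * de        # exp node    -> -0.20
dc = -1.0 * dd               # *(-1) node  ->  0.20
da, db, dw2 = dc, dc, dc     # add gate distributes the gradient
dw0, dx0 = x0 * da, w0 * da  # multiply gate -> -0.20, 0.39 (~0.4)
dw1, dx1 = x1 * db, w1 * db  # multiply gate -> -0.39, -0.59
```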
Okay, and one thing that I want to note is that when we're creating these computational graphs, we can define the computational nodes at any granularity we want to. In this case we broke it down into the absolute simplest nodes that we could, additions and multiplications; it basically can't get any simpler than that. But in practice we can group some of these nodes together into more complex nodes if we want, as long as we're able to write down the local gradient for that node, right? And so as an example, if we look at a sigmoid function, so
I've defined the sigmoid function in the upper-right here, of a sigmoid of x is equal to one over one plus e to the negative x, and this is something that's a really common function that you'll see a lot in the rest of this class, and we can compute the gradient for this, we can write it out, and if we do actually go through the math of doing this analytically, we can get a nice expression at the end. So in this case it's equal to one minus sigma of x, so the output of this function times
sigma of x, right? And so in cases where we have something like this, we could just take all the computations that we had in our graph that made up this sigmoid, and we could just replace it with one big node that's a sigmoid, right, because we do know the local gradient for this gate, it's this expression, d of the sigmoid of x over dx, right? So basically the important thing here is that you can, group any nodes that you want to make any sorts of a little bit more complex nodes, as long as you can
write down the local gradient for this. And so all this is is basically a trade-off between, you know, how much math that you want to do in order to get a more, kind of concise and simpler graph, right, versus how simple you want each of your gradients to be, right? And then you can write out as complex of a computational graph that you want. Yeah, question? - [Student] This is a question on the graph itself, is there a reason that the first two multiplication nodes and the weights are not connected to a single addition node?
- So they could also be connected into a single addition node. The question was, is there a reason why the w-zero times x-zero and w-one times x-one terms and w-two aren't all connected into one addition node together. And the answer is that you can do that if you want, and in practice maybe you would actually want to, because that's still a very simple node, right? In this case I just wrote this out as simply as possible, where each node only had up to two inputs, but yeah, you could definitely
do that. Any other questions about this? Okay, so the one thing that I really like about thinking about this like a computational graph is that I feel very comforted, right, like anytime I have to take a gradient, find gradients of something, even if the expression that I want to compute gradients of is really hairy, and really scary, you know, whether it's something like this sigmoid or something worse, I know that, you know, I could derive this if I want to, but really, if I just sit down and write it out in terms of a computational
graph, I can go as simple as I need to to always be able to apply backprop and the chain rule, and be able to compute all the gradients that I need. And so this is something that you guys should think about when you're doing your homeworks, as basically, you know, anytime you're having trouble finding gradients of something just think about it as a computational graph, break it down into all of these parts, and then use the chain rule. Okay, and so, you know, so we talked about how we could group these set of nodes together
into a sigmoid gate, and just to confirm, like, that this is actually exactly equivalent, we can plug this in, right? So we have that our input here to the sigmoid gate is going to be one, in green, and then we have that the output is going to be here, 0.73, right, and this'll work out if you plug it in to the sigmoid function. And so now if we want to do, if we want to take the gradient, and we want to treat this entire sigmoid as one node, now what we should do is we need
to use this local gradient that we've derived up here, right? One minus sigmoid of x, times the sigmoid of x. So if we plug this in, and here we know that the value of sigmoid of x was 0.73, we'll see that the value of this local gradient is equal to 0.2, right; we multiply it by the upstream gradient, which is one, and we're going to get out exactly the same value of the gradient with respect to the input of the
sigmoid gate, as if we had broken it down into all of the smaller computations.
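As a sketch, a fused sigmoid gate in code might look like this, caching its output so the backward pass can use the shortcut:

```python
import math

class SigmoidGate:
    # One fused node for sigma(x) = 1 / (1 + e^(-x)).
    def forward(self, x):
        self.out = 1.0 / (1.0 + math.exp(-x))  # cache for the backward pass
        return self.out

    def backward(self, dout):
        # local gradient (1 - sigma(x)) * sigma(x), chained with upstream dout
        return (1.0 - self.out) * self.out * dout

gate = SigmoidGate()
print(gate.forward(1.0))    # 0.73, as on the slide
print(gate.backward(1.0))   # ~0.20, the same gradient as the unfused graph
```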
Okay, and so as we're taking these gradients backwards through our computational graph, there are some patterns that you'll notice, where there's an intuitive interpretation we can give to certain gates. So we saw that the add gate is a gradient distributor, right: when we passed through this addition gate here, which had two branches coming out of it, it took the upstream gradient and just distributed it, passing the exact same thing to both of the branches that were connected. So here are a couple more that we can think about. What does a max gate look like? We have a max gate here at the bottom, right, where the inputs coming in are z and w; z has a value of two, w has a value of negative one, and then we took the max of this, which is two, right, and so we pass this down into the remainder of our computational graph. So now if we're taking the gradients with respect to this, the upstream gradient is, let's say, two coming back,
right, and what does this local gradient look like? So anyone, yes? - [Student] It'll be zero for one, and one for the other? - Right. [student speaking away from microphone] Exactly, so the answer that was given is that z will have a gradient of two, w will have a value, a gradient of zero, and so one of these is going to get the full value of the gradient just passed back, and routed to that variable, and then the other one will have a gradient of zero, and so, so we can think of this as kind
of a gradient router, right, so, whereas the addition node passed back the same gradient to both branches coming in, the max gate will just take the gradient and route it to one of the branches, and this makes sense because if we look at our forward pass, what's happening is that only the value that was the maximum got passed down to the rest of the computational graph, right? So it's the only value that actually affected our function computation at the end, and so it makes sense that when we're passing our gradients back, we just want to
flow it through that branch of the computation. Okay, and so another one: the multiplication gate, which we saw earlier, is there any interpretation of this? [student speaking away from microphone] Okay, so the answer that was given is that the local gradient is basically just the value of the other variable, and that's exactly right. So we can think of this as a gradient switcher, and I guess a scaler, where we take the upstream gradient and we scale it by the value of the other branch.
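One way to summarize these three patterns as code, a minimal sketch (the function names here are mine, not from the slides):

```python
def add_backward(dout):
    # gradient distributor: both inputs receive the upstream gradient as-is
    return dout, dout

def max_backward(x, y, dout):
    # gradient router: only the input that won the max receives the gradient
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    # gradient switcher/scaler: each input gets dout scaled by the other value
    return y * dout, x * dout

print(max_backward(2.0, -1.0, 2.0))  # (2.0, 0.0), matching the z, w example
```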
Okay, and so one other thing to note is that when we have a place where one node is connected to multiple nodes, the gradients add up at this node, right? So at these branches, using the multivariate chain rule, we just take the value of the upstream gradient coming back from each of these nodes, and we add these together to get the total upstream gradient flowing back into this node. You can see this from the multivariate chain rule, and you can also think about it like this: if you're going to change this node
a little bit, it's going to affect both of these connected nodes in the forward pass, right, when you're making your forward pass through the graph. And so then when you're doing backprop, both of these gradients coming back are going to affect this node, and so that's why we sum these up to get the total upstream gradient flowing back into this node. Okay, so any questions about backprop, going through these forward and backward passes? - [Student] So we haven't done anything to actually update the weights. [speaking away from microphone]
- Right, so the question is, we haven't done anything yet to update the values of these weights, we've only found the gradients with respect to the variables, that's exactly right. So what we've talked about so far in this lecture is how to compute gradients with respect to any variables in our function, right, and then once we have these we can just apply everything we learned in the optimization lecture, last lecture, right? So given the gradient, we now take a step in the direction of the gradient in order to update our weight, our parameters, right? So
you can just take this entire framework that we learned about last lecture for optimization, and what we've done here is just learn how to compute the gradients we need for arbitrarily complex functions, right, and so this is going to be useful when we talk about complex functions like neural networks later on. Yeah? - [Student] Do you mind writing out the multivariate chain rule, to help explain this slide a little better? - Yeah, so I can write this maybe on the board. Right, so basically if we're going
to have the gradient of f with respect to some variable x, and x is connected to f through intermediate variables q-i, then we can write df over dx as the sum over i of df over dq-i, times dq-i over dx. So this is saying that if x is connected to multiple elements, in this case the different q-is, then the chain rule takes the effect of each of these intermediate variables on our final output f, and compounds each one with the local effect of our variable x on that intermediate value, right? So yeah, it's basically just summing all these up
together. Okay, so now that we've, you know, done all these examples in the scalar case, we're going to look at what happens when we have vectors, right? So now if our variables x, y and z, instead of just being numbers, we have vectors for these. And so everything stays exactly the same, the entire flow, the only difference is that now our gradients are going to be Jacobian matrices, right, so these are now going to be matrices containing the derivative of each element of, for example z with respect to each element of x. Okay, and so
to, you know, so give an example of something where this is happening, right, let's say that we have our input is going to now be a vector, so let's say we have a 4096-dimensional input vector, and this is kind of a common size that you might see in convolutional neural networks later on, and our node is going to be an element-wise maximum, right? So we have f of x is equal to the maximum of x compared with zero element-wise, and then our output is going to be also a 4096-dimensional vector. Okay, so in this case,
what's the size of our Jacobian matrix? Remember I said earlier, the Jacobian matrix is going to be, like each row is, it's going to be partial derivatives, a matrix of partial derivatives of each dimension of the output with respect to each dimension of the input. Okay, so the answer I heard was 4,096 squared, and that's, yeah, that's correct. So this is pretty large, right, 4,096 by 4,096 and in practice this is going to be even larger because we're going to work with many batches of, you know, of, for example, 100 inputs at the same time,
right, and we'll put all of these through our node at the same time to be more efficient, and so this is going to scale this by 100, and in practice our Jacobian's actually going to turn out to be something like 409,000 by 409,000 right, so this is really huge, and basically completely impractical to work with. So in practice though, we don't actually need to compute this huge Jacobian most of the time, and so why is that, like, what does this Jacobian matrix look like? If we think about what's happening here, where we're taking this element-wise
maximum, and we think about what each of the partial derivatives is, right, which dimensions of the input affect which dimensions of the output? What sort of structure can we see in our Jacobian matrix? [student speaking away from microphone] Okay, so I heard that it's diagonal, right, exactly. So because this is element-wise, each element of the input, say the first dimension, only affects that corresponding element in the output, right? And so because of that, our Jacobian matrix is just going to be a diagonal matrix. And so in practice then, we don't actually have to
write out and formulate this entire Jacobian; we just need to know the effect of x on the output, and we can fill those values in as we're computing the gradient.
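In code, for this element-wise maximum the backward pass really is just a mask, rather than anything involving a 4096-by-4096 matrix; a small sketch:

```python
import numpy as np

x = np.random.randn(4096)
out = np.maximum(0, x)        # forward: element-wise max(0, x)

dout = np.random.randn(4096)  # upstream gradient, same shape as the output
dx = dout * (x > 0)           # the diagonal Jacobian collapses to a mask
```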
Okay, so now we're going to go through a more concrete vectorized example of a computational graph. Let's look at a case where we have the function f of x and W equal to the squared L-two norm of W multiplied by x, and in this case we're going to say x is n-dimensional and W is n by n. So again, our first step is writing out the computational graph: we have W multiplied by x, and then followed by what I'm just going to call L-two. And now let's also fill out some values for this, so let's say W is this two-by-two matrix, and x is this two-dimensional vector, right? And we can again label our intermediate nodes. So our intermediate node after the multiplication is going to be q; we have q equals W times x, which we can write
out element-wise this way, where the first element is just W-one-one times x-one plus W-one-two times x-two, and so on, and then we can express f in relation to q, right? So looking at the second node, we have f of q equal to the squared L-two norm of q, which is q-one squared plus q-two squared. Okay, so we fill this in, right, we get q and then we get our final output. Okay, so now let's do backprop through this, right? So again, this is always the first step: we have the gradient with respect
to our output is just one. Okay, so now let's move back one node, so now we want to find the gradient with respect to q, right, our intermediate variable before the L-two. And so q is a two-dimensional vector, and what we want to do is we want to find how each element of q affects our final value of f, right, and so if we look at this expression that we've written out for f here at the bottom, we can see that the gradient of f with respect to a specific q-i, let's say q-one, is just
going to be two times q-i, right? This is just taking this derivative here, and so we have this expression for, with respect to each element of q-i, we could also, you know, write this out in vector form if we want to, it's just going to be two times our vector of q, right, if we want to write this out in vector form, and so what we get is that our gradient is 0.44, and 0.52, this vector, right? And so you can see that it just took q and it scaled it by two, right? Each element
is just multiplied by two. So the gradient of a vector is always going to be the same size as the original vector, and each element of this gradient is going to, it means how much of this particular element affects our final output of the function. Okay, so now let's move one step backwards, right, what's the gradient with respect to W? And so here again we want to use the same concept of trying to apply the chain rule, right, so we want to compute our local gradient of q with respect to W, and so let's look
at this again element-wise, and if we do that, let's see what's the effect of each q, right, each element of q with respect to each element of W, and so this is going to be the Jacobian that we talked about earlier, and if we look at this in this multiplication, q is equal to W times x, right, what's the derivative, or the gradient of the first element of q, so our first element up top, with respect to W-one-one? So q-one with respect to W-one-one? What's that value? X-one, exactly. Yeah, so we know that this is
x-one, and we can write this out more generally: the gradient of q-k with respect to W-i,j is equal to x-j when k equals i, and zero otherwise. And now if we want to find the gradient of f with respect to each W-i,j, we can use the chain rule that we talked about earlier, where we compound df over dq-k for each element of q with dq-k over dW-i,j, right? So we find the effect of each element of W on each element of q, and sum this across all
q. And if you write this out, it gives the expression two times q-i, times x-j. Okay, so filling this out, we get this gradient with respect to W, and again we can compute it element-wise, or we can look at the expression we've derived and write it out in vectorized form, right? And remember, the important thing is that the gradient with respect to a variable should always have the same shape as the variable; this is something really useful in practice
to sanity check, right: once you've computed what your gradient should be, check that it's the same shape as your variable, because again, each element of your gradient quantifies how much that element affects your final output. Yeah? [student speaking away from microphone] Oh, the one with the subscript k equals i, that's an indicator function, so this is saying that it's just one if k equals i. Okay, so let's see, we've done that, and now one more example. The last thing we need to
find is the gradient with respect to x. So here if we compute the partial derivatives, we can see that dq-k over dx-i is equal to W-k,i, right, in the same way as we did it for W, and then again we can just use the chain rule to get the total expression. And so this is going to be the gradient with respect to x, again of the same shape as x, and we can also write this out in vectorized form if we want. Okay, so any questions about this, yeah? [student speaking away from
microphone] So we are computing the Jacobian; let me go back here. Right, so we have these partial derivatives of q-k with respect to x-i, and these form the entries of your Jacobian, right? And in practice what we're going to do is basically take that, and you can see it up there in the chain rule: the vectorized expression of the gradient with respect to x has the Jacobian here, which is this transposed matrix, so you can write it
out in vectorized form. [student speaking away from microphone] Well, in this case the matrix is going to be the same size as W, right, so it's not actually a large matrix in this case.
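Here's this whole vectorized example as a NumPy sketch; the transcript doesn't read the slide's numbers out loud, so the W and x entries below are reconstructed to match the q = [0.22, 0.26] and 2q = [0.44, 0.52] values quoted above:

```python
import numpy as np

W = np.array([[0.1, 0.5],
              [-0.3, 0.8]])
x = np.array([0.2, 0.4])

# Forward pass for f(x, W) = ||W.x||^2.
q = W.dot(x)          # [0.22, 0.26]
f = np.sum(q ** 2)    # 0.116

# Backward pass.
dq = 2.0 * q          # [0.44, 0.52]
dW = np.outer(dq, x)  # dW[i, j] = 2 * q_i * x_j, same shape as W
dx = W.T.dot(dq)      # W transposed times dq, same shape as x
```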
Okay, so the way that we've been thinking about this is as a really modularized implementation, where in our computational graph we look at each node locally, compute the local gradients, and chain them with the upstream gradients coming down, and so you can think of this as basically a forward and a backward API, right? In the forward pass we implement a function computing the output of this node, and then in the backward pass we compute the gradient. And when we actually implement this in code, we're going to do it in exactly the same way. So for each gate, if we implement a forward function and a backward function, where the backward function computes the chain rule, then if we have our entire graph, we can just make a forward pass through the entire graph by iterating through all the nodes in the graph, all
the gates. Here I'm going to use the words gate and node kind of interchangeably. We can iterate through all of these gates and just call forward on each of them, right? And we just want to do this in topologically sorted order, so that we process all of the inputs coming into a node before we process that node. And then going backwards, we just go through all of the gates in the reverse sorted order and call backward on each of these gates.
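As a rough sketch of that driver loop (a hypothetical wrapper, not from the slides; it assumes each gate is already wired to its inputs and that the list is topologically sorted):

```python
class ComputationalGraph:
    def __init__(self, gates):
        self.gates = gates                  # assumed topologically sorted

    def forward(self):
        for gate in self.gates:             # inputs before the nodes using them
            gate.forward()
        return self.gates[-1].output        # the loss at the end of the graph

    def backward(self):
        for gate in reversed(self.gates):   # reverse topological order
            gate.backward()
```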
Okay, and so if we look at the implementation for a particular gate, for example this MultiplyGate here, we want to implement the forward pass, right, so it gets x and y as inputs and returns the value of z; and then when we go backwards, we get as input dz, which is our upstream gradient, and we want to output the gradients on the inputs x and y to pass down, so we're going to output dx and dy. In this example everything is back to the scalar case, and if we look at this in the forward
pass, one thing that's important is that we should cache the values of the forward pass, right, because we end up using them in the backward pass a lot of the time. So here in the forward pass we want to cache the values of x and y, and in the backward pass, using the chain rule, we take the value of the upstream gradient and scale it by the value of the other branch. So for dx we'll take the value of self.y that we kept, and multiply
it by dz coming down, and the same for dy.
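Written out, that gate looks roughly like this (a scalar sketch in the spirit of the slide's pseudocode):

```python
class MultiplyGate:
    def forward(self, x, y):
        z = x * y
        self.x, self.y = x, y   # cache the inputs; the backward pass needs them
        return z

    def backward(self, dz):
        dx = self.y * dz        # chain rule: local gradient y times upstream dz
        dy = self.x * dz        # local gradient x times upstream dz
        return dx, dy
```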
Okay, so if you look at a lot of deep learning frameworks and libraries, you'll see that they follow exactly this kind of modularization. For example, Caffe is a popular deep learning framework, and if you look through the Caffe source code you'll get to a directory called layers. Layers are basically computational nodes, though usually slightly more complex ones, like the sigmoid that we talked about earlier, and you'll see basically a whole list of all different kinds of computational nodes, right? So you might have the sigmoid, there's a convolution layer, there's an Argmax layer; you have all of these layers, and if you dig into each of them, they're exactly implementing a forward pass and a backward pass. And all of these are called when we do a forward and backward pass through the entire network that we've formed, so our network is basically just going to be a stack of all of the different layers that
we choose to use in the network. So for example, if we look at a specific one, in this case a sigmoid layer, you'll see that in the sigmoid layer, right, we've talked about the sigmoid function, you'll see that there's a forward pass which basically computes exactly the sigmoid expression, and then a backward pass, right, where it is taking as input something, basically a top_diff, which is our upstream gradient in this case, and multiplying it by a local gradient that we compute. So in assignment one you'll get practice with this kind of, this computational graph way
of thinking where, you know, you're going to be writing your SVM and Softmax classes, and taking the gradients of these. And so again, remember always you want to first step, represent it as a computational graph, right? Figure out what are all the computations that you did leading up to the output, and then when you, when it's time to do your backward pass, just take the gradient with respect to each of these intermediate variables that you've defined in your computational graph, and use the chain rule to link them all together. Okay, so summary of what we've
talked about so far. When we get down to, you know, working with neural networks, these are going to be really large and complex, so it's going to be impractical to write down the gradient formula by hand for all your parameters. So in order to get these gradients, right, we talked about how, what we should use is backpropagation, right, and this is kind of one of the core techniques of, you know, neural networks, is basically using backpropagation to get your gradients, right? And so this is a recursive application of the chain rule where we have this
computational graph, and we start at the back and we go backwards through it to compute the gradients with respect to all of the intermediate variables, which are your inputs, your parameters, and everything else in the middle. And we've also talked about how really this implementation and this graph structure, each of these nodes is really, you can see this as implementing a forward and backwards API, right? And so in the forward pass we want to compute the results of the operation, and we want to save any intermediate values that we might want to use later in
our gradient computation, and then in the backwards pass we apply this chain rule and we take this upstream gradient, we chain it, multiply it with our local gradient to compute the gradient with respect to the inputs of the node, and we pass this down to the nodes that are connected next. Okay, so now finally we're going to talk about neural networks. All right, so really, you know, neural networks, people draw a lot of analogies between neural networks and the brain, and different types of biological inspirations, and we'll get to that in a little bit, but
first let's talk about it just as a function, as a class of functions, without all of the brain stuff. So far we've worked a lot with this linear score function, right? f equals W times x, and we've been using this as a running example of a function that we want to optimize. So instead of using that single linear transformation, to get a neural network in its simplest form we can just stack two of these together, right? Just a linear transformation
on top of another one in order to get a two-layer neural network, right? And so what this looks like is, first we have a matrix multiply of W-one with x, then we get this intermediate variable, and then we have this non-linear function: a max of zero and the output of that linear layer. And it's really important to have these non-linearities in place, which we'll talk about more later, because otherwise if you just stack linear layers on top of each other, they're just going to collapse to, like, a single
linear function. Okay, so we have our first linear layer and then we have this non-linearity, right, and then on top of this we'll add another linear layer. And then from here, finally we can get our score function, our output vector of scores. So basically, like, more broadly speaking, neural networks are a class of functions where we have simpler functions, right, that are stacked on top of each other, and we stack them in a hierarchical way in order to make up a more complex non-linear function, and so this is the idea of having, basically multiple stages
of hierarchical computation, right? And so, you know, so this is kind of the main way that we do this is by taking something like this matrix multiply, this linear layer, and we just stack multiple of these on top of each other with non-linear functions in-between, right? And so one thing that this can help solve is if we look, if we remember back to this linear score function that we were talking about, right, remember we discussed earlier how each row of our weight matrix W was something like a template. It was a template that sort of
expressed, you know, what we're looking for in the input for a specific class, right, so for example, you know, the car template looks something like this kind of fuzzy red car, and we were looking for this in the input to compute the score for the car class. And we talked about one of the problems with this is that there's only one template, right? There's this red car, whereas in practice, we actually have multiple modes, right? We might want, we're looking for, you know, a red car, there's also a yellow car, like all of these are
different kinds of cars, and so what this kind of multiple layer network lets you do is now, you know, each of this intermediate variable h, right, W-one can still be these kinds of templates, but now you have all of these scores for these templates in h, and we can have another layer on top that's combining these together, right? So we can say that actually my car class should be, you know, connected to, we're looking for both red cars as well as yellow cars, right, because we have this matrix W-two which is now a weighting of
all of the entries of our vector h. Okay, any questions about this? Yeah? [student speaking away from microphone] Yeah, so there are a lot of different non-linear functions that you can choose from, and we'll talk in a later lecture about all the different kinds of non-linearities that you might want to use. - [Student] For the pictures in the slide, on the bottom row you have images of your W-one weights, and so maybe you would have images of another vector W-two? - So W-one, because it's directly connected to the
input x, is what's really interpretable, because you can visualize all of these templates. For W-two, well, h is going to be a score of how much of each template you saw, for example; it might be, I don't know, two for the red car and one for the yellow car or something like that. - [Student] Oh, okay, so instead of W-one being just 10 templates, like, you would have a left-facing horse and a right-facing horse, and they'd both be included-- - Exactly, so the question
is basically whether in W-one you could have both left-facing horse and right-facing horse, right, and so yeah, exactly. So now W-one can be many different kinds of templates right? They're not, and then W-two, now we can, like basically it's a weighted sum of all of these templates. So now it allows you to weight together multiple templates in order to get the final score for a particular class. - [Student] So if you're processing an image then it's actually left-facing horse. It'll get a really high score with the left-facing horse template, and a lower score with the
right-facing horse template, and then this will take the maximum of the two? - Right, so okay, so the question is, if our image x is like a left-facing horse and in W-one we have a template of a left-facing horse and a right-facing horse, then what's happening, right? So what happens is yeah, so in h you might have a really high score for your left-facing horse, kind of a lower score for your right-facing horse, and W-two is, it's a weighted sum, so it's not a maximum. It's a weighted sum of these templates, but if you have
either a really high score for one of these templates, or let's say you have, kind of a lower and medium score for both of these templates, all of these kinds of combinations are going to give high scores, right? And so in the end what you're going to get is something that generally scores high when you have a horse of any kind. So let's say you had a front-facing horse, you might have medium values for both the left and the right templates. Yeah, question? - [Student] So is W-two doing the weighting, or is h doing the
weighting? - W-two is doing the weighting, so the question is, "Is W-two doing the weighting or is h doing the weighting?" h is the value, like in this example, h is the value of scores for each of your templates that you have in W-one, right? So h is like the score function, right, it's how much of each template in W-one is present, and then W-two is going to weight all of these, weight all of these intermediate scores to get your final score for the class. - [Student] And which is the non-linear thing? - So the
question is, "which is the non-linear thing?" So the non-linearity usually happens right before h, so h is the value right after the non-linearity. So we're talking about this, like, you know, intuitively as this example of like, W-one is looking for, you know, has these same templates as before, and W-two is a weighting for these. In practice it's not exactly like this, right, because as you said, there's all these non-linearities thrown in and so on, but it has this approximate type of interpretation to it. - [Student] So h is just W-one-x then? - Yeah, yeah, so
the question is, is h just W-one times x? So h is W-one times x, with the max function on top. Okay, so we've talked about this as an example of a two-layer neural network, and we can stack more layers of these to get deeper networks of arbitrary depth, right? We can just do this one more time: add another non-linearity and matrix multiply now by W-three, and now we have a three-layer neural network, right? And so this is where the term deep neural network basically comes from, this idea that you
can stack multiple of these layers for very deep networks. And in the homework you'll get practice writing and training one of these neural networks, I think in assignment two: basically a full implementation of this, using this idea of forward and backward passes and using the chain rule to compute gradients that we've already seen. The entire implementation of a two-layer neural network is actually really simple; it can be done in about 20 lines, and you'll get some practice with this in assignment two, writing out all of these parts.
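As a sketch of what those 20-ish lines can look like, in the spirit of the slide (the shapes and learning rate here are illustrative, and the data is random just to make it runnable):

```python
import numpy as np

# Batch size, input dim, hidden dim, output dim.
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1, w2 = np.random.randn(D_in, H), np.random.randn(H, D_out)

for t in range(2000):
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))   # first linear layer + sigmoid
    y_pred = h.dot(w2)                     # second linear layer: the scores
    loss = np.square(y_pred - y).sum()     # simple L2 loss

    # Backprop: upstream gradient times local gradient, layer by layer.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid local gradient h(1-h)

    w1 -= 1e-4 * grad_w1                   # plain gradient descent step
    w2 -= 1e-4 * grad_w2
```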
Okay, so now that we've seen what neural networks are as a function: we hear people talk a lot about the biological inspirations for neural networks, and even though it's important to emphasize that these analogies are really loose, it's still interesting to understand where some of these connections and inspirations come from, so I'm going to talk briefly about that. If we think about a neuron in a very simple way, here's a diagram of a neuron. We have impulses that are carried toward each neuron; we have a lot of neurons connected together, and each neuron has dendrites, which are what receive the impulses coming into the neuron. Then we have a cell body, which integrates these incoming signals, and after integrating them, the neuron passes an impulse on, carried away from the cell body through the axon to the downstream neurons it's connected to. Now if we look at what we've been doing so far with each computational node, you can see it in a similar way: nodes are connected to each other in the computational graph, and we have inputs, or signals, x-zero, x-one, x-two, coming into a neuron. All of these are combined and integrated together using, for example, our weights W; we do some sort of computation, such as W times x plus b in the computations we've been doing so far. Then we apply an activation function on top, get an output value, and pass it down to the connecting neurons. So you can think about this in a very similar way: the incoming signals connect at the synapses, the synapses connecting the neurons; the dendrites integrate all of this information in the cell body; and then the output is carried onward.
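As a tiny illustration of that analogy, here's what a single artificial "neuron" computes; the sigmoid choice and the names here are just assumptions for the sketch:

```python
import numpy as np

def neuron_forward(x, w, b):
    # Integrate the incoming signals, like the cell body: w . x + b
    cell_body_sum = np.dot(w, x) + b
    # Apply an activation function (sigmoid assumed here); loosely,
    # the output plays the role of a firing rate.
    firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))
    return firing_rate  # passed on to downstream neurons
```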
And if you look at these activation functions: this is what takes all the inputs coming in and outputs one number that goes out downstream, and we've talked about examples like the sigmoid activation function and other kinds of non-linearities. One loose analogy you can draw is that these non-linearities represent something like the firing rate, or spiking rate, of the neuron. Neurons transmit signals to connecting neurons using discrete spikes, and if a neuron is spiking very fast, there's a strong signal being passed on; so we can think of the value after our activation function as, in a sense, the firing rate that we pass on. And in practice, I think neuroscientists who actually study this say that the non-linearity most similar to the way neurons actually behave is the ReLU non-linearity, which is something we're going to look at more later on. It's a function that's zero for all negative values of the input, and then a linear function for everything in the positive regime. We'll talk more about this activation function later on, but in practice it's maybe the one that's most similar to how neurons actually behave.
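For reference, the ReLU mentioned here is just this one-liner; nothing assignment-specific about it:

```python
import numpy as np

def relu(z):
    # Zero for all negative inputs; identity (linear) in the positive regime.
    return np.maximum(0, z)
```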
But it's really important to be extremely careful with making any of these brain analogies, because in practice biological neurons are way more complex than this. There are many different kinds of biological neurons; dendrites can perform really complex non-linear computations; the synapses, the W-zeros from the analogy we drew earlier, are not single weights like we had but really complex non-linear dynamical systems; and interpreting our activation function as a sort of rate code, or firing rate, is also insufficient in practice. A firing rate is probably not a sufficient model of how neurons actually communicate with downstream neurons; as a very simple example, neurons fire at a variable rate, and this variability probably should be taken into account. So it's a much more complex thing than what we're dealing with, and there are references, for example on dendritic computation, that you can look at if you're interested in this topic. In practice, then, we can see how our model may resemble a neuron at this very high level, but neurons are, in practice, much more complicated than that. Okay, so we talked about how there are many different kinds of activation functions that could be used, like the ReLU I mentioned earlier, and we'll talk about all of these different kinds of activation functions, and the choices among them, in much more detail later on. We'll also talk about different kinds of neural network architectures. We gave the example of these fully connected neural networks, where each layer is a matrix multiply.
Now, about the way we actually want to name these: we said two-layer neural network before, and that corresponded to the fact that we have two of these linear layers where we do a matrix multiply; two fully connected layers is what we call these. We could also call this a one-hidden-layer neural network, counting the number of hidden layers instead of the number of matrix multiplies. You can use either, though I think two-layer neural network is a little more commonly used. Likewise, the three-layer neural network we have here can also be called a two-hidden-layer neural network. And we saw that when we do this kind of feed-forward pass through a neural network, each node in the network is basically doing the neuron operation I showed earlier.
What's actually happening is that each hidden layer you can think of as a whole vector, a set of these neurons, and by writing it out with matrix multiplies to compute our neuron values, we can efficiently evaluate an entire layer of neurons: with one matrix multiply we get the output values of a layer of, say, 10 or 50 or 100 neurons. Looking at this again, writing it all out in matrix-vector form: we have our non-linearity f, in this case a sigmoid function, and we take our data x, some input vector of values, apply our first matrix multiply W-one on top of it and then the non-linearity, then a second matrix multiply and non-linearity to get a second hidden layer, h-two, and then a final matrix multiply gives our output. And this is basically all you need to be able to write a neural network; as we saw earlier, for the backward pass you then just use backprop to compute all of the gradients. That's basically all there is to the main idea of what a neural network is.
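Written out as code, that matrix-vector forward pass is just a few lines. The layer sizes below are illustrative assumptions, and f is the sigmoid non-linearity:

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # activation function (sigmoid here)

# Assumed layer sizes for a three-layer network: 3072 -> 100 -> 50 -> 10.
x = np.random.randn(3072)                 # input vector
W1 = 1e-2 * np.random.randn(100, 3072)
W2 = 1e-2 * np.random.randn(50, 100)
W3 = 1e-2 * np.random.randn(10, 50)

h1 = f(W1.dot(x))    # first hidden layer: one matrix multiply evaluates 100 neurons
h2 = f(W2.dot(h1))   # second hidden layer
out = W3.dot(h2)     # final output scores
```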
Okay, so just to summarize: we talked about how we can arrange neurons into these computations of fully-connected, or linear, layers. This abstraction of a layer has the nice property that we can use very efficient vectorized code to compute an entire layer at once. We also talked about how it's important to keep in mind that while neural networks do have some loose analogy to, and inspiration from, biology, they're not really neural; the analogy we're making is a pretty loose one. And next time we'll talk about convolutional neural networks. Okay, thanks.