12a: Neural Nets

MIT OpenCourseWare

*NOTE: These videos were recorded in Fall 2015 to update the Neural Nets portion of the class. MIT 6...

Video Transcript:

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. It was in 2010, yes, that's right, it was in 2010. We were having our annual discussion about what we would dump from 6.034 in order to make room for some other stuff, and we almost killed off neural nets. That might seem strange because, you know, our heads are stuffed with neurons; if

you open up your skull and pluck them all out, you don't think anymore. So it would seem that neural nets would be a fundamental and unassailable topic. But many of us felt that the neural models of the day weren't much in the way of faithful models of what actually goes on inside our heads, and besides that, nobody had ever made a neural net that was worth a darn for doing anything. So we almost killed it off. But then we said, well, everybody would feel cheated if they take a course in artificial intelligence and don't learn anything about neural

nets, and then they'll go off and invent them themselves and waste all sorts of time. So we kept the subject in. Then two years later, Geoff Hinton from the University of Toronto stunned the world with some neural net work he had done on recognizing and classifying pictures, and he published a paper from which I am now going to show you a couple of examples. Geoff's neural net, by the way, had 60 million parameters in it, and its purpose was to determine which of a thousand categories best characterized a picture. So there it

is. There's a sample of things that the Toronto neural net was able to recognize or make mistakes on. I'm going to blow that up a little bit. I think I'm going to look particularly at the example labeled container ship. So what you see here is that the program returned its best estimates of what it was, the first five, ranked according to the likelihood, or probability, or the certainty that it felt that a particular class was characteristic of the picture. And so you can see this one is extremely confident that it's a container ship. It

also was fairly moved by the idea that it might be a lifeboat. Now, I'm not sure about you, but I don't think this looks much like a lifeboat, but it does look like a container ship. So if I look at only the best choice, it looks pretty good. Here are the other things that it did pretty well on, where it got the right answer as its first choice. So over on the left you see that it's decided that the picture is a picture of a mite. The mite is not anywhere near the center of

the picture, but somehow it managed to find it. The container ship again. There's a motor scooter, with a couple of people sitting on it, but it correctly characterized the picture as a motor scooter. And then on the right, a leopard, and everything else is a cat of some sort. So it seems to be doing pretty well; in fact, it does do pretty well. But anyone who does this kind of work has an obligation to show you some of the stuff it doesn't work so well on, or doesn't get quite right, and so these pictures also occurred

in Hinton's paper. So the first one is characterized as a grill, but the right answer was supposed to be convertible. Oh no, yes, yeah, the right answer was convertible. In the second case, the characterization is of a mushroom, and the alleged right answer is agaric. Is that pronounced right? It turns out that's a kind of mushroom, so no problem there. In the next case, it said it was a cherry, but it was supposed to be a Dalmatian. Now I think a Dalmatian is a perfectly legitimate answer for that particular picture, so it's hard to fault it for that.

In the last case, the correct answer was not in any of the top five. I'm not sure if you've ever seen a Madagascar cat, but that's a picture of one, and it's interesting to compare that with the first choice of the program, the squirrel monkey. Here are the two side by side. So in a way, it's not surprising that it thought that the Madagascar cat was a picture of a squirrel monkey. So, pretty impressive. It blew away the competition. It did so much better that the second place wasn't even close, and for the first

time demonstrated that a neural net could actually do something. And since that time, in the three years since that time, there's been an enormous amount of effort put into neural net technology, which some say is the answer. So what we're going to do today and tomorrow is have a look at this stuff and ask ourselves why it works, when it might not work, what needs to be done, what has been done, and all those kinds of questions will emerge. So I guess the first thing to do is to think about what it is that we

are being inspired by. We're being inspired by those things that are inside our head, all 10 to the 11th of them. And so if we take one of those 10 to the 11th and look at it, you know from 7.0-something-or-other approximately what a neuron looks like. And by the way, I'm going to teach you in this lecture how to answer questions about neurobiology with an 80% probability that you will give the same answer as a neurobiologist. OK, so let's go. So here's a neuron. It's got a cell body

and there is a nucleus. And then out here is a long thingamajigger which divides, maybe a little bit but not much, and we call that the axon. So then over here we've got this much more branching type of structure that looks maybe a little bit like so, I don't know, maybe like that. And this stuff branches a whole lot, and that part is called the dendritic tree. Now there are a couple of things we can note about this: these guys are connected axon to dendrite. So over here there will be a so-called

presynaptic thickening, and over here will be some other neuron's dendrite. And likewise, over here some other neuron's axon is coming in and hitting the dendrite of the one that occupies most of our picture. So if there's enough stimulation from this side, in the axonal tree, or the dendritic tree, then a spike will go down that axon. It acts like a transmission line. And then after that happens, the neuron will go quiet for a while as it's recovering its strength. That's called the refractory period. Now if we look

at that connection in a little more detail, this little piece right here sort of looks like this. Here's the axon coming in. It's got a whole bunch of little vesicles in it, and then there's a dendrite over here. And when the axon is stimulated, it dumps all these vesicles into this inter-synaptic space. For a long time it wasn't known whether those things were actually separated. I think it was Ramón y Cajal who demonstrated that one neuron is actually not part of the next one; they're actually separated by these

synaptic gaps. So there it is. How can we model that sort of thing? Well, here's what's usually done, here's what is done in the neural net literature. First of all, we've got some kind of binary input, because these things either fire or they don't fire, so it's an all-or-none kind of situation. So over here we have some kind of input value, we'll call it x1, and it's either a 0 or a 1. So it comes in here, and then it gets multiplied times some kind of weight, we'll call it w1. So

this part here is modeling this synaptic connection. It may be more or less strong; if it's more strong, this weight goes up, and if it's less strong, this weight goes down. So that reflects the influence of the synapse on whether or not the whole axon decides it's stimulated. Then we've got other inputs down here, xn, also 0 or 1, also multiplied by a weight; we'll call that wn. And now we have to somehow represent the way in which these inputs are collected

together, how they have collective force. And we're going to model that very, very simply, just by saying OK, we'll run it through a summer, like so. But then we have to decide if the collective influence of all those inputs is sufficient to make the neuron fire. So we're going to do that by running this guy through a threshold box, like so. Here is what the box looks like in terms of the relationship between input and output, and what you can see here is that nothing happens until the input exceeds some threshold T.
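As a minimal sketch of the model on the board (multipliers, a summer, and a threshold box), the following Python function is one way to write it down. The function name and the example weights and threshold are illustrative assumptions, not values from the lecture.

```python
def threshold_neuron(inputs, weights, threshold):
    """Binary neuron: multiply, sum, then pass through a threshold box.

    inputs    -- sequence of 0/1 values (x1 ... xn)
    weights   -- one multiplier per input (w1 ... wn)
    threshold -- the value T that the weighted sum must exceed
    """
    total = sum(x * w for x, w in zip(inputs, weights))  # the summer
    return 1 if total > threshold else 0                 # all-or-none output


# Hypothetical example: two inputs with equal weights and T = 1.5,
# so the neuron fires only when both inputs are on.
both_on = threshold_neuron([1, 1], [1.0, 1.0], 1.5)
one_on = threshold_neuron([1, 0], [1.0, 1.0], 1.5)
```

With these assumed numbers the weighted sum is 2.0 when both inputs are on and 1.0 when only one is, so the box stays at 0 until the sum exceeds T.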

If that happens, then the output z is a 1; otherwise it's a 0. So binary in, binary out. We model the synaptic weights by these multipliers, we model the cumulative effect of all that input to the neuron by a summer, and we decide if it's going to be an all-or-none 1 by running it through this threshold box and seeing if the sum of products adds up to more than the threshold. If so, we get a 1. So what, in the end, are we in fact modeling? Well, with this model we have number one, all or none;

number two, cumulative influence; and number three, oh, I suppose, synaptic weight. But that's not all there might be to model in a real neuron. We might want to deal with the refractory period. In these biological models that we build neural nets out of, we might want to model axonal bifurcation. We do get some division in the axon of the neuron, and it turns out that that pulse will either go down one branch or the other, and which branch it goes down depends on electrical activity in the vicinity of the division. So these things might

actually be fantastic coincidence detectors, but we're not modeling that; we don't know how it works. So axonal bifurcation might be modeled. We might also have a look at time patterns. See, what we don't know is whether the timing of the arrival of these pulses in the dendritic tree has anything to do with what that neuron is going to recognize. All right, so a lot of unknowns here. And now I'm going to show you how to answer a question about neurobiology with 80% probability you'll get it right: just say we don't

know, and that will be, with 80% probability, what the neurobiologist would say. So this is a model inspired by what goes on in our heads, but it's far from clear if what we're modeling is the essence of why those guys make possible what we can do. Nevertheless, that's where we're going to start. So we've got this model of what a neuron does. So what about what a collection of these neurons does? Well, we can think of your skull as a big box full of neurons. Maybe

a better way to think of this is that your head is full of neurons, and they in turn are full of weights and thresholds, like so. So into this box come a variety of inputs, x1 through xm, and these find their way to the inside of this gaggle of neurons, and out here come a bunch of outputs, z1 through zn. And there are a whole bunch of these, maybe like so, and there are a lot of inputs, like so, and somehow these inputs, through the influence of the weights and the thresholds, come out as a

set of outputs. So we can write that down a little fancier by just saying that z is a vector which is a function of, well, certainly the input vector, but also the weight vector and the threshold vector. So that's all a neural net is. And when we train a neural net, all we're going to be able to do is adjust those weights and thresholds so that what we get out is what we want. So a neural net is a function approximator. It's good to think about that: it's a function approximator. So maybe

we've got some sample data that gives us an output vector that's desired as another function of the input, forgetting about what the weights and the thresholds are. That's what we want to get out, and so how well we're doing can be figured out by comparing the desired value with the actual value. So we might think, then, that we can get a handle on how well we're doing by constructing some performance function, which is determined by the desired vector and the input vector, sorry, the desired vector and the actual output vector, for some

particular input or for some set of inputs. And the question is, what should that function be? How should we measure performance, given that we have what we want out here and what we actually got out here? Well, one simple thing to do is just to measure the magnitude of the difference. That makes sense. But of course that would give us a performance function that, as a function of the distance between those vectors, would look like this. But that turns out to be mathematically inconvenient in the end. So how do you think we're

going to turn it up a little bit? What's that? Well, I don't know, how about this: we square it. That way we're going to go from this little sharp point down there to something that looks more like that. So it's best when the difference is 0, of course, and it gets worse as you move away from 0. But what we're trying to do here is get to a minimum value, and I hope you'll forgive me, I just don't like the direction we're going here, because I like to think in terms of improvement as

going uphill instead of downhill. So I'm going to dress this up one more step and put a minus sign out there. And then our performance function looks like this. It's always negative, and the best value it could possibly be is 0. So that's what we're going to use, just because I am who I am. It doesn't matter, right? Either way you're trying to minimize or maximize some performance function. OK, so what are we going to do? I guess what we could do is, we could treat this thing, well, we already know what

to do. I'm not even sure why we're devoting a lecture to this, because it's clear that what we're trying to do is take our weights and our thresholds and adjust them so as to maximize performance. So we can make a little contour map here with a simple neural net with just two weights in it, and maybe it looks like this contour map. And at any given time we've got a particular w1 and a particular w2, and we're trying to find a better w1 and w2. So here we are

right now, and there's the contour map, and this is 6.034, so what do we do? A simple matter of hill climbing, right? So we'll take a step in every direction. If we take a step in that direction, not so hot; that actually goes pretty bad. These two are really ugly. Ah, but that one takes us up the hill a little bit. So we're done, except that, as I just mentioned, Hinton's neural net has 60 million parameters in it, so we're not going to hill climb with 60 million parameters, because it explodes

exponentially in the number of weights you've got to deal with, the number of steps you can take. So this approach is computationally intractable. Fortunately, you've all taken 18.01 or the equivalent thereof, so you have a better idea. Instead of just taking a step in every direction, what we're going to do is take some partial derivatives, and we're going to see what they suggest to us in terms of how we're going to get around in this space. So we might have the partial of that performance function up there with

respect to w1, and we might also take the partial derivative of that guy with respect to w2. And these will tell us how much improvement we're getting by making a little movement in those directions, how much a change is, given that we're just going right along the axis. So maybe what we ought to do is, if this guy is much bigger than this guy, it would suggest that we mostly want to move in this direction, or, to put it in 18.01 terms, what we're going to do is follow the

gradient. And so the change in the w vector is going to be equal to this partial derivative times i plus this partial derivative times j. So what we're going to end up doing in this particular case, by following that formula, is moving off in that direction, right up the steepest part of the hill. And how much we move is a question, so let's just have a rate constant r that decides how big our step is going to be. And now you think we were done. Well, too bad for our side, we're not

done. There's a reason why we can't use gradient ascent, or, in the case that I've drawn, gradient descent if we take the performance function the other way. Why can't we use it? The remark is local maxima, and that is certainly true, but it's not our first obstacle. When does gradient descent work? Ah, there's something wrong with our function. That's right, it's nonlinear, or rather, it's discontinuous. So gradient ascent requires a continuous space, a continuous surface. So too bad for our side, it isn't. So what to do? Well, nobody knew what to do for 25

years. People were screwing around with training neural nets for 25 years before Paul Werbos, sadly at Harvard, in 1974 gave us the answer. And now I want to tell you what the answer is. The first part of the answer is, those thresholds are annoying. They're just extra baggage to deal with. What we'd really like, instead of z being a function f of x, w, and T, is to have z prime be a function f prime of x and the weights. But we've got to account for the threshold somehow. So here's how you do that. What

you do is you say, let us add another input to this neuron, and it's going to have a weight w0, and it's going to be connected to an input that's always minus 1. You with me so far? Now what we're going to do is say, let w0 equal T. What does that do to the threshold? What it does is take that threshold and move it back to 0. So this little trick here takes this pink threshold and redoes it

so that the new threshold box looks like this. Think about it: if this is T and this is minus 1, then this is minus T, and so this thing ought to fire if the sum is over 0. So it makes sense, and it gets rid of the threshold thing for us. So now we can just think about weights. But still, we've got that step function there, and that's not good. So what we're going to do is smooth

that guy out. So this is trick number two. Instead of a step function, we're going to have this thing we lovingly call a sigmoid function, because it's kind of an S-type shape. And the function we're going to use is this one, well, better make it a little bit different: 1 over 1 plus e to the minus whatever the input is. Let's call the input alpha. Does that make sense? If alpha is 0, that's 1 over 1 plus 1, so it's 1/2. If alpha is extremely big, then e to the minus alpha

is extremely small, and it becomes 1; it goes up to an asymptotic value of 1 here. On the other hand, if alpha is extremely negative, then e to the minus alpha is extremely big, and it goes to 0 asymptotically. So we've got the right look to that function. It's a very convenient function. Did God say that neurons ought to be, that the threshold ought to work like that? No, God didn't say so. Who said so? The math says so. It has the right shape and look, and it turns out to have the right math, as

we'll see in a moment. OK, so let's see, where are we? We decided that what we'd like to do is take these partial derivatives. We noted that it was awkward to have those thresholds, so we got rid of them. And we noted it was impossible to have the step function, so we got rid of it. Now we're in a situation where we can actually take those partial derivatives and see if it gives us a way of training the neural net so as to bring the actual output into alignment with what we desire. All right, so to deal with

that, we're going to have to work with the world's simplest neural net. Now, if we've got one neuron, it's not a net, but if we've got two neurons, we've got a net, and it turns out that's the world's simplest neural net. So we're going to look at it: not 60 million parameters, but just a few, actually just two parameters. So let's draw it out. We've got an input x that goes into a multiplier, and it gets multiplied times w1, and that goes into a sigmoid box, like so. We'll call this p1, by the way, product number

one. Out here comes y. y gets multiplied times another weight, we'll call that w2. That produces another product, which we'll call p2, and that goes into a sigmoid box, and then that comes out as z. And z is the number that we use to determine how well we're doing. And our performance function P is going to be minus one half, because I like things going in that direction, times the difference between the desired output and the actual output, squared. OK, so now let's decide what

those partial derivatives are going to be. Let me do it over here. So what are we trying to compute? The partial of the performance function P with respect to w2. OK, well, let's see. We're trying to figure out how much this wiggles when we wiggle that. But, you know, it goes through this variable p2, and so maybe what we can do is figure out how much this wiggles, how much z wiggles, when we wiggle p2, and then how much p2 wiggles when we wiggle w2, and just multiply those together.
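The "wiggle" argument can be checked numerically. The sketch below, under assumed example values for x, d, w1, and w2 (none of them from the lecture), estimates each wiggle ratio with a small central difference and compares the product of the pieces with the wiggle measured directly from w2 to P. The helper name `slope` is hypothetical.

```python
import math


def sigmoid(alpha):
    return 1.0 / (1.0 + math.exp(-alpha))


def performance(z, d):
    # P = -1/2 (d - z)^2, as defined on the board
    return -0.5 * (d - z) ** 2


def slope(f, at, h=1e-6):
    """Central-difference estimate of how much f wiggles when its input wiggles."""
    return (f(at + h) - f(at - h)) / (2.0 * h)


# Assumed example values, chosen only for illustration.
x, d = 1.0, 1.0
w1, w2 = 0.3, -0.4

y = sigmoid(w1 * x)   # output of the first sigmoid box
p2 = w2 * y           # second product
z = sigmoid(p2)       # output of the second sigmoid box

# Wiggle each link of the chain separately and multiply the results.
dP_dz = slope(lambda z_: performance(z_, d), z)
dz_dp2 = slope(sigmoid, p2)
dp2_dw2 = slope(lambda w: w * y, w2)
chained = dP_dz * dz_dp2 * dp2_dw2

# Wiggle w2 and watch P directly.
direct = slope(lambda w: performance(sigmoid(w * y), d), w2)
```

Up to the rounding error of the finite differences, `chained` and `direct` agree, which is all the multiplication of wiggles claims.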

I forget, what's that called? 18.01, something or other... the chain rule. So what we're going to do is rewrite that partial derivative using the chain rule, and all it's doing is saying that there's an intermediate variable, and we can compute how much that end wiggles with respect to how much that end wiggles by multiplying how much the other guys wiggle. Let me write it down; it makes more sense in mathematics. So that's going to be equal to the partial of P with respect to z times the partial of P,

oh sorry, partial of z with respect to p2... whoa, keep me on track here: partial of z with respect to w2. Now I'm going to do something for which I will hate myself: I'm going to erase something on the board. I don't like to do that. But you know what I'm going to do, don't you? I'm going to say that this is true by the chain rule. But look, I could take this guy here and screw around with it with the chain rule too. And in fact, what I'm going to do is I'm going

to replace that with partial of z with respect to p2 and partial of p2 with respect to w2. So I didn't erase it after all, but you can see what I'm going to do next. I'm going to do the same thing with the other partial derivative, but this time, instead of writing down and writing over, I'm just going to expand it all out in one go, I think. So partial of P with respect to w1 is equal to the partial of P with respect to z, the partial of z with respect to p2, the partial

of p2 with respect to what? y. Partial of y with respect to p1, partial of p1 with respect to w1. So that's just going like a zipper down that string of variables, expanding each by using the chain rule until we got to the end. So there are some expressions that provide those partial derivatives. But now, if you'll forgive me, it was convenient to write them out that way, since that matched the intuition in my head, but I'm just going to turn them around. It's just a product.
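Written out in symbols, and assuming the names used on the board (P for performance, y and z for the sigmoid outputs, p1 and p2 for the products), the two spoken expansions read as follows. This is only a transcription of the chain-rule products just described, not an additional result.

```latex
\begin{align*}
\frac{\partial P}{\partial w_2}
  &= \frac{\partial P}{\partial z}\,
     \frac{\partial z}{\partial p_2}\,
     \frac{\partial p_2}{\partial w_2} \\[4pt]
\frac{\partial P}{\partial w_1}
  &= \frac{\partial P}{\partial z}\,
     \frac{\partial z}{\partial p_2}\,
     \frac{\partial p_2}{\partial y}\,
     \frac{\partial y}{\partial p_1}\,
     \frac{\partial p_1}{\partial w_1}
\end{align*}
```

The first two factors of each line are identical, and since the factors are ordinary numbers their order can be reversed freely.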

I'm just going to turn them around. So partial p2, partial w2, times partial of z, partial p2, times the partial of P with respect to z. Same thing. And now this one, keep me on track, because if there's a mutation here it will be fatal: partial of p1, partial w1; partial of y, partial p1; partial of p2, partial of y; partial of z, partial of p2; and the partial of the performance function with respect to z. OK, now all we have to do is figure out what those partials are

and we have solved this simple neural net. So that's going to be easy. Where is my board space? Let's see, partial of p2 with respect to what? That's the product. The partial of the performance function with respect to z, oh, now I can see why I wrote it down this way, let's see, it's going to be just d minus z. We can do that one in our head. What about the partial of p2 with respect to w2? Well, p2 is equal to y times w2, so that's easy:

that's just y. Now all we have to do is figure out the partial of z with respect to p2. Oh crap, it's going through this threshold box, so I don't know exactly what that partial derivative is. So we'll have to figure that out, right? Because the function relating them is this guy here, and so we have to figure out the partial of that with respect to alpha. All right, so we've got to do it, there's no way around it. So we have to destroy something. OK, we're going to

destroy our neuron. So the function we're dealing with is, we'll call it beta, equal to 1 over 1 plus e to the minus alpha. And what we want is the derivative with respect to alpha of beta, and that's equal to d by d alpha of, you know, I can never remember those quotient formulas, so I'm going to rewrite it a little different way. I'm going to write it as 1 minus e to the minus alpha, to the minus 1, because I can't remember the formula for differentiating a quotient. OK, so let's differentiate it.
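As a numerical companion to the hand derivation, here is a small sketch of the function beta = 1 / (1 + e^(-alpha)) together with a finite-difference estimate of its slope. The helper name `numerical_slope` and the sample points are assumptions made for illustration.

```python
import math


def sigmoid(alpha):
    """beta = 1 / (1 + e^(-alpha))"""
    return 1.0 / (1.0 + math.exp(-alpha))


def numerical_slope(f, at, h=1e-6):
    """Central-difference estimate of df/d(alpha) at the given point."""
    return (f(at + h) - f(at - h)) / (2.0 * h)


midpoint = sigmoid(0.0)      # 1 / (1 + 1)
far_right = sigmoid(20.0)    # e^(-20) is tiny, so this approaches 1
far_left = sigmoid(-20.0)    # e^(20) is huge, so this approaches 0
slope_at_zero = numerical_slope(sigmoid, 0.0)
```

An estimate like `slope_at_zero` gives something concrete to compare against once the derivative has been worked out on the board.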

So that's equal to minus, 1 minus e to the minus alpha, to the minus 2, and that minus comes out of that part of it. Then we've got to differentiate the inside of that expression, and when we differentiate the inside of that expression we get e to the minus alpha... Yeah? Oh yeah, sorry, thank you, that was one of those fatal mistakes you just prevented: that's 1 plus, and this is 1 plus here too. OK, so we've differentiated that, we've turned that into a minus 2,

we brought the minus sign outside, then we're differentiating the inside. The derivative of an exponential is an exponential. Then we've got to differentiate that guy, and that just helps us get rid of the minus sign we introduced. So that's the derivative. I'm not sure how much that helps, except that I'm going to perform a parlor trick here and rewrite that expression thusly: I'm going to say that's going to be e to the minus alpha over 1 plus e to the minus alpha, times 1 over 1 plus e to the minus alpha. That OK? I've got a lot

of nodding heads here, so I think I'm on safe ground. But now I'm going to perform another parlor trick: I'm going to add 1, which means I also have to subtract 1. That's legitimate, isn't it? So now I can rewrite this as 1 plus e to the minus alpha over 1 plus e to the minus alpha, minus 1 over 1 plus e to the minus alpha, all times 1 over 1 plus e to the minus alpha. Any high school kid could do that; I think I'm on safe ground. Oh wait, this is beta, this is

beta. Oh sorry, wrong side, better make this beta and this 1. Any high school kid could do it. OK, so what we've got, then, is that this is equal to 1 minus beta, times beta. That's the derivative. And that's weird, because the derivative of the output with respect to the input is given exclusively in terms of the output. Strange. It doesn't really matter, but it's a curiosity. And what we get out of this is that that partial derivative there is equal to, well, the output is p2... no, the output is z. So it's z times

1 minus z. So whenever we see the derivative of one of these sigmoids with respect to its input, we can just write the output times 1 minus the output, and we've got it. So that's why it's mathematically convenient: because when we do this differentiation, we get a very simple expression in terms of the output. That's all we really need. So would you like to see a demonstration? This is a demonstration of the world's smallest neural net in action. Where is it? Oh, here we go. So there's our neural net.
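Putting the pieces together, the two-weight net and its partial derivatives can be sketched in a few lines of Python. This is an illustration under assumptions, not the demonstration program shown in class: the function names, the starting weights, the single training sample, and the rate constant are all made up for the example, and the thresholds are left out.

```python
import math


def sigmoid(alpha):
    return 1.0 / (1.0 + math.exp(-alpha))


def forward(x, w1, w2):
    y = sigmoid(w1 * x)   # p1 = w1 * x goes through the first sigmoid box
    z = sigmoid(w2 * y)   # p2 = w2 * y goes through the second sigmoid box
    return y, z


def performance(x, d, w1, w2):
    _, z = forward(x, w1, w2)
    return -0.5 * (d - z) ** 2


def gradient(x, d, w1, w2):
    """Partials of P with respect to w1 and w2, in the turned-around order."""
    y, z = forward(x, w1, w2)
    dP_dz = d - z              # partial of P with respect to z
    dz_dp2 = z * (1.0 - z)     # sigmoid derivative: output times 1 minus output
    dy_dp1 = y * (1.0 - y)
    dP_dw2 = y * dz_dp2 * dP_dz
    dP_dw1 = x * dy_dp1 * w2 * dz_dp2 * dP_dz
    return dP_dw1, dP_dw2


def ascent_step(x, d, w1, w2, rate):
    """Move the weights a little way up the gradient, scaled by the rate constant."""
    g1, g2 = gradient(x, d, w1, w2)
    return w1 + rate * g1, w2 + rate * g2


# Assumed example: one sample, arbitrary starting weights, rate constant 0.5.
x, d = 1.0, 1.0
w1, w2 = 0.3, -0.4
rate = 0.5
before = performance(x, d, w1, w2)
new_w1, new_w2 = ascent_step(x, d, w1, w2, rate)
after = performance(x, d, new_w1, new_w2)
```

Note that `dz_dp2 * dP_dz` appears in both partials, so it only needs to be computed once. Repeating `ascent_step` over many samples is the basic training loop; a full version would also carry the threshold weights on the constant minus 1 inputs.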

What we're going to do is train it to do absolutely nothing: we're going to train it to make the output the same as the input. Not what I'd call a fantastic leap of intelligence, but let's see what happens. Well, nothing's happening... well, it finally got to the point where the maximum error, not the performance but the maximum error, went below a threshold that I had previously determined. So if you look at the input here and compare that with the desired output on the far right, you see it produces an

output which, compared with the desired output, is pretty close. So we can test the other way, like so, and we can see that the desired output is pretty close to the actual output in that case too. And it took 694 iterations to get that done. Let's try it again. Oh, 823. Of course, this is all a consequence of just starting off with random weights. By the way, if you started off with all the weights being the same, what would happen? Nothing, because they'd always stay the same. So you've got to put some randomization into

the beginning. So it took a long time. Maybe the problem is our rate constant is too small. So let's crank up the rate constant a little bit and see what happens. Ah, that was pretty fast. Let's see if it was a consequence of random chance. Run. No, it's pretty fast there: 57 iterations. Third try: 67. So it looks like my initial rate constant was too small. So if 0.5 was not as good as 5.0, why don't we crank it up to 50 and see what happens? Oh, in this

case, 124. Let's try it again. Ah, in this case, 117. So it's actually gotten worse, and not only has it gotten worse, you'll see that there's a little bit of instability showing up as it courses along its way toward a solution. So what it looks like is that if you've got a rate constant that's too small, it takes forever, and if you've got a rate constant that's too big, it can jump too far, as in my diagram, which is somewhere underneath the board: you can go all the way across

the hill and get to the other side. So you have to be careful about the rate constant. So what you really want to do is have your rate constant vary with what's happening as you progress toward optimal performance. So if your performance is going down when you make the jump, you know you've got a rate constant that's too big. If your performance is going up when you make a jump, maybe you want to increase it, bump it up a little bit, until

it doesn't look so good. OK, so is that all there is to it? Well, not quite, because this is the world's simplest neural net, and maybe we ought to look at the world's second simplest neural net. Now let's call this, well, let's call this x. What we're going to do is have a second input, and, I don't know, maybe this is screwy, I'm just going to use color coding here to differentiate between the two inputs and the stuff they go through. Maybe I'll call this z2

and this z1, and this x1 and x2. Now if I do that, if I've got two inputs and two outputs, then my performance function is going to have two numbers in it, the two desired values and the two actual values, and I'm going to have two inputs. But it's the same stuff: I just repeat what I did in white, only I make it orange. Oh, but what happens if I do this, say, put little cross connections in there? So these two

streams are going to interact. And then there might be some, you know, this y can go into another multiplier here and go into a summer here, and likewise this y can go up here and into a multiplier, like so, and there are weights all over the place, like so. This guy goes up into here. And now what happens? Now we've got a disaster on our hands, because there are all kinds of paths through this network, and you can imagine that if this was not just two neurons deep but three neurons deep, what I would

find is expressions that look like that. But you could go this way and then down through here and out here, or you could go this way and then back up through here. So it looks like there is an exponentially growing number of paths through that network, and so we're back to an exponential blowup, and it won't work. Yeah, it won't work, except that we need to let the math sing to us a little bit, and we need to look at the picture. And the reason I turned this guy around was actually because, from the

point of view of letting the math sing to us, this piece here is the same as this piece here. So part of what we needed to do to calculate the partial derivative with respect to w1 has already been done when we calculated the partial derivative with respect to w2. And not only that, if we calculated the partials with respect to these green w's at both levels, what we would discover is that that sort of repetition occurs over and over again. And now I'm going to try to give you an intuitive idea of what's going on

here, rather than just write down the math and salute it. Here's a way to think about it from an intuitive point of view. Whatever happens to this performance function that's in back of these p's here, the stuff over there can influence P only by going through, can influence performance only by going through, this column of p's. And there's a fixed number of those, so it depends on the width, not the depth, of the network. So the influence of that stuff back there on P is going to end up going

through these guys, and so we're going to discover that a lot of what we need to compute in one column has already been computed in the column on the right. So it isn't going to explode exponentially, because, let me say it one more time, the influence of changes in these p's on the performance is all we care about when we come back to this part of the network, because this stuff cannot influence the performance except by going through this column of p's. So it's not

going to blow up exponentially. We're going to be able to reuse a lot of the computation. So it's the reuse principle. Have you ever seen a reuse principle at work before? Not exactly, but you remember that little business about the extended list: we know we've seen something before, so we can stop computing. It's like that. We're going to be able to reuse the computation we've already done to prevent an exponential blowup. By the way, for those of you who know about the fast Fourier transform, same kind of idea:

reuse of partial results. So in the end, what can we say about this stuff? In the end, what we can say is that it's linear in depth. That is to say, if we increase the number of layers, the so-called depth, then we're going to increase the amount of computation necessary in a linear way, because the computation we need in any column is going to be fixed. What about how it goes with respect to the width? Well, with respect to the width, any neuron here can be connected to any neuron in the next

row, so the amount of work we're going to have to do will be proportional to the number of connections. So with respect to width, it's going to be w squared. But the fact is that, in the end, this stuff is readily computed, and this, phenomenally enough, was overlooked for 25 years. So what is it in the end? In the end, it's an extremely simple idea. All great ideas are simple. How come there aren't more of them? Well, because frequently that simplicity involves finding a couple of tricks and making a couple of observations. So

usually we humans hardly ever go beyond one trick or one observation, but if you cascade a few together, sometimes something miraculous falls out that looks, in retrospect, extremely simple. So that's why we've got the reuse principle at work, when we reuse computation. In this case, the miracle was a consequence of two tricks plus an observation. And the overall idea is: all great ideas are simple and easy to overlook for a quarter century.
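The reuse principle can be sketched for a net with cross connections as follows. This is a minimal pure-Python illustration under assumptions: the layer sizes, the random starting weights, the sample input, and the names `random_layers`, `forward`, and `backprop` are all invented for the example, and the threshold weights are omitted to keep it short. Each column's influence on performance (`delta`) is computed once from the column to its right and then reused for every weight feeding that column.

```python
import math
import random


def sigmoid(alpha):
    return 1.0 / (1.0 + math.exp(-alpha))


def random_layers(sizes, seed=0):
    """layers[k][j][i] is the weight from unit i in column k to unit j in column k + 1."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]


def forward(x, layers):
    """Return the outputs of every column, starting with the input itself."""
    activations = [list(x)]
    for weights in layers:
        prev = activations[-1]
        activations.append(
            [sigmoid(sum(w * a for w, a in zip(row, prev))) for row in weights])
    return activations


def performance(x, d, layers):
    z = forward(x, layers)[-1]
    return -0.5 * sum((di - zi) ** 2 for di, zi in zip(d, z))


def backprop(x, d, layers):
    """Partials of P for every weight, found in one sweep from right to left."""
    acts = forward(x, layers)
    # delta[j] is the partial of P with respect to the product p_j
    # feeding sigmoid j in the rightmost column.
    delta = [(di - zi) * zi * (1.0 - zi) for di, zi in zip(d, acts[-1])]
    grads = [None] * len(layers)
    for k in range(len(layers) - 1, -1, -1):
        prev = acts[k]
        grads[k] = [[dj * ai for ai in prev] for dj in delta]
        if k > 0:
            # Reuse: the column to the left needs only this column's deltas,
            # no matter how many columns lie further to the right.
            delta = [
                sum(layers[k][j][i] * delta[j] for j in range(len(delta)))
                * prev[i] * (1.0 - prev[i])
                for i in range(len(prev))
            ]
    return grads


# Assumed example: 2 inputs, a middle column of 3 neurons, 2 outputs.
layers = random_layers([2, 3, 2])
x, d = [0.2, 0.9], [1.0, 0.0]
grads = backprop(x, d, layers)
```

Each weight is touched a fixed number of times in the sweep, so the work grows linearly with the number of columns and with the number of connections between adjacent columns, which is the width squared for fully connected columns.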