The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

Well, what we're going to do today is climb a pretty big mountain, because we're going to go from a neural net with two parameters to discussing the kind of neural nets in which people end up dealing with 60 million parameters. So it's going to be a pretty big jump.
But along the way there are a couple of things I wanted to underscore from our previous discussion. Last time I tried to develop some intuition for the kinds of formulas you use to actually do the calculations in a small neural net, about how the weights are going to change. The main thing I tried to emphasize is that when you have a neural net like this one, everything is sort of divided into columns: you can't have the performance, based on this output, affect some weight change back here without going through this finite number of output variables, the y's. By the way, there's no y2 and y3; dealing with this is really a notational nightmare, and I spent a lot of time yesterday trying to clean it up a little bit. But what I'm trying to say has nothing to do with the notation I've used. Rather, it has to do with the fact that there's a limited number of ways in which that output can influence this weight back here, even though the number of paths through the network can grow exponentially.

So those equations underneath are equations derived from trying to figure out how the output performance depends on some of these weights back here. What I have calculated is the dependence of the performance on w1 going that way, and I have also calculated the dependence of the performance on w1 going the other way. So that's one of the equations I've got down there. Another one deals with w3, and it involves going both this way and this way. All I've done, in all four cases, is take the partial derivative of performance with respect to those weights and use the chain rule to expand it.
When I do that, this is the stuff I get, and that's just a whole bunch of partial derivatives. But if you look at it and let it sing to you a little bit, what you see is that there's a lot of redundancy in the computation. So, for example, this guy here, the partial of performance with respect to w1, depends on both paths, of course, but look at the first elements here, these guys right here, and look at the first elements in the expression for calculating the partial derivative of performance with respect to w3: these guys are the same. Not only that, if you look inside these expressions at this particular piece here, you see that it's an expression that was needed in order to calculate the changes to one of the downstream weights, and it happens to be the same thing you see over here. Likewise, this piece is the same thing you see over here. So each time you move further and further back from the output toward the inputs, you're reusing a lot of computation that you've already done. I was trying to find a way to sloganize this, and what I came up with was "what's done is done and cannot be"... no, that's not quite right, is it? It's "what's computed is computed and need not be recomputed." Okay, so that's what's going on here, and that's why this is a calculation that's linear in the depth of the neural net, not exponential in the number of paths.
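To make the slogan concrete, here is a minimal sketch of that reuse, not the code behind the class demos: a plain NumPy backpropagation pass for a small fully connected net of sigmoid units, where the layer sizes, the variable names, and the function names forward and backprop are illustrative choices of my own. The thing to notice is that each layer's delta vector is computed exactly once from the layer downstream of it and then reused.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, x):
    """Forward pass: returns the activation vector of every layer."""
    activations = [x]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))
    return activations

def backprop(weights, activations, d_perf_d_output):
    """One backward pass.  Each layer's delta is computed once from the layer
    downstream of it, so what's computed is computed and need not be
    recomputed; the work is linear in depth, not exponential in paths."""
    L = len(weights)
    a = activations
    delta = d_perf_d_output * a[L] * (1 - a[L])     # delta at the output layer
    grads = [None] * L
    for k in range(L - 1, -1, -1):
        grads[k] = np.outer(delta, a[k])            # gradient for weights[k]
        if k > 0:
            delta = (weights[k].T @ delta) * a[k] * (1 - a[k])   # reuse downstream delta
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]     # a tiny 3-4-2 net
acts = forward(weights, rng.random(3))
grads = backprop(weights, acts, d_perf_d_output=np.ones(2))
```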
There's another thing I wanted to point out in connection with these neural nets, and it has to do with what happens when we look at a single neuron. Note what we've got: a bunch of weights multiplied by a bunch of inputs, like so, and those are all summed up in a summing box before they enter some kind of nonlinearity, in our case a sigmoid function. If I ask you to write down the expression for the value we've got there, what is it? It's just the sum of the w's times the x's. What's that? That's the dot product. Remember, a few lectures ago I said that some of us believe the dot product is a fundamental calculation that takes place in our heads. This is why we think so: if neural nets are doing anything like this, then there's a dot product between some weights and some input values.
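As a small illustration, and not anything taken from the class demo, here is what a single sigmoid neuron amounts to in code; the function name neuron, the threshold, and the sample numbers are made up for the example.

```python
import numpy as np

def neuron(weights, inputs, threshold):
    """A single neuron: a dot product of weights and inputs, shifted by a
    threshold, then pushed through the sigmoid nonlinearity."""
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, inputs) - threshold)))

# Works the same whether the inputs are all-or-none (0s and 1s) or graded values.
print(neuron(np.array([0.5, -0.3, 0.8]), np.array([1, 0, 1]), threshold=0.2))
print(neuron(np.array([0.5, -0.3, 0.8]), np.array([0.9, 0.1, 0.7]), threshold=0.2))
```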
Now, it's a funny kind of dot product, because in the models we've been using, these input variables are all-or-none, 0 or 1. But that's okay; I have it on good authority that there are neurons in our heads for which the values produced are not exactly all-or-none but rather have a kind of proportionality to them, so you get a real dot-product type of operation out of that. So those are a couple of asides I wanted to underscore before we get into the center of today's discussion, which will be to talk about so-called deep nets.

Now, what does a deep net do? Well, from last time you know that a deep net does that sort of thing, and it's interesting to look at some of the offerings here. By the way, how good was this performance in 2012? Well, if you count it as right whenever the correct answer was in the system's top five choices, the error rate was about 15 percent. If you say it only gets it right when the correct answer was its top choice, then the error rate was about 37 percent. So, pretty good, especially since some of these things are highly ambiguous even to us. And what kind of system did that? Well, it wasn't one that looked exactly like that, although that is the essence of it; the system actually looked like this, and there's quite a lot of stuff in there. What I'm going to talk about is not exactly this system; I'm going to talk about the stuff of which such systems are made.
There's nothing particularly special about this one; it just happens to be a particular assembly of components that tends to reappear whenever anyone does this sort of neural net stuff. So let me explain it this way. The first thing I need to talk about is a concept that, well, I don't like the term, but it's called convolution. I don't like the term because in the second-best course at the Institute, Signals and Systems, you learn about impulse responses and convolution integrals and stuff like that, and this hints at that, but it's not the same thing, because there's no memory involved in what's going on as these signals are processed. But they call them convolutional neural nets anyway.

So here you are, you've got some kind of image. Even with lots of computing power and GPUs and all that sort of stuff, we're not talking about images with millions of pixels; we're talking about images that might be 256 pixels on a side, say. We're not talking about images that are a thousand by a thousand or four thousand by four thousand or anything like that; they tend to be compressed into a 256 by 256 image. Now what we do is run over this with a neuron that is looking only at a 10 by 10 square, like so, and that produces an output. Next we run over it again, having shifted the neuron a little bit, like so, and then we shift it again, so I get that output right there. Each of those deployments of a neuron produces an output, and that output is associated with a particular place in the image. This process is called convolution, as a term of art. This convolution operation results in a bunch of points over here, and the next thing we do with those points is look in local neighborhoods and see what the maximum value is. We take that maximum value and construct yet another mapping of the image over here using it. Then we slide that over, like so, and produce another value; then we slide it over one more time, in a different color, and now we've got yet another value.
This process is called pooling, and because we're taking the maximum, this particular kind of pooling is called max pooling. Now let's see what's next. This is taking a particular neuron, a kernel, we call it, again borrowing some terminology from signals and systems, and running it across the image. But maybe there are lots of kernels; we can use a whole bunch of them, so the thing I produce with one kernel can now be repeated many times, like so; in fact a typical number is a hundred. So now what we've got is a 256 by 256 image, we've gone over it with a 10 by 10 kernel, we've taken the maximum of values that are in the vicinity of each other, and we've repeated that a hundred times. Now we can take all those results and feed them into some kind of neural net, perhaps with a fully connected job on the final layers, and in the ultimate output we get some sort of indication of how likely it is that the thing being seen is, say, a mite. So that's roughly how these things work.
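Here is a rough sketch of those two operations in plain NumPy, assuming a random kernel rather than a learned one and ignoring strides, padding, and GPUs; the names convolve and max_pool are my own labels for illustration, not a particular library's API.

```python
import numpy as np

def convolve(image, kernel):
    """Slide a small kernel over the image; each placement yields one value
    associated with a particular place in the image.  ('Convolution' only in
    the neural-net sense: no memory, no flipping.)"""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Look in local neighborhoods and keep only the maximum value."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size
    blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

image = np.random.rand(256, 256)           # a 256-by-256 image, as in the lecture
kernel = np.random.randn(10, 10)           # one 10-by-10 kernel; a real net learns around 100 of them
feature_map = max_pool(convolve(image, kernel))   # a pooled map to feed into later layers
```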
So what have we talked about so far? We've talked about pooling and we've talked about convolution, and now we can talk about some of the good stuff. But before I get into that, this is what we can do now, and you can compare it with what was done in the old days, before massive amounts of computing became available, a kind of neural net activity that's a little easier to see. In the old days you might only have enough computing power to deal with a small grid of picture elements, so-called pixels, and each of those might be a value fed as an input into some kind of neuron. So you might have a column of neurons looking at the pixels in your image, then a small number of columns that follow from that, and finally something that says this neuron is looking for things like the numeral one, that is to say, something that looks like a one in the image. So this stuff up here is what you can do when you have a massive amount of computation, relative to the kind of thing you used to see in the old days.

So what's different? Well, instead of a few hundred parameters we've got a lot more. Instead of ten digits we have a thousand classes. Instead of a few hundred samples we have maybe a thousand examples of each class, and that makes a million samples. And we've got sixty million parameters to play with. The surprising thing is that the net result is a function approximator that astonishes everybody, and no one quite knows why it works, except that when you throw an immense amount of computation into this kind of arrangement, it's possible to get performance that no one expected would be possible. So that's sort of the bottom line.

But now there are a couple of ideas beyond that which I think are especially interesting, and I want to talk about them. The first especially interesting idea is the idea of autocoding.
Here's how the idea of autocoding works; I'm going to run out of board space, so I think I'll do it right here. You have some input values, and they go into a layer of neurons, the input layer. Then there is a so-called hidden layer that's much smaller, so maybe, in the example, there'll be ten neurons here and just a couple here. And then these expand to an output layer, like so. Now we can take the output layer, z1 through zn, and compare it with the desired values, d1 through dn. You follow me so far? Now the trick is to say, well, what are the desired values? Let's let the desired values be the input values. So what we're going to do is train this net up so that the output is the same as the input. What's the good of that? Well, we're forcing everything down through this necked-down piece of network. So if this network is going to succeed in taking all the possibilities here and cramming them into this smaller inner layer, this so-called hidden layer, such that it can reproduce the input at the output, it must be doing some kind of generalization of the kinds of things it sees on its input. That's a very clever idea, and it's seen in various forms in a large fraction of the papers that appear on deep neural nets.
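Here is a minimal sketch of the autocoding idea with the same shapes as the upcoming demo, ten inputs squeezed through three hidden neurons, assuming sigmoid units, squared error, and some made-up shadow-like input patterns; none of this is the actual demo code.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_in, n_hidden = 10, 3                       # 10 inputs squeezed through 3 hidden neurons
W1 = rng.normal(0, 0.5, (n_hidden, n_in))    # input layer  -> hidden layer
W2 = rng.normal(0, 0.5, (n_in, n_hidden))    # hidden layer -> output layer
rate = 0.5

# Three crude "shadow height" patterns standing in for short, medium, and tall.
samples = [np.r_[np.ones(k), np.zeros(n_in - k)] for k in (3, 6, 9)]

for _ in range(10000):
    x = samples[rng.integers(len(samples))]
    h = sigmoid(W1 @ x)                      # the narrow hidden layer
    z = sigmoid(W2 @ h)                      # the output layer
    d = x                                    # the desired values ARE the input values
    delta_out = (z - d) * z * (1 - z)        # squared-error delta at the output
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    W2 -= rate * np.outer(delta_out, h)      # plain backprop, nothing fancy
    W1 -= rate * np.outer(delta_hid, x)
```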
But now I want to show you a demonstration. We don't have GPUs and we don't have three days to do this, so I'm going to make up a very simple example that's reminiscent of what goes on here but involves hardly any computing. What I'm going to imagine is that we're trying to recognize animals from how tall they are, from the shadows they cast. We're going to recognize three animals, a cheetah, a zebra, and a giraffe, and they will each cast a shadow on a blackboard, like me; no vampires involved here. What we're going to do is use the shadow as the input to a neural net.

All right, let's see how that would work. There is our network, and if I just click on one of these test samples, that's the height of the shadow that a cheetah casts on a wall. There are 10 input neurons corresponding to each level of the shadow; they're run through 3 inner-layer neurons, and from there it spreads out and becomes the output-layer values. We're going to compare those output-layer values to the desired values, but the desired values are the same as the input values. So this column is a column of input values, and on the far right we have our column of desired values. We haven't trained this thing yet; all we've got is random values in there. So if we run the test samples through, we get that, and that: yeah, cheetahs are short, zebras are medium height, and giraffes are tall, but our output is just pretty much 0.5 for all of those shadow heights. All right, no training so far, so let's run this thing. We're just using simple backprop, just like on the world's simplest neural net, and it's interesting to see what happens: you see all those values changing. Now, I need to mention that when you see a green connection, that means a positive weight, and the intensity of the green indicates how positive it is; the red ones are negative weights, and the intensity of the red indicates how negative it is. So here you can see that from our random initialization we still have a variety of red and green values; we haven't really done much training, so everything, correctly, looks pretty much random. So let's run this thing, and after only a thousand iterations of going through these examples and trying to make the output the same as the input, we've reached a point where the error rate has dropped. In fact, it's dropped so much that it's interesting to look at the test cases again. Here's a test case where we have a cheetah, and now the output value is in fact very close to the desired value in all the output neurons. If we look at another one, once again there's a correspondence in the right two columns, and if we look at the final one, yeah, there's a correspondence in the right two columns.
Now if you back up from this and ask, well, what's going on here, it turns out that you're not training this thing to classify animals; you're training it to understand the nature of the things it sees in the environment. All it sees is the height of a shadow; it doesn't know anything about the classifications you're going to try to get out of that. All it sees is that there's a kind of consistency in the data it sees on its input values. All right, now you might say, okay, that's cool, because what must be happening is that the hidden layer, since everything is forced through that narrow pipe, must be doing some kind of generalization. So it ought to be the case that if we click on each of those neurons, we ought to see it specialized to a particular height, because that's the sort of stuff presented on the input. Well, let's go see what in fact is the maximum stimulation to be seen on the neurons in that hidden layer. When I click on these guys, what we're going to see is the input values that maximally stimulate that neuron, and by the way, I have no idea how this is going to turn out, because the initialization is all random. Well, that's good, that one looks like it's generalized the notion of short. That one doesn't look like medium; in fact, the maximum stimulation doesn't involve any stimulation from that lower neuron. Here, look at this one; that doesn't look like tall. So we've got one that looks like short and two that just look really random. So maybe we'd better back off the idea that what's going on in that hidden layer is generalization and say that what's going on in there is the encoding of a generalization. It doesn't look like an encoding we can see, but there is an encoding, there is a generalization. Let me say that again: we don't see the generalization in the maximally stimulating values; what we have instead is some kind of encoded generalization. And the fact that this stuff is encoded is what makes these neural nets so extraordinarily difficult to understand. We just don't understand what they're doing. We don't understand why they can recognize a cheetah.
We don't understand why it can recognize a school bus in some cases but not in others, because we don't really understand what these neurons are responding to. Well, that's not quite true; there's been a lot of work recently on trying to sort that out, but there's still a lot of mystery in this world. In any event, that's the autocoding idea. It comes in various guises; sometimes people talk about Boltzmann machines and things of that sort, but it's basically all the same sort of idea. And you can do this layer by layer: once you've trained the input layer, you can use that layer to train the next layer, and then that can train the next layer after that. It's only at the very end that you say to yourself, well, now I've accumulated a lot of knowledge about the environment and what can be seen in it; maybe it's time to get around to using some samples of particular classes and training on classes. So that's the story on autocoding.
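A rough sketch of that layer-by-layer procedure, under the same sigmoid-and-squared-error assumptions as before and with made-up data; the helper name train_autoencoder and the layer sizes are illustrative choices, not a standard API.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(data, n_hidden, steps=5000, rate=0.5):
    """Train one encode/decode pair so the output reproduces the input;
    return the encoder weights and the encoded (hidden-layer) data."""
    n_in = data.shape[1]
    W_enc = rng.normal(0, 0.5, (n_hidden, n_in))
    W_dec = rng.normal(0, 0.5, (n_in, n_hidden))
    for _ in range(steps):
        x = data[rng.integers(len(data))]
        h = sigmoid(W_enc @ x)
        z = sigmoid(W_dec @ h)
        delta_out = (z - x) * z * (1 - z)
        delta_hid = (W_dec.T @ delta_out) * h * (1 - h)
        W_dec -= rate * np.outer(delta_out, h)
        W_enc -= rate * np.outer(delta_hid, x)
    return W_enc, sigmoid(data @ W_enc.T)

# Train the first layer on the raw data, then use its codes to train the next layer.
data = rng.random((50, 10))
W1, codes1 = train_autoencoder(data, 6)
W2, codes2 = train_autoencoder(codes1, 3)
# Only at the very end would labeled samples be used to train a classifier on codes2.
```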
Now the next thing to talk about is that final layer, so let's see what the final layer might look like. It might look like this: there's a summer; there's a minus one up here; there's a multiplier here and a threshold value T there. Likewise there are some other input values here; let me call this one x, and it gets multiplied by some weight w, and that goes into the summer as well. That in turn goes into a sigmoid that looks like so, and finally you get an output, which we'll call z. So if you just write out the value of z as it depends on those inputs, using the formula we worked with last time, what you see is that z = 1 / (1 + e^-(wx - T)). That's a sigmoid function that depends on the value of that weight and on the value of that threshold.
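Written as code, the dependence of that output on w and T looks like this; the sample values of w and T are arbitrary and just illustrate the shift and the steepening mentioned next.

```python
import numpy as np

def output_neuron(x, w, T):
    """z = 1 / (1 + e^-(w*x - T)): the final-layer sigmoid as a function of
    its weight w and threshold T."""
    return 1.0 / (1.0 + np.exp(-(w * x - T)))

xs = np.linspace(-5, 5, 11)
print(output_neuron(xs, w=1.0, T=0.0))   # the ordinary sigmoid
print(output_neuron(xs, w=1.0, T=2.0))   # a larger T shifts the curve over
print(output_neuron(xs, w=3.0, T=0.0))   # a larger w makes the transition steeper
```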
Let's look at how those values might change things. Here we have an ordinary sigmoid. What happens if we shift it with the threshold value? If we change that threshold value, it's going to shift the place where the sigmoid comes down; a change in T could cause this thing to shift over that way. And if we change the value of w, that could change how steep this guy is. So we might think that, since the performance depends on w and T, they should be adjusted in such a way as to make the classification do the right thing. But what's the right thing? Well, that depends on the samples we've seen. Suppose, for example, that this is our sigmoid function and we see some positive examples of a class with values that lie at that point, that point, and that point, and we have some values that correspond to situations where the class is not one of the things associated with this neuron; in that case what we see is examples that are over on this side, over here. The probability that we would see this particular guy in this world is associated with the value on the sigmoid curve. So you can think of this as the probability of that positive example, and this as the probability of that positive example, and this as the probability of that positive example. What's the probability of this negative example? Well, it's one minus the value on the curve, and this one is one minus the value on the curve. So we could go through the calculations, and what we would determine is that to maximize the probability of seeing this data, this particular stuff, in a set of experiments, we would have to adjust T and w so as to get this curve doing the optimal thing. There's nothing mysterious about it; it's just more partial derivatives and that sort of thing. But the bottom line is that the probability of seeing this data depends on the shape of this curve, the shape of this curve depends on those parameters, and if we want to maximize the probability of having seen this data, we have to adjust those parameters accordingly. Let's have a look at a demonstration.
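Here is a small sketch of that maximum-likelihood adjustment for the one-dimensional case, with made-up positive and negative sample positions; it does gradient ascent on the log probability of the data, with p(x) = 1/(1 + e^-(wx - T)).

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

pos = np.array([2.0, 2.5, 3.5])    # positions of the positive examples (made up)
neg = np.array([-1.0, 0.0])        # positions of the negative examples (made up)
w, T, rate = 1.0, 0.0, 0.1

for _ in range(2000):
    p_pos = sigmoid(w * pos - T)   # curve height at each positive example
    p_neg = sigmoid(w * neg - T)   # curve height at each negative example
    # Gradient ascent on  sum(log p_pos) + sum(log(1 - p_neg)),
    # the log probability of having seen exactly this data.
    dw = np.sum((1 - p_pos) * pos) - np.sum(p_neg * neg)
    dT = -np.sum(1 - p_pos) + np.sum(p_neg)
    w, T = w + rate * dw, T + rate * dT

print(w, T)                        # the curve has shifted and steepened to fit the data
```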
Okay, so there's an ordinary sigmoid curve; here are a couple of positive examples, and here's a negative example. Let's put in some more positive examples over here, and now let's run a good old gradient descent algorithm on that. This is what happens: you see how, as we adjust the shape of the curve, the probability of seeing those examples of the class goes up and the probability of seeing the non-example goes down. What if we put some more examples in? If we put a negative example there, not much is going to happen, but if we put a positive example right there, then we start seeing some dramatic shifts in the shape of the curve; that's probably a noise point. We can put some more negative examples in and see how that adjusts the curve. All right, so that's what we're doing: we're viewing this output value as something that's related to the probability of seeing a class, and we're adjusting the parameters on that output layer so as to maximize the probability of the sample data we've got in hand.

All right, now there's one more thing. What we've got here is the basic idea of back propagation, which has layers and layers of additional, let me be flattering and call them ideas, layered on top. So here's the next idea that's layered on top. We've got an output value here, and it's a function, after all, and it's got a value. If we have a thousand classes, we're going to have a thousand output neurons, and each is going to be producing some kind of value. We can think of that value as a probability, but I don't want to write probability; I just want to say that what we've got for this output neuron is a function of class 1, and there will now be another output neuron which is a function of class 2. This one will presumably be higher if we are in fact looking at class 1, and this one down here will be higher if we're looking at class n.
What we'd like to do is not just pick one of these outputs and say, well, you've got the highest value, so you win. What we want to do instead is associate some kind of probability with each of the classes, because after all we want to do things like find the most probable five. So what we do is say that the actual probability of class 1 is equal to the output of that sigmoid function divided by the sum over all of the output functions. That takes that entire output vector and converts each output value into a probability. When we used that sigmoid function, we did it with a view toward thinking of the output as a probability; in fact we assumed it was a probability when we made this argument. But in the end, what we get for each of those classes is not exactly a probability until we divide by a normalizing factor. This, by the way, is called, it's not on my list of things, but it soon will be: since we're not talking about taking the maximum and using that to classify the picture, what we're going to do is use what's called softmax. We're going to give a range of classifications and associate a probability with each, and that's what you saw in all of those samples: yes, this is a container ship, but maybe it's also this, that, or a third or fourth or fifth thing. So that's a pretty good summary of the kinds of things that are involved.
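Here is a small sketch of that normalization, with made-up final-layer values; note that the standard softmax used in libraries exponentiates the raw scores before normalizing, rather than running them through sigmoids, but the purpose, turning the output vector into class probabilities, is the same.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def to_probabilities(raw_scores):
    """Divide each output-neuron value by the sum over all of them, so the
    whole output vector becomes a probability distribution over the classes."""
    outputs = sigmoid(raw_scores)
    return outputs / np.sum(outputs)

raw = np.array([3.0, 1.0, -0.5, -2.0])        # hypothetical final-layer values
probs = to_probabilities(raw)
top5 = np.argsort(probs)[::-1][:5]            # a ranked top five, not just one winner
print(probs, probs.sum(), top5)
```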
But now we've got one more step, because what we can do now is take this output-layer idea, the softmax idea, and put it together with the autocoding idea. We've trained the middle layer up, and now we're going to detach it from the output layer but retain the weights that connect the input to the hidden layer. When we do that, what we see is something that looks like this: a trained first layer but an untrained output layer. We're going to freeze the input layer and train the output layer using that so-called sigmoid curve. Let's see what happens when we do that. Oh, by the way, let's run our test samples through first; you can see it's not doing anything, the output is a half for each of the categories, even though we've got a trained middle layer. So we have to train the output layer; let's see how long it takes. Well, that was pretty fast; now there's an extraordinarily good match between the outputs and the desired outputs. So that's a combination of the autocoding idea and the softmax idea.

There's one more idea that should be mentioned, and that's the idea of dropout. The plague of any neural net is that it gets stuck in some kind of local maximum. It was discovered that these things train better if on every iteration you flip a coin for each neuron, and if the coin comes up tails, you assume it has just died and has no influence on the output. That's called dropping out those neurons, and in the next iteration you drop out a different set. What this seems to do is prevent the thing from getting stuck in a frozen local-maximum state.
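Here is the coin-flipping idea as a sketch, with made-up activations and a drop probability of one half; production implementations usually also rescale the surviving activations, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout_mask(layer_size, p_drop=0.5):
    """Flip a coin for each neuron; 'tails' means it dies for this iteration
    and has no influence on the output."""
    return (rng.random(layer_size) >= p_drop).astype(float)

# Inside one training iteration, a hidden layer's activations get masked;
# the next iteration draws a fresh mask, so a different set drops out.
h = np.array([0.2, 0.9, 0.4, 0.7, 0.1])
print(h * dropout_mask(h.size))
```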
So those are deep nets. They should, by the way, be called wide nets, because they tend to be enormously wide but rarely more than ten layers, ten columns, deep. Now let's see where to go from here. Maybe what we should do is talk about the curiosity in the current state of the art, which is that all of this sophistication, with output layers that are probabilities and training using autocoding or Boltzmann machines, doesn't seem to help much relative to plain old back propagation. Back propagation with a convolutional net seems to do just about as well as anything.

While we're on the subject of ordinary deep nets, I'd like to examine a situation here where we have a deep net. Well, it's not a real one; it's a classroom deep net, and we'll put five layers in there. Its job is still the same: to classify an animal as a cheetah, a zebra, or a giraffe based on the height of the shadow it casts. As before, green means a positive weight and red means a negative one, and right at the moment we have no training, so if we run our test samples through, the output is always a half no matter what the animal is. All right, so what we're going to do is just use ordinary backprop on the same problem as in that example underneath the blackboard, only now we've got a lot more parameters: we've got five columns, and each one of them has nine or ten neurons in it. So let's let this one run. Now look at that stuff on the right; it's all turned red. At first I thought this was a bug in my program, but it makes absolute sense: if you don't know what the actual animal is going to be, and there are a whole bunch of possibilities, you'd better just say no for everybody, just like when the biologist says "we don't know"; it's the most probable answer. But eventually, after about 160,000 iterations, it seems to have got it. Let's run the test samples through: it's doing great. Let's do it again, just to see if this is a fluke.
All right, on the right side you finally start seeing some changes in the final layers there, and if you look at the error rate down at the bottom, you see that it kind of falls off a cliff: nothing happens for a really long time, and then it falls off a cliff. Now, what would happen if this neural net were not quite so wide? Good question. But before we get to that question, what I'm going to do is a funny kind of variation on the theme of dropout. I'm going to kill off one neuron in each column and then see if I can retrain the network to do the right thing; I'm going to reassign those neurons to some other purpose. So now there's one fewer neuron in each column; we rerun that, and we see that it trains itself up very fast. We seem to be still close enough to a solution that we can do without one of the neurons in each column. Let's do it again: yeah, the error goes up a little bit, but it quickly falls back down to a solution. Try again: it quickly falls down to a solution. Oh my god, how many times am I going to do this? Each time I knock something out and retrain, it finds a solution very fast. Well, I got all the way down to two neurons in each column and it still has a solution. That's interesting, I think, but let's repeat the experiment, and this time we're going to do it a little differently. I'll take our five layers, and before we do any training, I'm going to knock out all but two neurons in each column. Now, I know that with two neurons in each column there's a solution; I just showed you one. Let's run it this way. It looks like increasingly bad news: what's happened is that this sucker has got itself into a local maximum. So now you can see why there's been a breakthrough in this neural net learning stuff: it's because when you widen the net, you turn local maxima into saddle points. So now it's got a way of crawling its way through this vast space without getting stuck on a local maximum, as suggested by this.
All right, those are some interesting things to look at by way of these demonstrations, but now I'd like to go back to my slide set and show you some examples that address the question of whether these things see like we see. You can try these examples online; there are a variety of websites that allow you to put in your own picture, and there's a cottage industry of producing journal papers on fooling neural nets. In this case a very small number of pixels have been changed; you don't see the difference, but it's enough to take this particular neural net from high confidence that it's looking at a school bus to thinking that it's not a school bus. These are some things that it thinks are a school bus. So it appears that what is triggering the school-bus result is that it's seeing enough local evidence that this is not one of the other 999 classes, and enough positive evidence from these local looks, to conclude that it's a school bus. Do you see any of those things? I don't. Here, you can say, okay, look at that baseball one; yeah, that looks like it's got a little bit of baseball texture in it, so maybe what it's doing is looking at texture. These are some examples from a recent and very famous paper by Google, which used essentially the same ideas to put captions on pictures. This, by the way, is what has stimulated all this enormous concern about artificial intelligence.
Because a naive viewer looks at that picture and says, oh my god, this thing knows what it's like to play, or to be young, or to move, or what a frisbee is. Of course it knows none of that; it just knows how to label this picture. And to the credit of the people who wrote this paper, they show examples that don't do so well: yeah, it's a cat, but it's not lying down; yes, it's a little girl, but she's not blowing bubbles. What about this one? We've been doing our own work on some of this in my laboratory, and the way the following set of pictures was produced was this: you take an image and separate it into a bunch of slices, each representing a particular frequency band, and then you go into one of those frequency bands and knock out a rectangle from the picture, and then you reassemble the thing. If you hadn't knocked that piece out, when you reassembled it, it would look exactly like it did when you started. So what we're doing is knocking out as much as we can while still retaining the neural net's impression that it's the thing it started out thinking it was. So what do you think this is? It's identified by the neural net as a railroad car, because this is the image it started with. How about this one? That's easy, right? That's a guitar; we weren't able to mutilate that one very much and still retain the guitar-ness of it. How about this one? What's that? A lamp? Any other ideas? Ken, what do you think it is? See, he's an expert on this subject. It's identified as a barbell. What's that? A cello? You didn't see the little girl or the instructor? How about this one? What? A grasshopper? What's this? Wow, you're good; it's actually not a two-headed wolf, it's two wolves that are close together. That's a bird, right? Good for you, it's a rabbit. What's that? A Russian wolfhound. And if you've been to Venice, you recognize this. So the bottom line is that these things are an engineering marvel and do great things, but they don't see like we see.