hi everyone welcome back to day two of MIT intro deep learning thank you all for coming back for day number two hope everyone enjoyed day one today we're going to be talking about one of my personal favorite topics in this entire course which is how we're going to give computers the ability to have a you know an ability that many of us have you know which is the ability of sight and the ability to sense Vision so some background first of all so what is Vision vision is you know one of the most important human
senses that we all have it allows us to uh do everything from you know detect and interpret human emotion and facial expressions to navigate the world to manipulate objects and manipulate our world around us and today you know what we're going to do in the class uh today is just understand fundamentally how we can give computers this exact same sense as well the ability to understand you know what is what is in the physical world around it just from raw visual inputs and I like to think of this you know very super simple definition of
this ability through this quote right here: it's the ability to know what is where by looking. Today's class will mostly focus on images as the thing we look at, but in general this is the broad sense of ability that we'll be focusing on. And vision is actually so much more than just understanding what is where; it's also about understanding a sense of planning as well, right, it's not a static sense only. When I show you this image here for instance, what you see
is you know first you see the static things you see the things like the cars on the side of the road you see the people but then most importantly you start to as you look closer at this image immediately what you'll feel yourself doing is also getting a sense of Dynamics as well you know not only understanding what is where but also how things are moving you understand that even though these two uh cars are on the road one of them probably is uh completely static right the white van on the left is probably completely
static, because it's parked on the side of the road. And even though these are both cars, and you can detect them both as cars, they are perceived very differently when we look at this image, same as the people standing on the side of the road versus the people actively moving. Vision is not just about understanding what these things are in the image, but about interpreting the image holistically: how things are moving, how things are changing temporally as well. And it
accounts for all of these details in the scene. We take this for granted when we look at these things, even minor things like the traffic lights in the background, very minor details that define the movement of even the cars further ahead. We can infer a lot about this image the closer we look. And deep learning is really starting to bring about a huge revolution in computer vision algorithms and their applications. This is changing everything from robotics, manipulation, and autonomy to mobile computing (everyone has a lot of computer vision models running on embedded hardware in their pockets), to biology and healthcare, making sure that we can detect things the way humans detect them from raw imagery and raw medical imagery, to autonomous driving and accessibility applications as well. The real theme you'll see throughout this class, and especially in today's lecture, is how pervasive these computer vision algorithms are; they are truly everywhere. One example that you'll get
some exposure with in today's lab is going to be on facial detection right this is one of the original computer vision applications that was extremely impactful in the community it just basically allows us to detect not just if there is a face there but you know also features about the face understanding micro emotions about the face that are extremely hard to detect and extremely hard to interpret and of course we've all heard a lot of Buzz and a lot of hype around self-driving cars this is a classic computer vision problem right what we're doing here
is taking imagery as input; you can actually see the image that comes in on the bottom right-hand side of this video, this is the raw camera image coming into the network. And this is an end-to-end network, a single model, which is actually very different from how most autonomous cars operate, which is as a pipeline of many smaller models and predefined maps. This was work that we did here at MIT, where we created an autonomous vehicle (it's actually located in the garage of this building) to drive fully autonomously on brand new, never-before-seen roads, purely from visual input data. And of course, we've already talked about this a little bit, but we'll see some applications of what you'll learn about today in biology, healthcare, medicine, and so on. So I'd now like to start diving into some of the technical basis of today's class, and I'll preface all of this by saying that, for all of the amazing things I showed in the previous slides, today's class will be one of the classes where
what we'll be training these computer vision algorithms to do are going to a lot of times be uh tasks that we as humans take for granted we do these things without even thinking about them and we do them with such ease and I think the the question that will really boil down to today's lecture is how we can take something that all of us take so for granted and distill that also into computers as well so in order to understand this first we're just going to start with the base most simple case which is understanding
just how a computer can process an image, or not even process it, but simply represent an image. Well, to a computer, images are just numbers. Actually, images are in some ways easier than language, because an image is already represented as an array: a two-dimensional list of numbers. So if we had, for example, this image of Abraham Lincoln that you can see on the slide, it's just made up of pixels. What is a pixel? A pixel is a single number, and since this is a grayscale image, every pixel is represented by just one number. If this were a color image, every pixel would be represented by three numbers, a composition of red, green, and blue values instead of a single gray intensity value. So we can represent this image as nothing more than a two-dimensional array of numbers, one number for every pixel, and that answers our first question of how a computer can see an image. If we have a color image, like I said, this is going to be not a two-dimensional array but a three-dimensional array: two spatial dimensions plus a third dimension, the color axis, which will be of length three for most images. This is how we'll represent our images for the rest of the class today.
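As a concrete illustration (a minimal sketch in NumPy; the pixel values here are made up for the example), a grayscale image is a 2D array of intensities and a color image is the same thing with a trailing channel axis of length three:

```python
import numpy as np

# A grayscale image is a 2D array of pixel intensities;
# a color image adds a third axis of length 3 (red, green, blue).
grayscale = np.array([[  0,  80, 255],
                      [ 34, 120, 200],
                      [ 12,  90, 180]], dtype=np.uint8)   # shape (H, W)

color = np.zeros((3, 3, 3), dtype=np.uint8)               # shape (H, W, 3)
color[0, 0] = [255, 0, 0]                                 # a single red pixel

print(grayscale.shape, color.shape)                       # (3, 3) (3, 3, 3)
```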
Now, there are two types of common machine learning, or deep learning, tasks that we'll be focusing on throughout this entire course, but especially for the purpose of today: the task of regression and the task of classification. Let's start with both of these. Number one, regression: regression allows us to output a continuous value from our system. Classification, on the other hand, allows us to output a class label. So for example, the image that you see here could be mapped into a classification problem: let's say we want to detect which US president this image is coming from, and the answer can be one of K different classes. These are discrete classes, and the prediction has to be one of them. So in order to correctly classify images like the one you see on the left-hand side, what does our model need to be able to do, or what do we need to be able to do ourselves, as humans? Well, we need to understand what is unique about every one of those classes on the right. What is unique about a picture of Abraham Lincoln that makes it different from a picture of George Washington? If we understand those unique features, then those
are the things that we can look for on the image on the left but the first step is actually we need to extract right and Define what are we looking for in each of those different classes right and that's this this task of feature detection right feature detection means nothing more than understanding what makes up one class different than another class right so another way to think about this is high level you know imagine classification is is just done by a two-step process number one defining what you're detecting in your images and then detecting those
things, and if you detect enough of them, then you classify it as that object. So for example, for detecting an image of a face, you may look for features like eyes, noses, and ears; if you detect enough of those in an image, then you can have some confidence that you're probably looking at a face. On the other hand, things like houses are defined by features like doors, windows, and steps; if you look for those things in images, you can have some confidence that you've identified a house if you successfully detect them. So if you're building an image classifier, it really boils down to nothing more than this: you first define your features, know what you're looking for, and then you look for those things, and if you find them, you can successfully classify your image. Now, this is easier said than done, because knowing what you want to detect is itself a big problem. Knowing that you want to detect, let's say, a face by looking for
eyes: now how do you detect eyes? It's a recursive problem, because you now have the same problem for detecting eyes that you had for detecting faces. It's a hierarchical problem, and beyond that, it's also a problem with so many different variations and adaptations. You can imagine that faces look completely different depending on the viewpoint, the scale, occlusions, lighting and illumination, and so on. There are so many different variations of a single image in pixel space (even the same object captured with two different cameras can appear very different), and this is why computer vision is so difficult for computers, even though we take it for granted. So the classification algorithms, the models that we build in this class, have to be invariant to all of these different changes and modifications in our feature space, in the things that we're looking for in our image, in order to build a very robust pipeline. If we don't have that level of adaptability, or invariance, in our features, then we won't be very robust in our detection. So it really boils down to this last bullet here: the detection of these features in order to classify is usually where classical machine learning breaks down and deep learning kicks in. In the classical machine learning world it's very difficult, because defining the features is a human process, whereas in deep learning we're going to say, OK, we're going to define
these features using data. Now what does that look like using a neural network? Instead of me, as a human, defining that I want to detect faces by looking for ears and eyes and noses, and then recursively defining one more step (how would I find an ear? I would look for, say, vertical lines followed by a circle underneath the line, something like that), I'm not going to define any of that as a human. I'm going to show my model a lot of images of faces, and it should be able to extract what is unique about all of these images of faces, to help me disentangle images of faces from pictures of, say, cars. What are the unique features, the unique patterns of pixels, that really differentiate these two things? And let's do that in a hierarchical fashion. The hierarchy here comes from the layers in my model: the depth of my model creates hierarchy, and every layer of depth composed on top of the others allows us to inject yet another
layer of complexity or expressivity like we saw yesterday right now neural networks all they do is allow us to learn this hierarchy implicitly as opposed to explicit definition instead of you know by constructing them manually by humans so in lecture one we learned about how we could use fully connected networks to do this type of task actually right this shouldn't sound so new so far because what we've been talking about is you know simply just a composition of a multi-layer network where every layer is composed from the outputs of the prior layer and every neuron
in every given layer is simply connected to every input in the prior layer. So let's say we wanted to use what we learned from lecture one on this computer vision problem; how could we do this? Let's actually try it out. In our case right now, unlike lecture one, we have a two-dimensional input. Previously we just had one-dimensional inputs of features coming into our model, but now we have two dimensions because we have an image. So let's feed that two-dimensional input into our fully connected network. What would this look like? Already we run into a problem, because what we have to do here is flatten our two-dimensional image into a one-dimensional input, since one-dimensional inputs are all our fully connected networks can handle; that's what we learned in lecture one. So the only way to use a fully connected network for an image processing task would be to conform the images into a shape that these fully connected networks are capable of processing. Now, hopefully everyone in the audience can understand and appreciate that 100% of the spatial information that was previously stored in this very rich data type of an image is now completely gone; we've destroyed all of that spatial information by flattening it. Additionally, we also have a lot of parameters in this model, because every single pixel in my input layer is connected to every single neuron in my hidden layer, and that's only one layer, which you then repeat depthwise many, many times. So what we have is an extremely expensive model for what is actually a suboptimal solution, where we've thrown out a lot of very valuable spatial information. Instead, what we want to do is ask this question on the bottom right: how can we preserve and leverage all of this really rich and unique spatial information present natively in our data to inform our network architecture? We don't want to throw all of this out for no reason;
we want to leverage it and amplify our capabilities with it. So to do this, let's actually represent our two-dimensional image in its true form: a two-dimensional array of pixel values. One way to use the spatial structure inherent in our input is to connect the input pixels to our neurons in patches. Previously we had connected everything to everything, but now we'll keep the two-dimensional structure and only connect things in our input pixel space to output neurons in a locally connected, patchwise fashion. So for example, this one neuron here will only see inputs from this patch of pixels on the left. This is repeated across the entire image: the patch you see in the top left-hand corner gives rise to this one output neuron, but the very next output neuron right next to it will be defined, or guided, by the next patch, shifted one step over. So everything is not connected to everything. Now, there is still a lot of shared information, because this neuron does share a lot of input pixels with this neuron; there is overlap there, but there's not complete fully-connectedness across the entire model, and we have successfully preserved a lot of spatial information. We have enforced, basically by definition, that neurons close to each other in the output layer will be derived from pixels that are close to each other in the input layer, and this is how we preserve
spatial information. Now, in practice, what we just talked about is nothing more than a mathematical operation called a convolution. So what is a convolution? Let's first think of this at a high level, and suppose we have what's represented here as a 4x4 filter; I'll use that word now. A filter is just this red box that you can see, and it consists of 16 weights arranged 4x4. So you'll have the 4x4 pixels that come from the image, and you'll also have the 4x4 weights that come from the filter, and this thing will slide over my image step by step and apply the same operation that we saw in lecture one: element-wise multiplication, add a bias, apply a nonlinearity. No difference here, except that everything is not connected to everything; you only do this at the filter level, just for these 4x4 patches as you move across the image. And then we shift our filter after every time we do this operation; we shift it over and over again, say by one or two pixels, so that we can scan across the entire image. So you're probably already wondering: the next layer of complexity on top of this is to really understand how the convolution operation actually allows us to extract these features. So far I've only defined how the convolution operation works, but how does it allow us to define what we started the lecture talking about, namely the things that we should look for in
our image, and how does the convolution operation allow us to extract that information? Let's make this concrete by walking through a couple of examples. Suppose we want to classify, or detect, X's from a set of black and white pixels. These are just images, two-dimensional arrays of numbers, but in this case they're not grayscale, they're only black and white, so every pixel is either a plus one or a minus one. And I want to build a model that can determine: does the left side equal the right side? So I have some examples of X's, and I see this new example on the right that is a bit of a skewed, rotated X. It's not a perfect X, so I cannot simply compare the pixels on top of each other; I have to detect them more intelligently. What we want to do is classify the things on the right as X's or not, even if they are shifted, scaled, shrunk, rotated, deformed, whatever the case may be; we want to still have a very reliable detection. So let's think about this: how can we do this using the same principles we just saw? We can do it by having our model compare the X's not globally, but patchwise. What if we looked at the patches of key interest, the things that define an X? There are actually only a few things that define an X: a diagonal going one way, a diagonal going the other way, and then a crossing between the two.
so here you can see three features that we're looking for we're looking for a diagonal going from left to right we're looking for a diagonal going right to left and we're looking for a crossing right and if we do this comparison patchwise actually you can see that even though these features these things that we're detecting they may be in different places so the green diagonal is not in the exact same place but it's very close but we are able to detect those features right and if our model can find rough matches on the patch level
then again we've successfully detected the things that define an X, which allows us to confidently say, OK, this is probably an X at the global level as well, if I can find enough of these patterns. So each feature is basically nothing more than a mini image: a miniature, small two-dimensional array of numbers, no different than the big image. And what we're going to do is use these filters, these convolutional filters, to detect and to define those patches. In this case the patches, or filters, represent, like we saw, diagonal lines and crossings of lines; these are the important characteristics of what makes up an X. But in practice we're going to want to capture these features for any new X that comes in, not just for this one X, so that they're robust to those types of perturbations. OK, so this is again no different than convolution, but I'm going to keep
hammering the point home and start to introduce some of the mathematical terminology as well, so you can familiarize yourself with how the intuition from the past few slides translates into the mathematics on the next few. So what does convolution do? The operation that does exactly what we've seen illustratively in the past few slides is convolution. What it does is preserve the spatial information present within the image by breaking the big image down into smaller sub-images and then searching for key pieces of information across those sub-images. It does this by doing an element-wise multiplication between what it's looking for and the original image, at every point along the image. So let's say it's looking for this diagonal going from left to right. The filter is going to be the 3x3 miniature image that you see on the top left (a 3x3 matrix, that is, a 2D matrix of size 3x3), and in this case all of the entries are minus one except for the diagonal, which is positive one, because we're looking for this diagonal line. We element-wise multiply this filter, the thing we're looking for, with every patch in the image. So we'll start right here: this is the true matching patch. We element-wise multiply the filter on the top left with the patch of pixels right here, and when we multiply every number in the filter with every number in this green box, we get another 3x3 matrix with everything inside of it equal to one, because there's a perfect match between every pixel in the top-left filter and every pixel in our image patch. Now that we have the result of our element-wise multiplication, we do exactly what we saw in lecture one: we take the results, add them all up, and pass them through a nonlinearity, which we'll see soon.
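Here's a tiny sketch of that perfect-match case in NumPy (the patch values are assumed for illustration):

```python
import numpy as np

# The diagonal-detecting filter: +1 on the diagonal we're looking for, -1 elsewhere.
filt = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])

patch = filt.copy()               # an image patch that contains exactly that diagonal

elementwise = filt * patch        # every entry is +1 when the patch matches the filter
response = elementwise.sum()      # add them all up: 9, the maximum possible response
print(elementwise)
print(response)
```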
Let's consider one more example before moving on. Suppose we want to compute the convolution of this 5x5 image in green with this 3x3 filter. To do this, we need to cover the entirety of our image with the filter, sliding it not just one time like we saw in the previous slide, but across the entire image, step by step, piece by piece, and at each step doing the same operation: element-wise multiplication and addition. So let's see what this looks like. First off, we start with the top left corner: we place that
yellow filter on top of the top left corner of our image, we element-wise multiply that 3x3 filter with the pixels in that 3x3 location, then we add up all the entries after the element-wise multiply, and we get the answer of four. We put that at this neuron location; this is the output of the top-left neuron of the first layer of our network. Then we slide over one step and do the exact same process again. Now there's a little bit less of an overlap with this filter, so what we see is that the output of the neuron right next to it is three. And we keep repeating this over and over again until we've slid over the entire image and filled in all of the patches. And that's it: what we've done is define an operation that goes entirely from an input image on the left and a filter in the middle to the output layer on the far right. Yes, what determines your choice of filter? Exactly, exactly. So far we haven't talked about this at all; all we've said is that we're defining an operation: given an image and given a filter, how to compute the output. And you're completely right that the next step is to say, OK, we want to learn those filters, because the filters actually dictate how good of an output answer you get. That's the tunable knob that dictates this entire process.
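To make the sliding-window computation concrete, here is a minimal sketch in NumPy (stride 1, no padding; the image values are placeholders, not the exact numbers on the slide):

```python
import numpy as np

# Slide a filter over the image; at each position, element-wise multiply and add up.
def convolve2d_valid(image, filt):
    fh, fw = filt.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + fh, j:j + fw]
            output[i, j] = np.sum(patch * filt)   # element-wise multiply, then add
    return output

image_5x5 = np.random.randint(0, 2, size=(5, 5))  # stand-in for the 5x5 image
filt_3x3 = np.array([[ 1, -1, -1],
                     [-1,  1, -1],
                     [-1, -1,  1]])               # e.g. the diagonal feature from before
feature_map = convolve2d_valid(image_5x5, filt_3x3)
print(feature_map.shape)                          # (3, 3): one output per filter position
```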
We'll get to that in the next part of the lecture. Yes, great question. And actually, to motivate this, there are a couple of things you can do. For a long time, people hand-engineered these filters; you don't even have to learn them. If you want to detect lines or edges, for example, there are very good filters that are easy to design by hand: a filter that detects something going from a very high brightness value to a very low brightness value, a type of derivative filter. In fact, for many years people hand-engineered filters to serve different purposes. Here you can see some examples: given an input image on the left, these are hand-engineered filters for sharpening images, for detecting edges, or for even stronger edge detection. If you look at these filters you can actually interpret them meaningfully: the sharpening filter amplifies the center and de-emphasizes the surroundings, so when it slides over the patches you get that sharpening effect, and the edge filter is again a form of a derivative, kind of the opposite effect. So simply by changing the weights of the filter, as the question implied, you entirely change the output of the next layer. The filter has a lot of importance, and defining your filter has a lot of impact on what you're looking for in your image.
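As a sketch of those hand-engineered filters, here are classic textbook 3x3 kernels (not necessarily the exact ones on the slide), applied with an off-the-shelf 2D convolution; the image here is a random stand-in:

```python
import numpy as np
from scipy.signal import convolve2d

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],     # amplify the center, subtract the neighbors
                    [ 0, -1,  0]])

edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1], # a discrete derivative-like (Laplacian) kernel
                        [-1, -1, -1]])

image = np.random.rand(64, 64)        # stand-in for a grayscale image

sharpened = convolve2d(image, sharpen, mode="same", boundary="symm")
edges = convolve2d(image, edge_detect, mode="same", boundary="symm")
print(sharpened.shape, edges.shape)   # both (64, 64)
```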
So hopefully now there's a very clear understanding not just of how we can compute the output of a convolution, but also of the importance of preserving all of this spatial structure and maintaining it through the convolution operation. Now that we've gotten the convolution operation under our belts, I think it's time for us to transition and say, OK, how can we... yes, go ahead. "What does the overlap with the zero mean? With one and minus one the multiplication made sense, and we get one, but when we have zero in the filter and zero in the image, what does that comparison mean?" It's still a comparison: if we're looking for zero in the image, it's zero times zero, so it's like a null, passive piece of information; you're not looking for anything in that location. A zero in your filter is basically saying that it contributes nothing to this detection. OK, so let's transition now and ask how we can understand not just what the convolution operation does, but how we can start to define the inputs
of the convolution operation. We already saw that the input image is well defined, we give it as an input to the model, but what about the filters themselves? To answer this, we can zoom out a little and see where we're going to end up: the entire convolutional neural network. Let's consider a simple CNN (a convolutional neural network) designed for image classification. The goal here is to learn these filters, or features (two words, same meaning), from image data. Given a bunch of images on the left-hand side, we want to learn the filters, the features, that can disambiguate the different classes that come out on the right-hand side. There are three main operations in CNNs, and we've already seen one of them. The first is convolutions: convolutions define this patchwise operation scanning across our images. The second, which we've also hinted at already, is the nonlinearities that come after each of our convolutions. And the third piece is pooling, which is just a downsampling operation that allows us to grow our receptive field: because if we're doing this thing patchwise over the image, we also need different scales of how big our patches effectively are, and by downsampling the image we're effectively enlarging the region of the original image that each patch covers. So we can train our model on a set of images, and in training, what we're going to do is learn those filters.
now we'll go into more depth on this but when we Define the model with this framework you know all we're doing here is you know via back propagation you define all the operations you show the data and then you can compute the optimal set of weights or the optimal set of filters that will optimally solve this task and you do this through optimization so let's go through each one of these operations just to break this down even further and see how this looks even more concretely so let's dig even deeper into the operation in the
CNN. As we saw before, each neuron in our hidden layer (the output of the first layer of our model) is computed as a weighted sum of our inputs with our filters: we apply a bias, add everything up, and apply our nonlinearity. What's special here, and I want to highlight this point, is the local connectivity: it allows every neuron in the hidden layer on the right to only see this patch of input pixels on the left. That's really important, because now we can really define this computation in math, moving on from the illustrations we saw before. For a neuron in the hidden layer, its inputs are the neurons in the patch in the previous layer. So what does that look like? We apply this matrix of weights (which we've been calling filters, or features; a 4x4 matrix in this case) to do the element-wise multiplication, we add up the outputs, apply the bias, and apply our nonlinearity, same story as before.
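Written out (a sketch of the computation just described, assuming a 4x4 filter with weights $w_{p,q}$, a bias $b$, and a nonlinearity $\sigma$), the hidden-layer neuron at spatial location $(i, j)$ computes:

$$z_{i,j} \;=\; \sigma\!\left(\sum_{p=1}^{4}\sum_{q=1}^{4} w_{p,q}\, x_{i+p-1,\;j+q-1} \;+\; b\right)$$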
And remember that this element-wise multiplication and addition is nothing more than convolution, literally just the convolution we talked about earlier. So this defines how neurons operate in the forward pass. But within a convolutional layer, we don't do this for only one filter: every layer does this for a whole set of filters. One layer may be looking for a wide variety of different types of edges (it may learn filters for a huge variety of edges), and then the next layer can use the detection of those many different
types of edges. So what you'll see on this slide is that you're not outputting just one two-dimensional feature map from your filter; you're outputting an entire volume, and that volume is derived not from just one filter but from an entire depthwise stack of filters. And again, this is just one layer of your convolutional neural network. One layer takes as input a volume (on the first layer, let's just call it an image), applies n different filters across it, and gives you back a new volume, which is that image convolved with those n filters. Now, the next step after we have that volume, as we saw before, is to pass it all through our nonlinearity. This gives us the nonlinear capacity and expressivity that we talked about and care about from the first two lectures. A common choice for images in particular is a nonlinearity like the ReLU activation, the rectified linear unit that you see here. Why might you want to do this? Because the
ReLU activation is basically a form of thresholding function: it gets rid of all negative values, squashing them to zero, and keeps everything positive as it is. So you can think of it as having a very intuitive meaning in computer vision, as a thresholding function: you just squash everything to be non-negative. Yes? So the convolution step detects features and the nonlinearity comes after, is that correct, and you don't do any more convolutions in that step? Yes, the convolutional step is to detect features within the input, and then the nonlinearity is to increase the expressivity of those detections, to make them nonlinear detections. You don't do any more convolutions within that one step, but you do have a deep network with many of those layers, so each layer does one convolutional operation and then you stack them sequentially, depthwise. Yes, another question, about the spatial size: no, actually convolutions will not necessarily change the spatial resolution of the image at all. As you can see here, it shouldn't need to change the spatial resolution. You can implement convolution in different ways, but in the example here, if you pass your filter over the image, filters are typically much smaller than the image, so you'll get an output resolution that is roughly the same size as your input image. Spatial resolution does not have to change as a function of convolution; there are ways to implement it where that isn't true, but we don't have to get into that now. Generally, though, you do change the spatial resolution when you want to do
this like let's say multiscale operation of detection of features so you do it one time and you're detecting features at one scale then you downscale this volume and then you repeat it again the next time you repeat it you're now detecting features at a bigger scale because you've now downscaled your volume right and this this repeated process is what allows you to detect features not just at the original size but repeatedly across many sizes right and and to also have those features Compound on each other right you may want your features to be hierarchical so
you want to say, OK, I don't want to only detect edges, very basic features; I want to compose them, this edge combined with that edge, and that's only possible if I take it from the output of the second convolution, not the first, because you want compositions of detections, not just the original detections. Yes, so the filters will always be different, because filters are learned, they're not defined by a human. Within a single layer and across layers, filters will always be different, because they are randomly initialized and then optimized from data. So when everything is said and done, you'll have a set of filters that have very little overlap with each other throughout your network (unless you have a huge network, in which case you must have some overlap), exactly, yes. We'll see some examples; actually, let me go forward a couple of slides, because this will become much more apparent very soon. Now, we touched on this already, but why do we do something after convolutions? We do this pooling operation. Pooling is
nothing more than downscaling our dimensionality. After every convolution we get an output set of detected features; we downscale those features and then repeat the convolution, because the next time we do the convolution we have a larger receptive field, since our input is now smaller. A common technique for this downsampling is called max pooling. This basically says: take a patch, for example a 2x2 patch on the left-hand side, pick the maximum value from that patch, and put that single maximum value into a 1x1 output. There are many ways to do pooling and dimensionality reduction; max pooling is only one. You can think of ways that are, for example, less harsh on derivatives. Max pooling is actually quite harsh, because gradients don't pass through three quarters of the pixels; you only get a derivative through the one pixel that achieves the maximum. If you did mean pooling instead, you would have stable gradients throughout the entire patch. So you can think of clever ways to do other forms of pooling as well; this is a very open point.
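A minimal sketch of 2x2 max pooling (and, for comparison, mean pooling) on a small feature map whose height and width are divisible by two:

```python
import numpy as np

def pool_2x2(feature_map, mode="max"):
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)   # group into 2x2 blocks
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]])
print(pool_2x2(x, "max"))    # [[4 2] [2 8]]
print(pool_2x2(x, "mean"))   # [[2.5  1.  ] [1.25 6.5 ]]
```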
OK, but getting back to this point of how we interpret and see the filters that are being learned: as the model trains, or after it trains, you can visualize those filters and actually look at what the convolutional layers are optimizing themselves to detect. So here's an example of a convolutional neural network learning to detect faces, and you see basically three different levels of filters being activated by this network. In the beginning it's looking for edges, changes in intensity value in different directions. As you move up in the model, you start to combine those filters and features that are being extracted hierarchically, and you start to see facial features emerge, things like eyes, noses, and ears, like we were talking about before. And then you move
up even more than this, and you start to see full facial structures emerge. Now, with CNNs, the key point that is different from classical machine learning is that we didn't define any of these filters ourselves. We learned these filters by showing the model a lot of data and asking it: what are the filters that would be best for you to use to get your accuracy of detecting faces as high as possible? We only penalize it when it's not able to detect faces, and we say, keep changing your filters until you're able to detect faces better, and this is what it comes up with. These are the filters it has learned in order to do this task most accurately. So roughly, this whole process breaks down into two parts: part number one is extracting features using those learned filters, and part number two is using those features to do some form of detection, in this case facial detection. So there are these two pieces. How can we take the features learned in the first phase of the model and use them to inform our classification phase? We can do this with a function called the softmax function. What is a softmax function? It's just a function that takes our n outputs at the last layer and squashes them to sum to one. Why do we do this? Because we care about classification in this case, and if you want to classify something you need a probability distribution, and a probability distribution has to have mass that sums to one.
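A minimal sketch of the softmax function in NumPy (squashing n output scores into a probability distribution that sums to one):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())         # probabilities that sum to 1.0
```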
Now let's put all of this together in code to build our first end-to-end convolutional neural network. We can start by defining the two parts of the network that we saw in the previous slide. The first part is the feature extraction part, the part with all of the filters. It's going to have 32 feature maps, as you can see right here, so the first volume that comes out of this first layer will have a depth of 32 features, corresponding to 32 filters learned in this first layer. We then downsample the spatial information and pass it on to another convolutional layer that takes those 32 features and learns another 64 features on top of them, at a smaller spatial dimension. We then flatten all of this: now we can remove the spatial information and go into a one-dimensional space, because we actually want to make a one-dimensional prediction for classification. We flatten everything because we've already learned our spatial information, and then we project through a softmax function that lets us predict a probability distribution over our final, let's say, 10 classes; if we're trying to predict digits, we'd have classes zero through nine, ten classes total.
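Here is a minimal sketch of the kind of model being described, written in PyTorch and assuming 28x28 grayscale inputs (e.g. handwritten digits) and 10 output classes; the exact layer sizes on the slide may differ:

```python
import torch.nn as nn

cnn = nn.Sequential(
    # part 1: feature extraction -- learned filters + nonlinearity + downsampling
    nn.Conv2d(1, 32, kernel_size=3),    # 32 learned 3x3 filters -> 32 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                    # downsample spatial resolution by 2
    nn.Conv2d(32, 64, kernel_size=3),   # 64 filters learned on top of the 32
    nn.ReLU(),
    nn.MaxPool2d(2),
    # part 2: classification -- flatten and project to class scores
    nn.Flatten(),
    nn.Linear(64 * 5 * 5, 10),          # 28 -> 26 -> 13 -> 11 -> 5 spatially
    nn.Softmax(dim=1),                  # probability distribution over 10 classes
    # (in practice you'd usually drop the Softmax and train with CrossEntropyLoss)
)
```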
In PyTorch you'll see a very similar version of this code, almost identically mirroring what you saw before: again we define the convolutional layers, 32 followed by 64 features, a flattening operation that takes us into the one-dimensional space, and then we pass into our final output dimensionality. Yes, can you say a few words about how you choose those numbers? The patch is of size three and there are 32
feature maps and however many layers; it seems to me that might be more of an art than a science. Yes, it is an art, but it can also be guided by a lot of intuition. Some intuition I'll provide is that, depending on the task, you typically want to start out with quite small features and then capture larger-scale spatial information as you go deeper into the model. So you want to make sure you understand, first of all, based on your image and your data set, looking at it as a human, what is the scale of the features this model should be looking for at the beginning and at the end, and those will basically define those two resolutions. If you have a massive image, let's say 2,000 by 2,000 pixels wide, starting with 3x3 will probably be too small; it doesn't make sense to look at that small a scale if your image is 2,000 pixels wide. But then there could be other applications where the image is very small and even 3x3 is already a significant resolution in that image. So it's very problem-dependent as well. Yes, would it still work on faces that are turned sideways? Not if it wasn't trained on faces that were sideways. This is a great question: the question was, you train it on upright faces and then you test it on faces that are turned sideways, and no, we would not expect it to do well in this case, because it has learned to detect
features of faces that were always upright. If it were shown faces of both orientations, then we would expect it to succeed, because it should learn that faces can be in either orientation and should pick out those features in both cases. Yes, how would this approach work for deformable or dynamic objects, for example something like a gummy bear, or a fluid simulation? It's very problem-dependent, but again, we'd want to make sure that we can learn it from data. Even in these more dynamic cases, and we'll see some examples of these types of situations later in the deck, we really want to make sure that they are all learnable from data. OK, I'll continue on, just because we have a lot more to get through. So far we've talked about CNNs only for classification tasks, but in reality this architecture extends far beyond classification. Remember this picture I showed you before:
I think this is a really helpful picture because it decomposes all CNNs into two pieces: a feature extractor, and then what we previously saw was a classifier, but that second piece can really be anything. You could take a feature extractor and use it with an object detector, a classifier, a segmentation model, a probabilistic control model, a regressor, and so on. So in this portion, we're going to look at all of the different types of models that you can create just by keeping this left side static and using the same
feature extractor, but now let's see how, by changing the right-hand side, we can achieve a lot of very different things: different types of models that look very different but in reality are not so different at all. In the case of classification, we saw this already: the right-hand part of the network is nothing more than a flattening operation over all of our features, and then we classify over N different classes. So here, for example, you see an example of binary classification, diagnosis or no
diagnosis. This is the case from a paper that came out a few years ago, which demonstrated that a CNN can outperform radiologists on these types of diagnosis tasks for mammogram images. Classification, let me say it like this, is typically a binary problem, or more generally a K-class problem; it doesn't have to be two classes, it could be K classes. Let's go one layer deeper. So this is an image of a taxi; classification would be me saying, OK, I input this image and I predict taxi.
Right, so how about object detection? Object detection would be not just predicting the class taxi, but predicting a couple more things: a location for that object, a class for that object, and an exact bounding box as well. So it's not just the location of the center of the object, it's the entire bounding box of that object, and the neural network has to tell us this not just for one object, but for every object in the scene: where it is and also its class. So our model needs to be extremely flexible to do this type of task, because now it's not outputting just a fixed set of K class scores. If the scene has only one taxi, it will output one bounding box, but if the image has many objects, it should be flexible enough to have many outputs as well: outputs that could have different classes, different locations, and a different number of detections. So
how can we accomplish something like this? This is very complicated, because, number one, those boxes can be anywhere in the image, they can be of any size, and there can be any number of them; you can have many of them, or no boxes at all in your image. So let's consider a very naive way first. Let's take our image on the left-hand side and start by placing one random box over the image. We'll take that one random box and pass it through a CNN that we trained before, the feature extractor, and then we'll use it in the old way, as a classification model. Even though we talked about making this very flexible, we'll just do it the old way first, but we'll do it repeatedly: we keep picking random boxes and asking, is there anything in this box? No? OK, go to the next box. We keep doing this over and over again for enough boxes, we fill up our entire image with a bunch of classifications, and we simply discard a box if no class was detected in it. Now, there are a lot of problems with this, but the main one is that there are way too many inputs; it would be way too computationally expensive, and for any practical image you would never have complete coverage over all possible boxes. There's an exponential explosion in the number of boxes with respect to resolution, so you can't simply do this type of naive method, but it gives us a good starting place.
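A purely illustrative sketch of that naive procedure (the classifier passed in here is a hypothetical stand-in for the trained CNN, not the actual lab code):

```python
import random

def random_boxes(img_h, img_w, n_boxes=1000, min_size=16):
    # Propose boxes at random positions and sizes over the image.
    boxes = []
    for _ in range(n_boxes):
        h = random.randint(min_size, img_h)
        w = random.randint(min_size, img_w)
        y = random.randint(0, img_h - h)
        x = random.randint(0, img_w - w)
        boxes.append((x, y, w, h))
    return boxes

def naive_detect(image, classify_crop):
    detections = []
    for (x, y, w, h) in random_boxes(image.shape[0], image.shape[1]):
        crop = image[y:y + h, x:x + w]
        label, score = classify_crop(crop)    # reuse the trained CNN classifier
        if label != "background":             # discard boxes with nothing in them
            detections.append((x, y, w, h, label, score))
    return detections
```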
So instead of picking random boxes, what if we were a bit more intelligent and used a heuristic method? There are actually pretty simple heuristics we could build that pick boxes around things that look like they have stuff happening in that part of the image. You don't have to know anything about the image; just by looking for blobs you can create a pretty decent heuristic that identifies, OK, here's a potentially interesting piece of the image, I'll draw a box around it and pass it through my classifier. But this is still very slow, because I need to draw a lot of boxes for it to work well, and we have to feed each box independently into our classification model. It's also brittle, because now we've completely disconnected these two parts of the problem. We talked a lot about this end-to-end model that goes from feature detection
to classification, but now I've broken that model. I've said, OK, I'm going to have a box proposer model over here that proposes boxes for me, and another classification model that takes the boxes and processes them. But ideally we want those two models to share information: in order to predict good boxes, the proposer should know what it's looking for at a high level. So how can we solve this? Here is one approach that you'll see and hear a lot about; I'll present it briefly so everyone is familiar with it. It's called an R-CNN method, a region-based convolutional neural network. What it attempts to do is learn those regions, those boxes, within the same network as the model that classifies them. What does this mean? We start from the bottom, from our original image, and we have a region proposal network. The region proposal network takes the image as input and proposes regions: it learns to extract features of where the high-importance regions are within the image, and these are used to define the boxes for the classification part. I take those regions, draw boxes around them, and pass them through the same classifier, using the same features that were learned to predict those regions. So now there's good alignment, end to end, in this entire process, and it requires only a single forward pass through the entire model, so it's very efficient compared to the
models that we saw before, and it's very accurate because it's sharing features along this process. OK, so we started with classification, then object detection went one step beyond classification, and now we'll go one step even beyond object detection, to segmentation. Instead of just predicting boxes in the image, what if we predicted a classification for every single pixel in our image? Now, given an image on the left-hand side, we want to predict another image on the right-hand side
with a classification value for every single pixel in that image. One example of this is shown here: a semantic segmentation model. Semantic means we're learning the semantics, the classes, of every pixel in our RGB image on the left-hand side. Here you can see an image of a cow on the left-hand side being semantically segmented into a few different categories: grass, cow, what looks like trees, and sky in the background as well. So how is this done? The first part of this network, on the left, is the same feature extractor you saw before, composed of convolutions, pooling-based downsampling, and nonlinearities. The second half is no longer a classification head in the one-dimensional sense: it keeps everything in two dimensions, but it upsamples now. So instead of convolution followed by downsampling, we do convolution followed by upsampling, and we learn the output on the far right to be a per-pixel classification. This is nothing more than another classification problem where we didn't downsample to one dimension; we kept everything in two dimensions and used upsampling instead of downsampling.
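A minimal sketch of that encoder-decoder idea (assuming RGB input and a hypothetical num_classes semantic categories; not the exact model on the slide):

```python
import torch.nn as nn

num_classes = 4
seg_model = nn.Sequential(
    # encoder: convolutions + downsampling, like the classifier's feature extractor
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # decoder: learned upsampling back to the input resolution
    nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2), nn.ReLU(),
    nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),
    # output: per-pixel class scores of shape (batch, num_classes, H, W)
)
```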
Let's see one final example. Let's say we want to learn a neural network for autonomous navigation, like we saw in the beginning of the class for self-driving cars. Specifically, say we want to learn a model that takes raw perception camera data as well as noisy street-view maps, like what you would get from Google Maps on your phone, for
instance. These are bird's-eye-view pictures of where the car is at this point in time, but they're just pictures, and you want to feed both of these into a neural network and let it learn how to drive based on these two pieces of information. If you think about it, this is all that humans need to drive as well: we only use these two things, our eyes and, if we go to a new city, something like Google Maps, some layer of navigation information to drive through that city. And what we want is to directly infer not just one steering wheel angle, but a full probability distribution over steering. Again, it's a form of a classification-like problem, but now we're not predicting over K discrete classes; we're going to learn a continuous probability distribution, because steering wheel angles are continuous, they can be anything from, let's say, negative 180 degrees to positive 180 degrees. So this entire model can be trained using exactly the same techniques we learned about today: we
can pass each of those images on the left-hand side through the same types of modules we've learned about, convolutional layers followed by nonlinearities followed by downsampling. All of that gives us features from each of those inputs; at the output we combine all of those features in one dimension and use them to learn the control parameters of the vehicle, and we regress on that with backpropagation. We learn the features, the filters, that are optimal for this type of task by watching a lot of driving data and learning what the good features are for doing this task successfully. In the end, the result is an autonomous car that has never driven on this road before: it can actually drive brand new, never-before-seen roads without detailed prior maps. This is very different from how, for example, Waymo cars (Google's cars) operate; those are quite different, because they require a human to drive through a city first, before they can drive autonomously there. Humans don't operate like this: we can be put in a brand new city for the first time and successfully drive in that city without ever having seen it before, and this type of model is also able to exhibit that type of behavior. So, I'll just conclude briefly, because the impact of CNNs has been very wide-reaching beyond the examples we've talked about today. Hopefully what you can appreciate is that all of the examples come from these
exact same fundamental techniques: it's convolutions, the operation itself; it's feature learning to learn those filters; and then some combination of upsampling and downsampling to preserve or modulate spatial resolution. So I'd like to summarize all of this. First we considered the origins of computer vision and took some history back through how images are represented: as two-dimensional arrays of values. We went one layer beyond that to ask how we could learn from those two-dimensional arrays, using convolutions to extract meaningful pieces of data, and then formulated that into full networks, not just a single convolution operation. And then we saw the many applications that can be spun off very easily from that backbone of convolutional layers: different types of models, anything from classification to detection to segmentation, and many others as well. I'll pause there and transition to Ava, who will be talking about generative deep learning, which is a brand new thing that we haven't talked about at all in this class: not just how we can learn from data, but how we can learn from data to generate more data. It's not only a different type of model but a completely different paradigm from everything we've seen so far in the class. So we'll take a couple of minutes to transition, and then we'll continue in probably two minutes. Thank you.