MIT 6.S191 (2023): Convolutional Neural Networks

Alexander Amini
MIT Introduction to Deep Learning 6.S191: Lecture 3 Convolutional Neural Networks for Computer Visio...
Video Transcript:
Hi everyone and welcome back to Intro to Deep Learning! We had a really awesome kickoff day yesterday so we're looking to keep that same momentum all throughout the week and starting with today. Today we're really excited to be talking about actually one of my favorite topics in this course which is how we can build computers that can achieve the sense of sight and vision. Now I believe that sight and specifically like I said vision is one of the most important human senses that we all have. In fact sighted people rely on vision quite a lot
in our day-to-day lives from everything from walking around navigating the world interacting and sensing other emotions in our colleagues and peers. And today we're going to learn about how we can use deep learning and machine learning to build powerful vision systems that can both see and predict what is where by only looking at raw visual inputs. I like to think of that phrase as a very concise and sweet definition of what it really means to achieve vision but at its core vision is actually so much more than just understanding what is where. It also goes
much deeper. Take this scene for example: we can build computer vision systems that can identify all of the objects in this environment, starting first with the yellow taxi or the van parked on the side of the road, but we also need to understand each of these objects at a much deeper level, not just where they are but actually predicting the future, predicting what may happen in the scene next. For example, the yellow taxi is more likely to be moving and dynamic into the future because it's in the middle of the lane, compared
to the white van, which is parked on the side of the road. Even though you're just looking at a single image, your brain can infer all of these very subtle cues, and it goes all the way to the pedestrians on the road and even more subtle cues in the traffic lights and the rest of the scene as well. Now accounting for all of these details in the scene is an extraordinary challenge, but we as humans do this so seamlessly, within a split second. I could probably put that frame up on the slide and all of you, within a split second, could reason about many of those subtle details without me even pointing them out. But the question of today's class is how we can build machine learning and deep learning algorithms that can achieve that same type of subtle understanding of our world. Deep learning in particular is really leading this revolution of computer vision and achieving sight for computers, for example allowing robots to pick up on key visual cues in their environment, critical for navigating the world together with us as humans. These algorithms that you're going to
learn about today have become so mainstream, in fact, that they're running on all of the smartphones in your pockets, processing every single image that you take, enhancing those images, detecting faces, and so on and so forth. We're seeing some exciting advances ranging all the way from biology and medicine, which we'll talk about a bit later today, to autonomous driving and accessibility as well. And like I said, deep learning has taken this field as a whole by storm over the past decade or so because of its ability, critically, like we were talking about
yesterday its ability to learn directly from raw data and those raw image inputs in what it sees in its environment and learn explicitly how to perform like we talked about yesterday what is called feature extraction of those images in the environment and one example of that is through facial detection and recognition which all of you are going to get practice with in today's and tomorrow's Labs as part of the grand final competition of this class another really go-to example of computer vision is in autonomous driving and self-driving Vehicles where we can take an image as
input, or maybe potentially a video as input, multiple images, and process all of that data so that we can train a car to learn how to steer the wheel, or command the throttle, or actuate a braking command. This entire control system, the steering, the throttle, the braking of a car, can be executed end to end by taking as input the images and sensing modalities of the vehicle and learning how to predict those actuation commands. Now this end-to-end approach, having a single neural network do all of this, is actually radically different than the vast
majority of autonomous vehicle companies; if you look at Waymo, for example, that's a radically different approach, but we'll talk about those approaches in today's class. In fact this is one of the vehicles that we've been building at MIT, in my lab in CSAIL just a few floors above this room, and we'll share some of the details of this incredible work. But of course it doesn't stop with autonomous driving: the same algorithms that you'll learn about in today's class can be extended all the way to impact healthcare, medical
decision making and finally even in these accessibility applications where we're seeing computer vision algorithms helping the visually impaired so for example in this project researchers have built deep learning enabled devices that could detect Trails so that visually impaired Runners could be provided audible feedback so that they too could you know navigate when they go out for runs and like I said we often take many of these tasks that we're going to talk about in today's lecture for granted because we do them so seamlessly in our day-to-day lives but the question of today's class is going
to be, at its core, how we can build a computer to do these same types of incredible things that all of us take for granted day to day. Specifically we'll start with this question of how a computer really sees, and even more detailed than that, how a computer processes an image. If we think of sight as coming to computers through images, then how can a computer even start to process those images? Well, to a computer, images are just numbers. Suppose for example we have a picture here of
Abraham Lincoln. This picture is made up of what are called pixels; every pixel is just a dot in this image, and since this is a grayscale image, each of these pixels is just a single number. Now we can represent our image as a two-dimensional matrix of numbers, and because, like I said, this is a grayscale image, every pixel corresponds to just one number at that matrix location. Now assume for example we didn't have a grayscale image, we had a color image; that would be an RGB image, so now every pixel is
going to be composed not just of one number but of three numbers. You can think of that as a 3D matrix instead of a 2D matrix, where you essentially have three two-dimensional matrices stacked on top of each other. So now, with this basis of numerical representations of images, we can start to think about what types of computer vision algorithms we can build that take these images as input and what they can perform.
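To make this concrete, here is a minimal sketch in NumPy of what those representations look like as arrays; the 28 by 28 size is just an illustrative assumption:

```python
import numpy as np

# A grayscale image is a 2D array of intensities; an RGB image adds a third
# dimension with one 2D slice per color channel (sizes here are illustrative).
grayscale = np.random.randint(0, 256, size=(28, 28))      # height x width
rgb       = np.random.randint(0, 256, size=(28, 28, 3))   # height x width x 3 channels

print(grayscale.shape)   # (28, 28)    -> one number per pixel
print(rgb.shape)         # (28, 28, 3) -> three numbers per pixel
```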
So the first thing that I want to talk to you about is what kind of tasks we even want to train these systems to complete with images. Broadly speaking there are two categories of tasks; we touched on this a little bit in yesterday's lecture, but just to be a bit more concrete in today's lecture, those two tasks are either classification or regression. In regression your prediction is going to take a continuous value, any real number on the number line, but in classification your prediction takes one of, let's say, k or n different classes, and these
are discrete, different classes. So let's consider first the task of image classification. In this task we want to predict an individual label for every single image, and this label that we predict is going to be one of n different possible labels that could be considered. So for example let's say we have a bunch of images of U.S. presidents, and we want to build a classification pipeline to tell us which president is in this particular image that you see on the screen. Now the goal of our model in this case is going to be basically to
output a probability score, a probability of this image containing one of these different presidents, and the maximum score is going to be ultimately the one that we infer to be the correct president in the image. So in order to correctly perform this task and correctly classify these images, our pipeline, our computer vision model, needs the ability to tell us what is unique about this particular image of Abraham Lincoln, for example, versus a different picture of George Washington versus a different picture of Obama. Now another way to think about this
whole problem of image classification or image processing at its high level is in terms of features or think of these as almost patterns in your data or characteristics of a particular class and classification then is simply done by detecting all of these different patterns in your data and identifying when certain patterns occur over other patterns so for example if the features of a particular class are present in an image then you might infer that that image is of that class so for example if you want to detect cars you might look for patterns in your
data like wheels, license plates, or headlights, and if those things are present in your image then you can say with fairly high confidence that your image is of a car versus one of these other categories. So if we're building a computer vision pipeline we have two main steps to consider: the first step is that we need to know what features or what patterns we're looking for in our data, and the second step is that we need to detect those patterns; once we detect them we can then infer which class we're in. Now one way to
solve this is to leverage knowledge about our particular Fields right so we if we know something about our field for example about human faces we can use that knowledge to Define our features right what makes up a face we know faces are made up of eyes noses and ears for example we can Define what each of those components look like in defining our features but there's a big problem with this approach and remember that images are just these three-dimensional arrays of numbers right they can have a lot of variation even within the same type of
object. These variations can include really anything, ranging from occlusions to variations in lighting, rotations, translations, and intra-class variation, and the problem here is that our classification pipeline needs the ability to handle and be invariant to all of these different types of variations while still being sensitive to the inter-class variations, the variations that occur between different classes. Now even though our pipeline could use features that we as humans manually define based on some of our prior knowledge, the problem really breaks down in that these features become very non-robust when considering
all of these vast amounts of different variations that images take in the real world so in practice like I said your algorithms need to be able to withstand all of those different types of variations and then the natural question is that how can we build a computer vision algorithm to do that and still maintain that level of robustness what we want is a way to extract features that can both detect those features right those patterns in the data and do so in a hierarchical fashion right so going all the way from the ground up from
the pixel level to something with semantic meaning like for example the eyes or the noses in a human face now we learned in the last class that we can use neural networks exactly for this type of problem right neural networks are capable of learning features directly from data and learn most importantly a hierarchical set of features building on top of previous features that it's learned to build more and more complex set of features now we're going to see exactly how neural networks can do this in the image domain as part of this lecture but specifically
neural networks will allow us to learn these visual features from visual data if we construct them cleverly and the key Point here is that actually the models and the architectures that we learned about in yesterday's lecture and so far in this course we'll see how they're actually not suitable or extensible to today's uh you know problem domain of images and how we can build and construct neural networks a bit more cleverly to overcome those issues so maybe let's start by revisiting what we talked about in lecture one which was where we learned about fully connected
networks now these were networks that you know have multiple hidden layers and each neuron in a given hidden layer is connected to every neuron in its prior layer right so it receives all of the previous layers inputs as a function of these fully connected layers now let's say that we want to directly without any modifications use a fully connected Network like we learned about in lecture one with an image processing pipeline so directly taking an image and feeding it to a fully connected Network could we do something like that actually in this case we could
the way we would have to do it is, remember, that because our image is a two-dimensional array, the first thing that we would have to do is collapse that to a one-dimensional sequence of numbers, because a fully connected network is not taking in a two-dimensional array, it's taking in a one-dimensional sequence. So the first thing that we have to do is flatten that two-dimensional array to a vector of pixel values and feed that to our network. In this case every neuron in our first layer is connected to all neurons in that input layer
right so in that original image, flattened down, we feed all of those pixels to the first layer. And here you should already appreciate the very important notion that every single piece of spatial information that really defined our image, that makes an image an image, is totally lost before we've even started this problem, because we've flattened that two-dimensional image into a one-dimensional array and completely destroyed all notion of spatial information. In addition we have an enormous number of parameters because this system is fully connected. Take for example a very, very small image which is even 100 by 100 pixels; that's an incredibly small image by today's standards, but that's going to take 10,000 neurons just in the first layer, which will be connected to, let's say, 10,000 neurons in the second layer. The number of parameters that you'll have in that one layer alone is going to be 10,000 squared, on the order of 100 million weights. It's going to be highly inefficient, and you can imagine what happens if you wanted to scale this network to even a reasonably sized image that we have to deal with today. So this is not feasible in practice.
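As a rough sanity check on that arithmetic, here is a minimal Keras sketch that flattens a 100 by 100 grayscale image and connects it to a 10,000-unit fully connected layer; the framework choice and exact layer sizes are illustrative assumptions that mirror the numbers above:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 100, 1)),
    tf.keras.layers.Flatten(),        # 100 * 100 = 10,000 input values
    tf.keras.layers.Dense(10_000),    # 10,000 x 10,000 weights (plus 10,000 biases)
])
model.summary()   # roughly 100 million parameters in this single layer
```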
Instead we need to ask ourselves how we can build in and maintain some of that spatial structure that's so unique about images, in our input and, most importantly, in our model. To do this, let's represent our 2D image in its original form, as a two-dimensional array of numbers. One way that we can use the spatial structure inherent to our input is to connect patches of our input to neurons in the hidden layer. So for example, let's say that each neuron in the hidden layer that you can see here only is
going to see or respond to a certain uh set or a certain patch of neurons in the previous layer right so you could also think of this as almost a receptive field or what the single neuron in your next layer can attend to in the previous layer it's not the entire image but rather a small receptive field from your previous image now notice here how the region of the input layer right which you can see on the left hand side here influences that single neuron on the right hand side and that's just one neuron in
the next layer but of course you can imagine basically defining these connections across the whole input right each time you have the single patch on your input that corresponds to a single neuron output on the other layer and we can apply the same principle of connecting these patches across the entire image to single neurons in the subsequent layer and we do this by essentially sliding that patch pixel by pixel across the input image and we'll be responding with you know another image on our output layer in this way we essentially preserve all of that very
key and rich spatial information inherent to our input. But remember that the ultimate task here is not only to preserve that spatial information; we want to ultimately learn features, learn those patterns, so that we can detect and classify these images. And we can do this by weighting, by weighing the connections between the patches of our input, in order to detect what those certain features are. Let me give a practical example here. In practice, this patching and sliding operation that I'm describing is actually a mathematical operation formally known as convolution. We'll first think about this at a high level. Suppose that we have what's called a four by four pixel patch; you can see this 4x4 pixel patch represented as a red box on the left-hand side, and since we have a 4x4 patch, this is going to consist of 16 different weights in this layer. We're going to apply this same four by four, let's call it not a patch anymore, let's use the terminology filter, we'll apply the same 4x4 filter to the
input and use the result of that operation to define the state of the neuron in the next layer right and now we're going to shift our filter by let's say two pixels to the right and that's going to define the next neuron in the adjacent location in the future layer right and we keep doing this and you can see that on the right hand side you're sliding over not only the input image but you're also sliding over the output neurons in the secondary layer and this is how we can start to think about convolution at
a very very high level but you're probably wondering right not just how the convolution operation works but I think the main thing here to really narrow down on is how convolution allows us to learn these features these patterns in the data that we were talking about because ultimately that's our final goal that's our real goal for this class is to extract those patterns so let's make this very concrete by walking through maybe a concrete example right so suppose for example we want to build a convolutional algorithm to detect or classify an X in an image
right, this is the letter X in an image, and here for simplicity let's just say we have only black and white images, so there's no grayscale in this image; in fact we're representing black as negative one and white as positive one. Now to classify, we cannot simply compare the left-hand side to the right-hand side, because these are both X's, but you can see that because the one on the right-hand side is slightly rotated to some degree, it's not going to directly align with the X on the left-hand side, even though it is an X. We want to detect X's in both of these images, so we need to think about how we can detect those features that define an X a bit more cleverly. Let's see how we can use convolutions to do that. In this case, instead, we want our model to compare images of this X piece by piece, or patch by patch, and the important patches that we
look for are exactly these features that will Define our X so if our model can find these rough feature patches roughly in the same positions in our input then we can determine or we can infer that these two images are of the same type or the same letter right it can get a lot better than simply measuring the similarity between these two images because we're operating at the patch level so think of each patch almost like a miniature image right a small two-dimensional array of values and we can use filters to pick up on when
these small patches or small images occur so in the case of x's these filters may represent semantic things for example the diagonal lines or the crossings that capture all of the important characteristics of the X so we'll probably capture these features in the arms and the center of our letter right in any image of an X regardless of how that image is you know translated or rotated or so on and note that even in these smaller matrices right these are filters of weights right these are also just numerical values of each pixel in these mini
patches is simply just a numerical value they're also images in some effect right and all that's really left in this problem and in this idea that we're discussing is to Define that operation that can take these miniature patches and try to pick up you know detect when those patches occur in your image and when they maybe don't occur and that brings us right back to this notion of convolution right so convolution is exactly that operation that will solve that problem convolution preserves all of that spatial information in our input by learning image features in those
smaller squares of regions that preserve our input data so just to give another concrete example to perform this operation we need to do an element wise multiplication between the filter Matrix those miniature patches as well as the patch of our input image right so you have basically think of two patches you have the weight Matrix patch the thing that you want to detect which you can see on the top left hand here and you also have the secondary patch which is the thing that you are looking to compare it against in your input image and
the question is how similar are these two patches. So for example, this results in a three by three matrix, because when you do an element-wise multiplication between two small three by three matrices you're left with another three by three matrix. In this case all of the elements of this resulting matrix you can see here are ones, because in every location in the filter and every location in the image patch we are perfectly matching, so when we do that element-wise multiplication we
get ones everywhere the last step is that we need to sum up the result of that Matrix or that element-wise multiplication and the result is let's say 9 in this case right everything was a one it's a three by three Matrix so the result is nine now let's consider one more example right now we have this image in green and we want to detect this filter in yellow suppose we want to compute the convolution of this five by five image with this three by three filter to do this we need to cover basically the entirety
of our image by sliding this filter over it piece by piece and comparing the similarity, or the convolution, of this filter across the entire image. We do that again through the same mechanism: at every location we compute an element-wise multiplication of the filter with that location on the image, add up all of the resulting entries, and pass that to our next layer. So let's walk through it. First, let's start off in the upper left-hand corner: we place our filter over the upper left-hand corner of our image, we element-wise multiply, we add up all the results, and we get four, and that 4 is going to be placed into the next layer. This next layer is again another image, but it's determined as the result of our convolution operation. We slide the filter over to the next location, the next location provides the next value in our output, and we keep repeating this process over and over again until we've covered the filter over the entire image. As a result we've also completely filled out our output feature map, which you can basically think of as a map of how closely aligned our filter is to every location in our input image.
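Here is a minimal NumPy sketch of that sliding multiply-and-sum operation; the 5x5 image and 3x3 filter values are illustrative rather than the exact ones on the slide, chosen so the upper-left entry of the feature map comes out to 4 as in the walkthrough above:

```python
import numpy as np

def convolve2d(image, kernel):
    # "Convolution" as used in CNNs: slide the filter over the image,
    # element-wise multiply, and sum (stride 1, no padding, no kernel flipping).
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))   # 3x3 feature map; the top-left entry is 4
```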
So now that we've gone through the mechanism that defines this operation of convolution, let's see how different filters could be used to detect different types of patterns in our data. For example, let's take this picture of a woman's face and the output of applying three different types of filters to this picture; they're all three by three filters, and you can see the exact filters on
the bottom right-hand corner of the corresponding face. By applying these three different filters you can see how we can achieve drastically different results; simply by changing the weights that are present in these three by three matrices you can see the variability of different types of features that we can detect. So for example we can design filters that sharpen an image, making the edges sharper; we can design filters that extract edges; and we can do stronger edge detection by again modifying the weights in those filters.
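For instance, here are two commonly used 3x3 kernels of that kind, a standard sharpen kernel and a standard edge-detection kernel (not necessarily the exact weights shown on the slide), applied with SciPy's correlate2d, which performs the same sliding multiply-and-sum described above:

```python
import numpy as np
from scipy.signal import correlate2d

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

image = np.random.rand(64, 64)                        # stand-in grayscale image
sharpened = correlate2d(image, sharpen, mode='valid')
edges     = correlate2d(image, edge_detect, mode='valid')
print(sharpened.shape, edges.shape)                   # (62, 62) (62, 62)
```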
So I hope now that all of you can appreciate the power of these filtering operations and how we can define them mathematically, in the form of these smaller patch-based matrices that we can then slide over an image. These concepts are so powerful because they preserve the spatial information of our original input while still performing feature extraction. Now, instead of defining those filters by hand like we did on the previous slide, what if we tried to learn them? And remember again that
those filters are kind of proxies for important patterns in our data so our neural network could try to learn those elements of those small patch filters as weights in the neural network and learning those would essentially equate to picking up and learning the patterns that Define one class versus another class and now that we've gotten this operation and this understanding under our belt we can take this one step further right we can take this singular convolution operation and start to think about how we can build entire layers convolutional layers out of this operation so that
we can start to even imagine convolutional networks and neural networks and first we'll take a look at you know what are called well what you ultimately create by creating convolutional layers and convolutional networks is what's called a CNN a convolutional neural network and that's going to be the core architecture of today's class so let's consider a very simple CNN that was designed for image classification the task here again is to learn the features directly from the raw data and use these learn features for classification towards some task of object detection that we want to perform
Now there are three main operations in a CNN, and we'll go through them step by step here but then go deeper into each of them in the remainder of this class. The first step is convolution, which we've already seen a lot of in today's class; convolutions are used to generate these feature maps, so they take as input both the previous image as well as some filter that they want to detect, and they output a feature map of how this filter is related to the original image. The second step, like yesterday, is applying
a non-linearity to the result of these feature maps that injects some non-linear activations to our neural networks allows it to deal with non-linear data third step is pooling which is essentially a down sampling operation to allow our images or allow our networks to deal with larger and larger scale images by progressively down scaling their size so that our filters can progressively grow in receptive field and finally feeding all of these resulting features to some neural network to infer the class scores now by the time that we get to this fully connected layer remember that we've
already extracted our features and essentially you can think of this no longer being a two-dimensional image we can now use the methods that we learned about in lecture one to directly take those learned features that the neural network has detected and infer based on those learned features and based on what if they were detected or if they were not what class we're in so now let's basically just go through each of these operations one by one in a bit more detail and see how we could even build up this very basic architecture of a CNN
So first let's go back and consider one more time the convolution operation that is central, a core part of the CNN. As before, each neuron in this hidden layer is going to be computed as a weighted sum of its inputs, applying a bias and activating with a non-linearity; this should sound very similar to lecture one in yesterday's class, except now, when we do that first step, instead of just doing a dot product with our weights we're going to apply a convolution with our weights, which is simply that element-wise multiplication and addition, and
that sliding operation now what's really special here and what I really want to stress is the local connectivity every single neuron in this hidden layer only sees a certain patch of inputs in its previous layer so if I point at just this one neuron in the output layer this neuron only sees the inputs at this red square it doesn't see any of the other inputs in the rest of the image and that's really important to be able to scale these models to very large scale images now you can imagine that as you go deeper and
deeper into your network eventually because the next layer you're going to attend to a larger patch right and that will include data from not only this red square but effectively a much larger Red Square that you could imagine there now let's define this actual computation that's going on for a neuron in a hidden layer its inputs are those neurons that fell within its patch in the previous layer we can apply this Matrix of Weights here denoted as a 4x4 filter that you can see on the left hand side and in this case we do an
element-wise multiplication, we add the outputs, we apply a bias, and we add that non-linearity. These are the core steps that we take in really all of the neural networks that you're learning about in today's and this week's classes. Now remember that this element-wise multiplication and addition, that sliding operation, is called convolution, and that's the basis of these layers. So that defines how neurons in convolutional layers are connected and how they're mathematically formulated, but within a single convolutional layer it's also really important to understand that a single layer could actually try to detect
multiple sets of filters. Maybe you want to detect multiple features in one image, not just one feature; if you're detecting faces, you don't only want to detect eyes, you want to detect eyes, noses, mouths, ears, all of those things are critical patterns that define a face and can help you classify a face. So what we need to think of is actually convolution operations that can output a volume of different images: every slice of this volume effectively denotes a different filter that can be identified in our original
input and each of those filters is going to basically correspond to a specific pattern or feature in our image as well think of the connections in these neurons in terms of you know their receptive field once again right the locations within the input of that node that they were connected to in the previous layer these parameters really Define what I like to think of as the spatial arrangement of information that propagates throughout the network and throughout the convolutional layers in particular now I think just to summarize what we've seen and how Connections in these types
of neural networks are defined, and how the output of a convolutional layer is a volume, we are well on our way to really understanding convolutional neural networks and defining them. What we just covered is really the main component of CNNs: the convolution operation that defines these convolutional layers. The remaining steps are very critical as well, but I want to maybe pause for a second and make sure that everyone's on the same page with the convolution operation and the definition of convolutional layers. Awesome, okay, so
the next step here is to take those resulting feature maps that our convolutional layers extract and apply a non-linearity to the output volume of the convolutional layer so as we discussed in the first lecture applying these non-linearities is really critical because it allows us to deal with non-linear data and because image data in particular is extremely non-linear that's a you know a critical component of what makes convolutional neural networks actually operational in practice in particular for convolutional neural networks the activation function that is really really common for these models is the relu activation function we
talked a little bit about this in lectures one and two yesterday. The ReLU activation function, which you can see on the right-hand side, can be thought of as a pixel-by-pixel operation that replaces all negative values with zero and keeps all positive values the same; it's the identity function when a value is positive, but when it's negative it squashes it back up to zero. Think of this almost as a thresholding function: it thresholds everything at zero, and anything less than zero comes back up to zero. So negative values here indicate basically a negative detection in the convolution, which you may want to just treat as no detection, and you can think of that as an intuitive mechanism for understanding why the ReLU activation function is so popular in convolutional neural networks. The other reason it's popular is that ReLU activations are extremely easy and computationally efficient to compute, and their gradients are very cleanly defined; they're constants except at the piecewise non-linearity. That makes them very popular in these domains.
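As a tiny sketch, that pixel-by-pixel thresholding is just the following:

```python
import numpy as np

def relu(x):
    # Identity for positive values, zero for negative values ("threshold at zero").
    return np.maximum(0.0, x)

feature_map = np.array([[-2.0, 0.5],
                        [ 3.0, -0.1]])
print(relu(feature_map))   # [[0.  0.5]
                           #  [3.  0. ]]
```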
Now the next key operation in a CNN is pooling. Pooling is an operation that, at its core, serves one purpose: to reduce the dimensionality of the image progressively as you go deeper and deeper through your convolutional layers. One way to reason about this is that when you decrease the dimensionality of your features, you're effectively increasing the receptive field of your filters, because every filter that you slide over a smaller image is capturing a larger region of the original input. So a very common technique for
pooling is what's called maximum pooling, or max pooling for short. Max pooling is exactly what it sounds like: it operates with these small patches again that slide over an image, but instead of doing a convolution operation, these patches simply take the maximum of that patch location. Think of this as activating the maximum value that comes from that location and propagating only the maximums. I encourage all of you to brainstorm other ways that we could perform even better pooling operations than max pooling; there are many common ways, but you could think of some, for example mean pooling or average pooling. Maybe you don't want to just take the maximum; you could collapse the average of all of these pixels into your single value in the result.
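Here is a minimal sketch of 2x2 max pooling with stride 2; swapping .max() for .mean() gives the average-pooling variant mentioned above (the input values are illustrative):

```python
import numpy as np

def max_pool2d(feature_map, pool=2, stride=2):
    # Slide a small window over the feature map and keep only the largest value
    # in each window, downsampling the map (assumes dimensions divide evenly).
    H, W = feature_map.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(0, H - pool + 1, stride):
        for j in range(0, W - pool + 1, stride):
            out[i // stride, j // stride] = feature_map[i:i + pool, j:j + pool].max()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 6, 1, 2],
              [5, 2, 9, 7],
              [1, 0, 3, 8]])
print(max_pool2d(x))   # [[6. 2.]
                       #  [5. 9.]]
```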
These are the key operations of convolutional neural networks at their core, and now we're ready to start putting them together and construct a CNN from the ground up. With CNNs we can layer these operations one after the other, starting first with convolutions and non-linearities and then pooling, and repeating these over and over again to learn these hierarchies of features. That's exactly how we obtain pictures like the one we started yesterday's lecture with, learning these hierarchical decompositions of features by progressively stacking these filters on top of each other, where each filter can then use all of the previous filters that it had learned. So a CNN built for image classification can be broken down into two parts: first is the feature learning pipeline, in which we learn the features that we want to
detect, and the second part is actually detecting those features and doing the classification. The goal of the convolutional and pooling layers in the first part of that model is to output the high-level features extracted from our input, and the next step is to actually use those features and detect their presence in order to classify the image. So we can feed these outputted features into the fully connected layers that we learned about in lecture one, because these are now just a one-dimensional array of features,
and we can use those to detect what class we're in. We can do this by using a function called a softmax function. You can think of a softmax function as simply a normalizing function whose output represents a categorical probability distribution. Another way to think of this: if you have an array of numbers, and those numbers could take any real value, you want to collapse that array into some probability distribution. A probability distribution has several properties, namely that all of its values have to sum to one, and every value always has to be between zero and one as well. Maintaining those two properties is what a softmax operation does; you can see its equation right here. It effectively just makes everything positive and then normalizes the results against each other, and that maintains those two properties that I just mentioned.
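A minimal sketch of that softmax operation in NumPy; subtracting the maximum is a standard numerical-stability trick, not part of the definition on the slide:

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))   # make everything positive (and stable)
    return exps / np.sum(exps)               # normalize so the values sum to 1

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # approximately [0.659 0.242 0.099]
print(softmax(scores).sum())  # 1.0
```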
Great, so let's put all of this together and actually see how we could program our first convolutional neural network end to end, entirely from scratch. Let's start by defining our feature extraction head, which starts with a convolutional layer, here with 32 filters or 32 features: the result of this first layer is to learn not one pattern in our image but 32 patterns. Those 32 results are then passed to a pooling layer and then on to the next set of convolutional operations, which now contain 64 features; we keep progressively growing and expanding the set of patterns that we're identifying in this image. Next we can flatten those resulting features that we've identified and feed all of this through our dense layers, the fully connected layers that we learned about in lecture one. These will allow us to predict those final, let's say, 10 classes; if we have 10 different possible classes in our image, this layer will account for that and allow us to output, using softmax, the probability distribution across those 10 classes.
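Putting those pieces together, here is a minimal sketch of that architecture in Keras, the framework used in the course labs; the 32 and 64 filters and the 10-way softmax follow the description above, while the input shape and the size of the intermediate dense layer are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),  # learn 32 patterns
    tf.keras.layers.MaxPool2D(pool_size=2),                        # downsample
    tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),  # learn 64 patterns
    tf.keras.layers.MaxPool2D(pool_size=2),
    tf.keras.layers.Flatten(),                                     # features -> 1D
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),               # 10-class distribution
])
model.summary()
```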
So far we've talked about how we can use CNNs to perform image classification tasks, but one thing I really want to stress in today's class, especially towards the end, is that this same architecture and the same building blocks that we've talked about so far are extensible; they extend to so many different applications and model types that we can imagine. For example, when we considered the CNN for classification we saw that it really had two parts: the first part being feature extraction, learning what features to look for, and the second part being the classification, the detection of those features. Now what makes a convolutional neural network really powerful is exactly the observation that the feature learning part, this first part of the neural network, is extremely
flexible: you can take that first part of the neural network, chop off what comes after it, and put a bunch of different heads onto the part that comes after. The goal of the first part is to extract those features; what you do with the features is entirely up to you, but you can still leverage the flexibility and the power of the first part to learn all of those core features. So for example, that later portion could handle all of the different image classification domains after you've extracted the features, or we
could also introduce new architectures that take those features and perform tasks like segmentation or image captioning, like we saw in yesterday's lecture. In the case of classification, for example, just to tie up the classification story, there's a significant impact in domains like healthcare and medical decision making, where deep learning models are being applied to the analysis of medical scans across a whole host of different medical imagery. Now classification gives us basically a discrete prediction of what our image contains, but we can actually go much deeper into this problem as well. So for example,
imagine that we're not trying to only identify that this image is an image of a taxi which you can see here but also more importantly maybe we want our neural network to tell us not only that this is a taxi but identify and draw a specific bounding box over this location of the taxi so this is kind of a two-phase problem number one is that we need to draw a box and number two is we need to classify what was in that box right so it's both a regression problem where is the box right that's
a continuous problem as well as a classification problem is what is in that box now that's a much much harder problem than what we've covered so far in the lecture today because potentially there are many objects in our scene not just one object right so we have to account for this fact that maybe our scene could contain arbitrarily many objects now our Network needs to be flexible to that degree right it needs to be able to infer a dynamic number of objects in the scene and if the scene is only of a taxi then it
should only output you know that one bounding box but on the other hand if the image has many objects right potentially even of different classes we need a model that can draw a bounding box for each of these different examples as well as associate their predicted classification labels to each one independently now this is actually quite complicated in practice because those boxes can be anywhere in the image right there's no constraints on where the boxes can be and they can also be of different sizes they can be also different ratios right some can be tall
some can be wide. Let's consider a very naive way of doing this. First, let's take our image and start by placing a random box somewhere on that image; we just pick a random location and a random size and place a box right there. Then we can take that box, and only that random box, and feed it through our convolutional neural network, which is trained to do classification, just classification. This neural network can detect, number one, is there an object in that box or not, and if so, what class is it. Then we can just keep repeating this process over and over again for all of these random boxes in our image, many, many instances of random boxes: we keep sampling a new box, feed it through our convolutional neural network, and ask what was in the box, and if there was something in there, what is it, and we keep going until we have exhausted all of the boxes in the image.
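Sketched in code, that naive procedure looks roughly like the following; classifier is a hypothetical function mapping an image crop to a label and a confidence, and the whole thing is illustrative rather than a practical detector:

```python
import numpy as np

def naive_detection(image, classifier, num_boxes=1000):
    # Sample random boxes, crop, and classify each crop independently --
    # the brute-force approach described above (far too slow in practice).
    rng = np.random.default_rng(0)
    H, W = image.shape[:2]
    detections = []
    for _ in range(num_boxes):
        h = int(rng.integers(16, H))           # random box size
        w = int(rng.integers(16, W))
        y = int(rng.integers(0, H - h + 1))    # random box location
        x = int(rng.integers(0, W - w + 1))
        label, confidence = classifier(image[y:y + h, x:x + w])
        if label != "background":              # keep only boxes containing an object
            detections.append((x, y, w, h, label, confidence))
    return detections
```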
But the problem here is that there are just way too many potential inputs that we would have to deal with; this would be totally impractical to run in a real-time system with today's compute. It results in way too many boxes and scales, especially for the resolutions of images that we deal with today. So instead of picking random boxes, let's try to use a very simple heuristic to identify some places with lots of variability in the image, where there is a high likelihood that an object might be present. These might have meaningful insights or
meaningful objects that could be available in our image and we can use those to basically just feed in those High attention locations to our convolutional neural network and then we can basically speed up that first part of the pipeline a lot because now we're not just picking random boxes maybe we use like some simple heuristic to identify where interesting parts of the image might be but still this is actually very slow in practice we have to feed in each region independently to the model and plus it's very brittle because ultimately the part of the model
that is looking at where potential objects might be is detached from the part that's doing the detection of those objects. Ideally we want one model that is able to both figure out where to attend and do that classification afterwards. So there have been many variants proposed in this field of object detection, but for the purpose of today's class I want to introduce you to one of the most popular ones: a model called R-CNN, or Faster R-CNN, which actually attempts to learn
not only how to classify these boxes but also how to propose where those boxes might be in the first place, so that the model learns where to feed into the downstream neural network. This means that we can feed the image into what are called region proposal networks; the goal of these networks is to propose certain regions in the image that you should attend to, and then feed just those regions into the downstream CNNs. So the goal here is to directly try to learn or extract all of those
key regions and process them through the later part of the model. Each of these regions is processed with its own independent feature extractor, and then a classifier can be used to aggregate them all and perform the detection of those objects. The beautiful thing about this is that it requires only a single pass through the network, so it's extraordinarily fast; it can easily run in real time, it's very commonly used in many industry applications, and it can even run on your smartphone. So in classification we just saw how we can
predict a single object per image; in object detection we saw how to potentially infer multiple objects with bounding boxes in an image. There's also one more type of task which I want to point out, which is called segmentation. Segmentation is the task of classification, but now done at every single pixel. This takes the idea of object detection with bounding boxes to the extreme: now, instead of drawing boxes, we're not even going to consider boxes; we're going to learn how to classify every single pixel in this image in
isolation right so it's a huge number of classifications that we're going to do and we'll do this well first let me show this example so on the left hand side what this looks like is you're feeding an original RGB image the goal of the right hand side is to learn for every pixel in the left hand side what was the class of that pixel right so this is kind of in contrast to just determining you know boxes over our image now we're looking at every pixel in isolation and you can see for example you know
the pixels of the cow are clearly differentiated from the pixels of the sky or the pixels of the grass, and that's a key critical component of semantic segmentation networks. The output here is created by again using these convolutional operations followed by pooling operations, which learn an encoder, which you can think of on the left-hand side: these are learning the features from our RGB image, learning how to put them into a space from which it can reconstruct into a new space of semantic labels. So you can imagine a downscaling and then a progressive upscaling into the semantic space. But when you do that upscaling, of course, you can't keep pooling down that information; you need to invert all of those operations. So instead of doing convolutions with pooling, you can now do convolutions with basically reverse pooling, or expansions; you can grow your feature maps at every layer. Here's an example at the bottom of a piece of code that actually defines these layers; you can plug in these layers, combine them with convolutional layers, and build fully convolutional networks that can accomplish this type of task.
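Here is a minimal Keras sketch of that encoder-decoder idea, using transposed convolutions for the upsampling; the input size, layer widths, and number of classes are illustrative assumptions, not the code from the slide:

```python
import tensorflow as tf

num_classes = 12
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    # Encoder: learn features while downscaling
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(2),
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(2),
    # Decoder: "reverse pooling" / expansion back toward the input resolution
    tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu'),
    # One class score per pixel
    tf.keras.layers.Conv2D(num_classes, 1, activation='softmax'),
])
model.summary()   # output shape: (128, 128, num_classes)
```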
Now of course this can be applied in many other applications, in healthcare as well, especially for segmenting out, let's say, cancerous regions, or even identifying parts of the blood which are infected with malaria, for example. One final example here is self-driving cars. Let's say that we want to build a neural network for autonomous navigation, specifically a model that can take as input an image as well as some very coarse maps of where it thinks it is; think of this as basically a screenshot
of Google Maps, essentially, fed to the neural network; it's the GPS location on the map. And the model wants to infer not a classification or a semantic segmentation of the scene, but now directly infer the actuation, how to drive and steer this car into the future. Now this is a full probability distribution over the entire space of control commands; it's a very large, continuous probability space, and the question is how can we build a neural network to learn this function. The key point that I'm stressing with all of
these different types of architectures here is that all of these architectures use the exact same encoder; we haven't changed anything when going from classification to detection to semantic segmentation and now to here. All of them are using the same underlying building blocks of convolutions, non-linearities, and pooling. The only difference is that after we perform that feature extraction, how do we take those features and learn our ultimate task? So for example, in the case of probabilistic control commands, we would want to take those learned features and understand how to predict the parameters of
a full continuous probability distribution like you can see on the right hand side as well as the deterministic control of our desired destination and again like we talked about at the very beginning of this class this model which goes directly from images all the way to steering wheel angles essentially of the car is a single model it's learned entirely end to end we never told the car for example what a lane marker is or you know the rules of the road it was able to observe a lot of human driving data extract these patterns these
features from what makes a good human driver different from a bad human driver, and learn how to imitate those same types of actions. So without any human intervention or human rules that we impose on these systems, they can simply watch all of this data and learn how to drive entirely from scratch. A human, for example, can actually enter the car, input a desired destination, and this end-to-end CNN will actuate the control commands to bring them to that destination. Now I'll conclude today's lecture by just saying that
the applications of CNNs, we've touched on a few of them today, but the applications of CNNs are enormous, far beyond the examples that I've provided today. They all tie back to this core concept of feature extraction and detection; after you do that feature extraction you can really chop off the rest of your network and apply it to many different heads, for many different tasks and applications that you might care about. We've touched on a few today, but there are really so many in different domains. And with that I'll conclude, and very shortly
we'll be talking about generative modeling, which is a really central part of today's and this week's lecture series. After that we'll have the software lab, which I'm excited for all of you to start participating in. And yeah, we can take a short five-minute break and continue the lectures from there. Thank you. [Applause]