MIT Introduction to Deep Learning | 6.S191

Alexander Amini
MIT Introduction to Deep Learning 6.S191: Lecture 1 *New 2025 Edition* Foundations of Deep Learning ...
Video Transcript:
Good afternoon everyone, and thank you for joining us today. My name is Alexander Amini, and together with Ava we're going to be your instructors for the course this year. This is MIT Introduction to Deep Learning, or 6.S191 as our official course title. We're super excited to welcome you to this class, and I think a good place to start is always asking ourselves: what is MIT Intro to Deep Learning? This is a one-week boot camp on everything deep learning, so it's both a very fun but
also a very intense week, because we're going to cover a ton of material in just the next five days. This is our eighth year teaching this class, and the pace of the field, especially in the past couple of years, is really remarkable. Every year that we teach this class it gets more and more interesting to introduce this lecture in particular, and how we introduce this lecture has really had to adapt and evolve over the years. Many of you in the audience have probably even started to become almost a
bit desensitized to a lot of the progress of deep learning in the past couple of years, because of how rapidly this progress is happening. So I think it's also important to not forget where we came from just a few years ago. I want to show you this image right here just to start this off, and what better way to show you than for you to actually see the progress with your own eyes: exactly one decade ago, this was the state of the art for a deep-learning-based
facial generation system. This is not a real face; this was the state-of-the-art model that could generate faces, and this was the best that we could do. Fast forward just a few years down the pipeline and progress in image generation had already started to advance tremendously; here you can see a lot more realism, photorealism, in the types of images being created. Then fast forward another few years after that, and these images start to come to life: they start to have temporal
information, they start to have video and movement as well. In fact, this video that you see on the right is a video that we created in this class some years ago, and for those of you who haven't seen it already, it's online and people have seen it, but in case you haven't I'll play just the first 10 seconds, not the whole thing, just so you can see it as well: "hi everybody and welcome to MIT 6.S191,
the official introductory course on deep learning taught here at MIT." Now, I won't play the whole thing, but you get the gist of this video. This video was created five years ago; we made it as part of this class and we used it to actually introduce the class back then. When we did this in 2020 it got a lot of attention, even back then. Maybe you guys aren't that impressed by it today, but back then this video was a huge deal. This was a huge
jump in photorealism for the capabilities of deep learning models, and the clip very rapidly went a bit viral, and people commented a lot about the realism. But one interesting thing: people saw the end result, but what they didn't see at that point was what it took to generate that clip. It was a two-minute clip; you only saw the first 10 seconds, but the clip was two minutes in total. To generate that two-minute clip, it took around two hours of professional audio data
recorded and captured of the speaker (who was not us), around 50 hours of professional high-definition video data to build the face model, and it required around $155,000 of compute to generate that two-minute video. And all of that was just to generate a predefined script, something static: you couldn't talk to it, you couldn't interact with it, it was just a predefined script, static, not flexible at all. But still a tremendous amount of resources: time, data, energy, and financial resources
as well and I wanted to start this class by basically asking this question of you know what would this look like today right that video was created in 2020 but today we're in 2025 and what would cloning you know and generative AI look like in today's world right especially understanding all of those limitations that we had back then in 2020 and I thought what better way to to to tell you about this than to just show you instead of you know all of that compute and resources that we spent in 2020 let's do it live
and unedited right in front of everybody today so in fact I have a bit of a secret so I'm going to share my other screen and move this over so this is a a terminal window that you can actually see something's been recording audio in the background as I've been talking I've been having this program that we made you know listening to my audio recording my voice and what I'm going to do is just uh you know pause the recording now and we're going to see how quickly it can actually immediately start to uh you
know just take my voice and do an instant clone as opposed to all of that time intensive resources that we saw before so I'll pause it right now you'll see it instantly clone and and yeah now now my voice is cloned and what we're going to do with this is actually try and not have a predefined script like we saw before I'm going to actually have a live and unedited conversation with this voice and see how Dynamic and flexible these types of generative AI models have become in today's world so let's try this out hello
can you hear me hi yes loud and clear how are you doing wait a second actually you sound a lot like me don't you that's interesting you're right we're even more similar than I thought what are we up to well actually I was just kicking off this year's MIT deep learning class and you know I'm I'm standing right in front of the audience right now uh MIT introduction to deep learning is such a great class what's the class like so far actually we just started uh I you know what I had an idea so I
was just about to introduce what is deep learning how about you take that part over for me deep learning is a type of machine learning that uses artificial neural networks to learn from data these networks are inspired by the structure of the human brain and are able to learn complex patterns from large amounts of data deep learning is used in a wide range of applications including image recognition natural language processing and self-driving cars it's pretty good actually you know I think the amazing thing I was telling everybody in the audience about was that the the
state of deep learning today is so flexible, it's so dynamic. What do you think? You're right, deep learning is incredibly dynamic, it's constantly evolving with new architectures, algorithms, and applications being developed all the time. Actually wait, let me interrupt you right there just to show how dynamic you are. I'm going to interrupt you and let's try this: enough about deep learning, how about you tell me a story about turtles, and make it wild. Oh, turtles, you want the wild turtle story? Okay, picture this: there's this turtle named Shell, right, he's not your average turtle. Oh
no, Shell's got dreams. He dreams of flying. He spends his days practicing flapping his little flippers trying to get off the ground. One day he meets this crazy inventor, a mad scientist turtle named Dr. Sheldon, who's built a rocket-powered shell. Shelly jumps in and boom, they blast off into space. They zoom past planets, meet aliens who love to eat... I'll interrupt you, I'll interrupt one more time. It's turtle-tastic. Oh, let's end it here, okay, thank you so much, this was a cool way to introduce the class. I'm glad I could help.
okay thanks byebye [Applause] bye okay awesome okay so that's just a fun way to just to show you know how far we've come in this field you know just from a few years ago generating very static content and now you know live unedited extremely Dynamic content as well so you actually just heard a very brief introduction on what deep learning is and in fact in that demo uh and all of the progress that all of you have been seeing over the past many years you've been seeing what you'll see in this class over the next
one week is the fundamental techniques that drive all of that progress. So let's start by laying some foundation, some groundwork, on exactly what this field is all about. To do that, I first have to introduce to you what intelligence is. To me, the word intelligence means the ability to process information in order to inform some future decision or action. All of us exhibit this capability every single day, some more than others,
but artificial intelligence is just the practice of building artificial algorithms to do exactly that same process: use information, use data, to inform future decisions. Now, what is machine learning? Machine learning is a subset of artificial intelligence that focuses on not explicitly programming the computer how to use that data, how to process that information to inform that decision, but instead trying to learn patterns within the data to make those decisions. And finally, deep learning is just a subset of machine learning which focuses on doing that exact process with deep neural
networks. We'll learn exactly what deep neural networks are throughout this class, but at a high level this entire course is about this core idea: you will learn how to teach computers how to learn, how to do tasks directly from observation, directly from data. We'll provide you both a solid foundation in the lectures and also practical understanding in the software labs as
well, so you can get very hands-on. That's probably a good segue to tell you a little bit about the entire course at a high level. This is going to be a combination, like I said, of technical lectures and software labs. We'll have several new updates this year in particular; as the field is advancing so quickly, we're really going to try to drive home a lot of key points, especially on the more modern side of deep learning. To that end, we'll
conclude with some guest lectures from industry leaders on state-of-the-art deep learning and AI methods that are being developed in industry, and this will really start to advance your knowledge even more. In addition, tonight we're going to have a reception at 4:30, and you're all invited to that reception as well, to talk to everyone and learn more about deep learning; there's also food provided. This year we also have a lot of great updates on the software labs, so we'll be
introducing both TensorFlow and PyTorch software labs. These are, number one, a great learning experience for all of you to get hands-on with everything that you learn in the lectures, but they're also a medium for you to enter into the competitions and make yourselves eligible for a lot of cash prizes at the end of this course. So how exactly does that work? Each day we'll have a dedicated lecture and a dedicated software lab that mirrors that lecture, and the software lab will basically reinforce what
has been taught during the day in the lectures. Starting today, you'll have Lab 1, where you're going to focus on building a form of a language model. It'll be a very small language model, but it's a next-token-predictor language model that learns how to generate music, predicting the next token of music so you can generate novel folk songs. Then tomorrow we'll move on to facial detection systems: you'll get hands-on with building your own computer vision system from scratch, understanding also some automated techniques to fix imbalanced data in
those systems. And then finally, Lab 3 is going to be a brand new lab premiering this year for the first time, on large language models. In that lab you're actually going to fine-tune a two-billion-parameter large language model on compute that you'll control, in a mystery style, and you'll also build an AI judge to evaluate the quality of that language model. So all three of these labs are going to be a lot of fun. And then finally, on
the last day of the class we'll have a final project pitch competition. Groups of up to three to five people will each present for up to three to five minutes in a shark-tank-style pitch competition, and then you'll be eligible for even more prizes as part of that as well. Okay, I won't go through this slide; there are many great resources available as part of this class, and this slide as well as the entire lectures are all posted online. You can already check the
website; they should be online already. If you ever need any help, please post on Piazza if you have any questions; we have a team of incredible TAs and instructors this year that you can reach out to at any time for any questions or issues. Myself and Ava will be your two main lecturers for most of the course, but you'll also be hearing from a lot of guest lecturers throughout the rest of the class; here are some of the names. This course in general would not have been possible, each of these
years, without all of our amazing sponsors, so I do want to give a huge thank you for all of their support over the years. Okay, so now that we've gone through all of that, I want to start with a lot of the fun stuff. (Yeah, go ahead. Sure, yeah, that's right.) So this course has been taught for eight years; we've now taught it to around 13 million people, and that's the global audience; just at MIT alone, the MIT audience is probably around 3,000 at this point,
and every year online around 100,000 people take this class, so you're in great company; a lot of really amazing people have taken this class and we're really excited for all of you to be here today. Now, as we dive into the technical part of this class, I want to start by really asking this fundamental question: why deep learning, and why now? Hopefully this is a question all of you have asked before you came here today. Understanding exactly what gets at the basis of deep
learning is really important so that we can understand how we can move forward and build even better algorithms that drive this field. Let's start with traditional machine learning for a second. Traditional machine learning typically defines what are called sets of features; I'll tell you more about that word in a second, but usually these features are basically rules for how to do a task, step by step. The problem is that if we as humans define those features, we're not usually very good at building very robust features.
So for example, let's say I told you to build an AI model that could detect faces. How would you do this? What features would you build in an image to detect faces? Well, what you could do is start by first detecting lines in the image, just edges, very simple lines. Then you could start to compose those lines together to detect things like curves, curves of lines,
not just straight lines. Then you can combine those together to start to form more composite objects, like eyes and noses and ears, and from there you can actually start to build up the structure of faces. Why would you do it like this? Hopefully this is naturally the way that you would also think of doing it, because it's very hard to immediately, in one shot, detect a face. You don't process faces like that; you actually start by processing much coarser, low-level features first,
then you compose these together to really form your own intuition about a face. Now, the key idea of deep learning is no different from this process, except that we learn these features. Instead of me telling you or you telling me exactly what those features are, the key idea of deep learning is to say: after observing a lot of faces, can I learn that I should first detect things in this hierarchical fashion, step by step? First detect the lines, then detect the curves, then detect the
the composites like eyes, noses, and ears, and then build up to facial structure like this. It turns out this is exactly what deep learning is able to do, and we'll see how this is being done underneath the hood throughout this lecture. It's really important to understand, though, that even though we are seeing so many of these amazing things from deep learning over the past few years, almost everything that you'll learn, especially in today's lecture (this is an intro lecture), was invented or
developed decades ago. These are not new things that we'll be showing in today's lecture; tomorrow, and the day after, you'll start to see a lot more of the recent advances. But why are we seeing all of this today? The reason is that we see an explosion of these techniques, even the techniques that are decades old, because of three key components. Number one is data: data is becoming more and more plentiful throughout the world, and this is really driving deep learning progress. Number two is compute:
compute is becoming more and more powerful and more and more commoditized; GPU architectures especially are driving the progress in deep learning, and GPUs only recently started to be commoditized. And finally, open-source toolboxes like you see on the right-hand side, TensorFlow, PyTorch, Keras, and so on, make it very streamlined and very easy for all of you, just in a one-week course, to get hands-on with these architectures and start to build directly. So let's start by understanding the fundamental building block of every
neural network, and that's just a single neuron, or perceptron. So what is a perceptron? The idea of a perceptron, or a single neuron, is really simple. Let's start by defining a perceptron purely by its forward propagation of information: given some inputs, how does a perceptron compute an output? We start by defining a set of inputs x1 to xm, and each of these inputs will be multiplied by a corresponding weight w1 through wm. After we do
this multiplication, we're going to add up all of those numbers together, take the single number that comes out, and then pass it through what's called a nonlinear activation function. This is just a nonlinear one-dimensional function that this single number passes through; on the output here it's denoted as g. Okay, I left out one minor detail, so I'll correct it right now: the one thing that we have to remember is that after we multiply all of our weights by our inputs, we're also going to add one number called a
bias term. The bias term, if you look at the equation, is effectively a way for us to shift left and right along our activation function g; it's just a shifting scalar designed into the equation. Now, on the right-hand side of this slide you can see the diagram on the left mathematically written as a single equation. I'm now going to rewrite this, for the sake of cleanliness, using linear algebra, in terms of vectors and dot products.
So instead of x1 through xm, I'm going to write just a vector X: X is going to be a collection of all of my inputs, and W will be a vector of all of my weights. The output y is then simply obtained by taking a dot product between X and W, adding our bias, and passing the result through g, through a nonlinearity: y = g(X · W + b).
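To make that concrete, here is a minimal sketch of a single perceptron's forward pass in plain NumPy; the variable names and example values are purely illustrative, not taken from the slides:

```python
import numpy as np

def perceptron_forward(x, w, b):
    # multiply inputs by weights, sum them up, add the bias...
    z = np.dot(x, w) + b
    # ...and pass the single number through a nonlinear activation (sigmoid here)
    return 1.0 / (1.0 + np.exp(-z))

# illustrative values only
x = np.array([1.0, -2.0, 0.5])   # inputs x1..xm
w = np.array([0.3, 0.1, -0.4])   # weights w1..wm
b = 0.5                          # bias term
print(perceptron_forward(x, w, b))
```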
Now, you might be wondering: I've mentioned this nonlinearity a few times, but what exactly is it? One common example is what's called the sigmoid function. The sigmoid function, which you can see right here, can act over any real number on the x-axis but outputs only values between zero and one, so the sigmoid function is really good for things like probabilities, if you want to convert the output of your perceptron, your neuron, into a probability. In fact there are many types of nonlinear functions; sigmoid is just one that's commonly used in neural networks, and
throughout this presentation you'll see a few examples of different nonlinear functions. I'll also point out that on the bottom of this slide you can see some code snippets, both in TensorFlow and PyTorch, that will help align what you're seeing in the math with code that will be relevant for some of your software labs later today. The sigmoid function that you saw earlier has outputs that are very good for probabilities. You'll also see things like, on the right-hand side, the rectified linear unit (ReLU). This outputs
values that are non-negative: it is piecewise linear, linear before zero and linear after zero, but it has a single nonlinearity at x = 0. Now, why do we need activation functions at all? That's a question hopefully everyone here has asked; at first glance it seems unnecessary. The point of an activation function is actually quite simple: it's to introduce nonlinearities into your model. Without a nonlinear activation function you have a linear model. So why do we want nonlinearities? Simply because real data in the real world is heavily nonlinear.
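As a quick reference, here is a minimal sketch of the two activations just described (plain NumPy, illustrative only; the labs use the built-in TensorFlow/PyTorch versions):

```python
import numpy as np

def sigmoid(z):
    # smooth, outputs in (0, 1) -- handy for probabilities
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # piecewise linear, outputs max(0, z) -- zero for negative inputs
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # ~[0.12, 0.5, 0.88]
print(relu(z))      # [0., 0., 2.]
```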
Now, maybe a good example to show this: let's say I show you this picture here, and I ask you to build a classifier, to draw a single line that separates the red points from the green points. Can you draw that line? At first glance, yes, you could draw a line, but what if I told you that it had to be a straight line? If it has to be a straight line, then it's not really possible to do this task well anymore, and that
makes the problem really hard. That's the problem with having a linear model. The benefit of having nonlinearities is that it allows us to approximate arbitrarily complex functions with enough depth in our model; this is exactly what makes nonlinear neural networks so extremely powerful. Let me help you understand this with a simple example. Imagine I gave you a trained neural network that you can see here: it has one perceptron, but it has two inputs, x1 and x2, it has two weights, w1 and w2, and it also has this
bias term on the top as well. How would we process this information? It's the same story as before: we compute a dot product, we add the bias, and we pass it through our nonlinearity g. Now, since the network is trained, we already know its weights: the first weight is 3 and the second weight is -2. We can plug these into our equation along with the bias, and we can see
that we obtain a line, a two-dimensional line that parameterizes the entire input space of this neuron. Since it's only in two dimensions we can even plot this line, so we can see exactly what this whole space looks like, and for any new input that this model sees, where with respect to this line it would fall. So let's say, for example, that I have this new point here; the point, in this x1-x2 space, is going to be at (-1, 2), and we can see graphically exactly
where on this plot it falls with respect to the line. We can also plug it back into our equation: if we plug in -1 as our input for x1 and +2 as our input for x2, we can plug it into the equation on the bottom left and pass it through the nonlinearity. The nonlinearity here is a sigmoid function; it squashes everything to lie between zero and one, depending on which side of the line the point falls on, and we get this final
answer. In this case the final answer is 0.2, which is less than 0.5; 0.5 is going to be the divider, because all of our outputs are going to be separated between zero and one. We can represent this graphically as well: if you fall directly on the line, your output after the nonlinearity is going to be 0.5; the more to the blue side you fall, the farther under 0.5 you are, and the more to the green side you fall, the farther above 0.5 you are. So the line basically represents the point
of separation between these two sides of the space, and depending on which side your input falls on, this is the way for you to classify the point as either a positive point or a negative point. It's also important to understand that we just did this for a single neuron with two inputs; you can imagine that if you had a model with many more inputs than two, it would no longer be possible to draw this plot, and that is something we'll have to deal with in terms of understanding and building intuition.
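Here is a minimal sketch of how that 0.5 threshold turns a perceptron's output into a class label; the weights, bias, and input below are placeholders for illustration, not the values from the slide:

```python
import numpy as np

def classify_point(x, w, b):
    # forward pass: dot product + bias, squashed through a sigmoid
    z = np.dot(w, x) + b
    y = 1.0 / (1.0 + np.exp(-z))
    # 0.5 is the divider: outputs above it fall on one side of the line,
    # outputs below it fall on the other side
    return ("positive (green)" if y > 0.5 else "negative (red)"), y

# placeholder values, purely illustrative
w = np.array([2.0, -1.0])
b = 0.5
print(classify_point(np.array([1.0, 3.0]), w, b))  # falls on the negative side (~0.38)
```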
But hopefully even at this small scale you can build some level of intuition from this plot. Let's see how we can now start to tie some of this together to go beyond just one neuron and start to build networks, because this is where we actually build really powerful systems: not from one neuron but from full networks. To do that, let's revisit our diagram one more time. If there are a few things that you take away from this class, this is hopefully the slide that you take them
away from, and I've said it already a few times, I'll say it one more time: how do you pass information through a neuron? You take a dot product, you add a bias, and you apply a nonlinearity. It's these three steps, and they keep getting repeated over and over again. I will simplify the diagram, since I've now told it to you so many times and hopefully it's starting to stick: I'm going to remove all of the weights from this diagram and I'm also going to remove the bias term. You can always assume that
those two things are there; I'll just remove them from the illustration moving forward to keep things cleaner. Now, z here is going to be the result of that term: it's the result of the dot product plus the bias, before the nonlinearity g. We will then pass z through g, and that will give us y; you can see that represented right here. Okay, now what if we wanted a multi-output neural network, not one output but two outputs? How would we change this picture? It's actually
pretty simple: we just create a second perceptron. We now have two neurons instead of one; both neurons have the exact same inputs, but because their weights are different, they will have two different outputs. They both take the same information as input, they process it their own way with their own weights, and they produce two different outputs. Now these types of layers are typically called dense layers, because everything in my inputs is connected to everything in my outputs, and if you
exclude the nonlinearity, this is also a linear layer: it takes all of my inputs X and just linearly operates on them with my weights W and adds a bias, which is also a linear operation. So we can actually implement this entire operation from scratch in Python; let's try it out. We start by defining those two sets of parameters: we define self.W, our weight matrix, and self.b, our bias. The bias for a single neuron is just one number, but since we now have
an n-dimensional output, we'll actually have n neurons and therefore n bias entries in the output as well. When we want to do our forward pass through this layer, how do we do it? It's the same story as before: we take a dot product (here a matrix multiply), we add the bias, and then we apply our nonlinearity; here it's a sigmoid, but you could change this to any nonlinearity. In PyTorch you can see that there's an almost perfect analog between the left and the right side; it's the same story:
you create your two sets of parameters, your weights and your bias, you apply a matrix multiply, add the bias, and apply your nonlinearity, exactly the same as before. Now, luckily, TensorFlow and PyTorch have already implemented this type of dense or linear layer for us, so we don't need to do that; it was just a good learning exercise. You can just call it directly; you can see the function calls on the bottom of the slide.
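A minimal sketch of what that from-scratch dense layer might look like in PyTorch, following the steps just described (the layer sizes and names here are illustrative, not the exact code from the slide):

```python
import torch
import torch.nn as nn

class MyDenseLayer(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # one weight matrix and one bias vector, initialized randomly
        self.W = nn.Parameter(torch.randn(input_dim, output_dim))
        self.b = nn.Parameter(torch.randn(output_dim))

    def forward(self, x):
        # dot product (matrix multiply), add bias, apply nonlinearity
        z = torch.matmul(x, self.W) + self.b
        return torch.sigmoid(z)

# the built-in equivalent: a linear layer followed by a nonlinearity
layer = MyDenseLayer(2, 3)
built_in = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid())
print(layer(torch.randn(1, 2)), built_in(torch.randn(1, 2)))
```

The TensorFlow analog would be something like tf.keras.layers.Dense(3, activation="sigmoid").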
Now let's take a look at a single-hidden-layer neural network: not one where the output comes directly from a single perceptron, but where we actually have to pass through two layers. What does that look like? This is one where a single hidden layer is placed between our input layer and our output layer. Why do we call it hidden? Just because we don't directly observe the data that passes through this layer: the input layer is data that we provide to the model, the output layer is typically what we supervise over, and the hidden layer is one that is learned over
the course of training from the observable data. Since we now have a transformation both from inputs to hidden and from hidden to outputs, we now have two layers, and that's also going to mean that we need two weight matrices: we'll have a W1 on the left-hand side and a W2 on the right-hand side. Now, if we look at a single unit, a single perceptron, a single neuron in that hidden layer, let's take z2 for example, it's just the perceptron that we saw before; it's the same story, nothing
has changed: its answer is computed by taking a dot product, adding a bias, and passing it through the nonlinearity. If we took a different node, a different neuron, let's say z3, the one right below it, it would also be computed with a dot product, a bias, and a nonlinearity; it would take the same inputs as z2, but it would have different weights, so the dot product and the bias would be different. This picture again looks a bit messy, so I'm going to simplify it even more: I'm going to replace all of the arrows with a single
symbol, which is just going to denote this dense layer, this linear connection layer, that sits between these two components. And again we can see that to build a network like this, the convenience functions in TensorFlow and PyTorch really start to help us a lot, because we don't have to implement much from scratch. Now, if we wanted to create a deep neural network, how would we do that? What is a deep neural network? It is nothing more than sequentially stacking more and
more of these linear layers followed by nonlinearities, followed by more linear layers, followed by more nonlinearities, over and over again in a hierarchical fashion. So this is just a model where the final output is created as a hierarchical combination of going deeper and deeper into these linear-followed-by-nonlinear operations. (A question from the audience asked for a quick real-world example of why we would have different layers along the depth axis but also different outputs along the up-
down axis.) Different layers along the depth axis correspond to more depth, more complexity, in your network. For more complex tasks you want more depth, because you're introducing more hierarchical nonlinearity after each layer; if you have a single dense connection followed by a nonlinearity, the amount of complexity you can extract is limited, because it's only coming from one nonlinearity, so it's limited to the expressive capacity of that single nonlinearity. As you get more and more complex tasks, you require deeper and deeper, more expressive functions. So
that's one axis. On the other axis, more outputs: this is just the problem definition, so if you want to predict more things, you need more outputs. A good example: if you wanted to do generation, say to generate an image, you would need to generate values for every pixel in that image, which is a lot of outputs, versus if you just want to predict, say, the weather tomorrow, that's a temperature value, just one output, one number. So depending on your problem definition, those two things can change.
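As a sketch of what that sequential stacking looks like in code, here is a small deep model built from the built-in linear layers and nonlinearities (the layer sizes are arbitrary, just for illustration):

```python
import torch.nn as nn

# a deep network is just linear layers and nonlinearities stacked in sequence
model = nn.Sequential(
    nn.Linear(8, 16),   # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(16, 16),  # second hidden layer
    nn.ReLU(),
    nn.Linear(16, 2),   # output layer with two outputs
)
print(model)
```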
Excellent. Okay, so now that we have an idea of what architecturally makes up a neural network, I think it's time we actually start to compose all of this together, and, in line with the question that just came up, let's go through an example of applying some of this theory in practice and understand how we can apply a neural network to solve a very real problem; let's say maybe not that real, but maybe real for all of
you who have been thinking about it. Here's a question that maybe all of you have been asking yourselves: will I pass this class? Let's try to build a neural network that can infer or predict this answer for all of you. We're going to do this by building a very simple model: it will take two inputs and produce one output. The one output is going to be "will I pass this class, yes or no", a single number, the probability of passing the class, and the two inputs are defined by
number one, how many lectures you attend over the course of this one week, and number two, the number of hours that you spend on your final project. So let's plot this: because we've taught this class for many years, we actually have data from past students on exactly this. All of the green points are people that passed the class, all of the red points are people that did not, and we can also plot where you are, or your guess of how many lectures you're going to attend and how
many hours you're going to spend on the final project. What we want to do is build a neural network that will determine, from all of this past data of all of these students, where you will fall on this probability of passing versus not passing. So let's do it; we've actually learned all of this so far in the class, so let's take it step by step. We have two inputs for this new person: you've attended four lectures and you've spent five hours on your final
project. Those two numbers you can feed in as input, on the left-hand side, to your model. We also have a single-hidden-layer neural network, a very basic neural network, which we'll just start with for now; its hidden layer has three hidden units, and its output layer has one output, a binary output, yes or no, on passing the class. What we're going to see is that this model gets the answer very wrong: it predicted that you would pass the class
with probability 0.1, or 10%, when in reality you did very well, you definitely passed the class. So can anyone tell me why you think this network failed so badly here? (Answer from the audience.) Exactly, the answer was that it's not trained, and that's exactly right. The model here hasn't seen any of the data that we showed on the previous slide; it's basically like a baby that has no knowledge about the real world. It doesn't know anything about this problem yet; it needs to first learn about this problem.
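A minimal sketch of that untrained two-input, three-hidden-unit, one-output model; because the weights are randomly initialized, its prediction is essentially meaningless until we train it, which is exactly the point:

```python
import torch
import torch.nn as nn

# two inputs -> three hidden units -> one output probability
model = nn.Sequential(
    nn.Linear(2, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.Sigmoid(),
)

# [number of lectures attended, hours spent on the final project]
x = torch.tensor([[4.0, 5.0]])
print(model(x))  # an arbitrary probability -- the weights are still random
```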
This is something that we haven't talked about so far: in order to train our model, the model has to also understand when it makes bad predictions. What does a bad prediction mean? It means the model has to be able to quantify how bad a prediction is versus how good a prediction is. This is called the loss of a neural network. A neural network's loss is just a measure of how far apart its predictions are from the ground truth answers, the ground truth observations, for a
piece of data. The smaller your loss, the closer these two things are: your predictions really match the ground truth, and this results in a small loss. Now let's assume that our data is not just from one student; we actually have past data from many students, and we care about how the model is doing not just for this one student but aggregated, empirically, across the entire past class. This is what we call training not just on a single data point but on
an entire data set. When we train neural networks, we want to find networks that minimize our loss, or maximize our accuracy, not just on one student but on the aggregate empirical data set. This is called the empirical loss, and it's simply the average of my loss over every data point in my data set. Right now we've been focusing on this problem of binary classification, yes/no answers, and for those types of problems we can use what's called a softmax cross-entropy loss. We'll learn more about this later, but this is
measuring the difference, or the distance, between two probability distributions, here two binary probability distributions. Now let's suppose that instead of predicting a binary output we want to predict a final output that is a real number, a continuous value; say a grade, a percentage grade, instead of "will I pass the class or not": a percentage grade of how well I'll do. For something like that we can't use a binary loss anymore, so we'll have to change our loss. We can change it,
for example, to a mean squared error loss: we take our two grades, the predicted grade and the true grade, subtract them, and then square the difference to create a distance measure. These are roughly the two types of losses that you'll see: categorical, discrete losses like the binary cross-entropy loss, as well as continuous losses like the MSE loss. Of course there are many other losses that you'll get exposed to over the class, but these two have very wide coverage in the field.
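Here is a minimal sketch of those two losses using the built-in PyTorch versions (the predicted and true values are made up for illustration):

```python
import torch
import torch.nn as nn

# binary classification: cross-entropy between predicted probability and true label
bce = nn.BCELoss()
pred_prob = torch.tensor([0.1])   # model says 10% chance of passing
true_label = torch.tensor([1.0])  # the student actually passed
print(bce(pred_prob, true_label))

# continuous regression: mean squared error between predicted and true grade
mse = nn.MSELoss()
pred_grade = torch.tensor([72.0])
true_grade = torch.tensor([90.0])
print(mse(pred_grade, true_grade))  # (72 - 90)^2 = 324
```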
Okay, let's put all of this information together and start talking about the problem of actually finding the weights of the network. We've talked about defining the network, we've talked about penalizing the network when it gets something wrong; we have not talked about how to actually improve, or train, the network, so let's talk about that in this next part. The objective here, what we're ultimately trying to do throughout this entire class, is to find and build networks, build models, that minimize the loss on a data set. The loss measures the
difference between predicted and true, and we want to find a network that minimizes the loss on a data set. Mathematically, walking through this equation, it means that we want to find the W's, the weights, that result in the minimum loss averaged over the entire data set, from 1 to n. Remember that W is just a collection of all of the weights in our entire model, the weights from every single layer in our network; we're
going to combine those into one piece, and those are the weights that we're going to try to optimize over. Now, how do we do this optimization procedure? Well, remember that our loss function is just a function of our weights: given a set of weights, our loss function returns a single value, which is how far apart our predicted answers are from our true answers. If we only had two weights in our network, then we would be able to plot our loss landscape, a picture like this: we would plot
it on a grid over weights w1 and w2, and for every configuration of my weights I'd be able to see how much error, how much loss, that configuration of weights obtains. What we want to do is find the lowest point on this landscape: we want to find which w1 and w2 give us the smallest loss. So how can we do this? Well, we can start at some random point; we pick a random point in our landscape, any point, and starting from this point
we will compute what's called the gradient. The gradient tells us which way is up from this point; it's a local measure, it only tells us, locally from where I stand right here, which way is up. What I'll do is take a small step in the opposite direction, a small step going down the loss, and then I will repeat this process over and over again until I finally get to the bottom of the hill and I converge at
what's called a local minimum. You can summarize this procedure as what's known as gradient descent; in pseudocode, let's go through it one more time very briefly. We start by randomly initializing our weights; this means we randomly pick a place in our landscape. We compute the gradient, here called dJ/dW: this is how much a small change in our weights changes our loss, so it tells us the direction we should change our weights in order to increase the loss. We then take a small step
in the opposite direction: we take that gradient, multiply it by negative one to go in the opposite direction, and multiply it by a small step size, let's call it eta; eta here dictates how far in that direction we actually move. And then we repeat this in a loop, over and over again. In TensorFlow you can see this exactly represented the same way.
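As a rough sketch, the same loop in PyTorch might look like this (the model, data, and learning rate here are placeholders; the lecture's slide shows the TensorFlow version):

```python
import torch

# placeholder model, data, and loss just to make the loop concrete
w = torch.randn(2, requires_grad=True)          # randomly initialize weights
x, y_true = torch.tensor([1.0, 2.0]), torch.tensor(3.0)
eta = 0.01                                      # learning rate (step size)

for _ in range(1000):
    y_pred = torch.dot(w, x)                    # forward pass
    loss = (y_pred - y_true) ** 2               # loss J(w)
    loss.backward()                             # compute gradient dJ/dw (backpropagation)
    with torch.no_grad():
        w -= eta * w.grad                       # small step in the opposite direction
        w.grad.zero_()                          # reset gradient for the next iteration
print(w, loss.item())
```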
But here I want to draw your attention to the gradient term: it's the direction term, it tells us which direction is up, or, if you take the negative of it, which direction is down. I never actually told you how to compute this, though; I just told you that we need it. The process of computing this gradient in a neural network is called backpropagation, so I think it would be helpful to take a quick step-by-step example walking through how backpropagation works and how you
would compute this gradient for a particular neural network. Just for demonstration, we'll start with the simplest neural network that exists: it consists of one input, one output, and one hidden neuron in the middle; you cannot get a simpler network than this. We want to compute the gradient of our loss, here called J, at the end, with respect to, let's start with, w2. So how much does a small change in w2 affect our loss? We can write
out this derivative in math and use the chain rule to decompose it. Why would we want to decompose it? First of all, we decompose this gradient dJ/dw2 into two terms, dJ/dy and dy/dw2; this is just a basic application of the chain rule, nothing magic here. Why is this possible? It is possible because y depends only on the previous layer. Now let's suppose we wanted to compute the gradient of the weight before w2, let's say
w1. What we can do is just replace w2 in this equation with w1, and then we have to apply the chain rule yet again, because this last term, dy/dw1, cannot be computed directly; we have to expand it one more time, through the hidden activation z1. This is why we call it backpropagation: you have to start from the output and keep computing these iterative chain rules backwards over the course of your network, step by step, and we repeat this process of propagating those gradients all the way from
output to input across our weights. At the end of this whole process, what we're left with, for every single weight in our network, is a direction that says: if we increase this weight a little bit, will our loss go up or down? If the loss would go down, that means we should increase that weight a little bit; otherwise we go in the opposite direction. And that's the backpropagation algorithm. In theory it's nothing more than an application of the chain rule from differential calculus.
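A minimal sketch of that chain-rule computation, letting PyTorch's autograd do the bookkeeping for the one-input, one-hidden-neuron, one-output network (the weights, input, and target are made-up values, and biases are omitted for brevity):

```python
import torch

# simplest possible network: x -> (w1) -> z1 -> (w2) -> y_hat
x = torch.tensor(2.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)
y_true = torch.tensor(1.0)

z1 = torch.relu(w1 * x)        # hidden neuron
y_hat = w2 * z1                # output
J = (y_hat - y_true) ** 2      # loss

# backpropagation: chain rule applied from the output back to every weight
J.backward()
print(w1.grad, w2.grad)        # dJ/dw1 and dJ/dw2
```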
But in practice it can get very messy and very hairy; it's a very computationally intensive thing to do, because you have to do this step by step for every single weight in your model. In practice, today's deep learning frameworks like TensorFlow and PyTorch do this automatically, so you don't necessarily need to implement it yourself, but it's important to understand the theoretical side of how these things are operating and what's happening underneath the hood. I want
to also use that as an opportunity to discuss with you some of the practical implications of training neural networks in reality. I showed you this previous picture of a very pretty loss landscape that was very smooth, but in practice optimizing neural networks is extremely difficult. Neural networks are extremely high-dimensional search spaces, so we don't actually know what this picture really looks like, but this is a projection of the loss landscape of a deep neural network from a paper that came
out several years ago, around 2017, and you can visualize how messy some of these loss landscapes look, so that applying these types of backpropagation and optimization techniques is very, very challenging. I also want you to recall that before we took that dive into backpropagation and the gradient term in particular, we started to talk about this equation that you see here: how do we update the weights? We update them by taking a small step, a small increment, in the opposite direction of the gradient.
Now, this is the key term I want to focus on: this small step is called the learning rate of our model. It basically dictates how quickly we take those steps, how strongly we listen to our gradients as we're computing backpropagation, and in practice setting the learning rate can be very difficult. If we set the learning rate too small, then we get stuck in some local minimum near where we started, which may not be the best
minimum that we could get to. If we set it too large, then we get unstable behavior where we basically overshoot: we start to step in the right direction, but we step too far, and then we explode out of the stable regime of learning. Ideally we want to set learning rates that are not too small, so that they can escape some of the local minima, but also not too big, so that they don't diverge and can still converge. So how do we actually set the learning rate? One
option, and actually a very common option, is to just try a bunch of learning rates and see what works best. How can you do better than this? Well, the idea is: can you design adaptive algorithms that, depending on how they are optimizing in the search space, adapt the learning rate? Can you change the learning rate as a function of the landscape itself? Practically speaking, this means that your learning rate will increase or decrease as a function of your gradients, as a function of your data,
how fast you're learning, how steep the landscape is; all of these different things can dictate the adaptive properties of a learning rate. In fact these have been very widely studied, and many different types of adaptive learning rate algorithms have been created; here you can see some examples. A lot of them start with "Ada" for adaptive; these are different variations of these adaptive properties. Adam in particular is one extremely widely used
type of optimization procedure that you'll be using throughout many of your labs, but I encourage you to really try out and experiment with all of these different optimizers and learning rate schedulers to see what works best. Many times there will be different optimizers that work better for different types of problems, so you should definitely try out the different pieces, and trying them out is oftentimes as easy as a single line change to your training loop.
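For instance, swapping optimizers in PyTorch really is roughly a one-line change (the model and learning rates below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)  # placeholder model

# plain stochastic gradient descent...
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# ...or an adaptive optimizer, swapped in with a single line
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```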
Now, SGD, stochastic gradient descent, is just that base gradient descent algorithm that we had seen before, and I actually want to dig into that a little bit more, because what I presented was actually the gradient descent algorithm, not the stochastic gradient descent algorithm, so I want to tell you a little bit about the difference between those two types of algorithms. To understand that we have to first revisit, one more time, the gradient descent algorithm. The gradient here, this piece that we computed
with backpropagation, is very computational, because if you look at it, it's computed as a summation, or an average I should say, over all of the data points in my data set. I compute the gradient for not just one data point but all of my data points; that's why it's very expensive. In most real-life problems it is not really feasible to compute your gradient over your entire data set on every single iteration of this step, because remember, we don't compute the gradient just once; we compute it
at every point along this optimization procedure, and you're optimizing your network for millions of steps or more; you don't want to be looping through your entire data set on every single one of those steps. So let's define a new type of gradient descent; we'll call it stochastic gradient descent. Instead of computing the gradient over my entire data set, I'm going to compute a very noisy gradient: a gradient computed over just one data point in my data set. I randomly pick a data
point and compute the gradient with respect to that one data point, not my entire data set. This is going to be way noisier, obviously, because that one data point is not going to be representative of my entire data set, but it will give me an answer way quicker, so I can get through more steps. There's a natural trade-off here: we want to go fast, but we also don't want to be too noisy, and there is obviously a middle ground. Instead of computing the noisy
gradient on one example, we can do what's called mini-batch gradient descent. Mini-batch gradient descent is where you set a batch size, and then on every iteration you compute your gradient with respect to not just one data point but, let's say, k data points, where k is pretty small, something like 32 or 128, something on that scale. You look at your gradient with respect to those, say, 32 data points and then you average that gradient; it gives you a bit more reliability and robustness in your estimate, but you
also get the speed: you're not going over your entire data set, and 32 is usually way, way smaller than your entire data set. So what does this mean? It means that we now have an increase in gradient accuracy compared to stochastic gradient descent, so we can converge much more smoothly; we're not super noisy, going after one data point at a time. But it also means that we can be much quicker than full gradient descent, where we go over the entire data set as a whole.
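A minimal sketch of mini-batch gradient descent using PyTorch's data utilities (the data set, model, and batch size are placeholders for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# placeholder data set: 1,000 examples with 2 features and 1 target each
X, y = torch.randn(1000, 2), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:               # each iteration sees one mini-batch of 32
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)   # loss averaged over the mini-batch
    loss.backward()                 # gradient averaged over the mini-batch
    optimizer.step()                # one gradient descent step
```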
Because we're more stable on the one side, we can also increase our learning rate; these two things are extremely connected, and the relationship between your gradients and your learning rate is one you should have a very good intuition about. Because your gradients are now more stable (you're averaging over a mini-batch, not just a single sample), you can start to take bigger steps: you can trust the gradient a bit more over the course of optimization. It also allows you to really parallelize training, because if you want
to compute your gradient over 32 data points, you can parallelize that across 32 processes on your GPU, computing them in parallel as opposed to one at a time; this allows you to utilize GPU speedups even further. Now, the last topic I'll touch on before we take a short break for Lecture 2 is going to be overfitting and regularization of neural networks. This is a huge problem, not just in deep learning; we want to cover it because it's one that you're going to get
exposure to in today's lab especially; it's one of the most fundamental topics in all of machine learning. Ideally, in machine learning we want to build models that don't just work well on a training set. We do train our models on training sets, but we don't want them to work well only on the training set; in fact, oftentimes we don't really care how well the model works on the training set at all. We use it as a proxy, because what we
really care about is how well the model works on brand new data when we deploy it into the wild, and there it's not our training data at all, it's brand new test data. The relationship between these two things is extremely important: we use the training data as a proxy, but ultimately we don't care about it all that much. Another way to say this is that when we build models, we want to learn representations from our training data, but we still want them to generalize to unseen test data as well. Take this
picture, for example, and assume you want to fit a line that describes the relationship between the x and y points in the picture. On the left-hand side you have a very simple model, a linear model: it can describe the training points, and it will probably also describe the test points with some decent faithfulness, but it's not fully capturing the richness and the complexity of our data set, in either the training set or the test set, so we're not utilizing the full expressive capacity of the model. Move
all the way over to the right-hand side and you can see that we're starting to memorize data points in the training set, so much so that we're hurting our performance on brand new test data, because we're weighting too heavily what we've seen during training. What you always want is to end up in the middle: you want to leverage your training points but not rely on them too much or memorize them. (A question from the audience: can you give a real example of the problem
which we face in the overfitting yes of course so a real life example of overfitting would be let's say if you have a very small data set but a very large Network you'll you'll learn a model that just memorizes uh all of the data in your data set and it will be it's it's not like it's doing something bad because uh it has the power to memorize everything in the training set remember always that models don't see test set it's unseen data so all they can see is your training set what you give it to
them so if you give them a very small training set and a very big model the model will do what it's supposed to do and learn exactly the training set to the full capacity right but then when you show it more test data it's not going to be very faithful to the training data because it's not going to be perfectly from the same distribution okay yep maybe well I think with this s of example the idea is to I see I see yeah so the stochasticity is coming purely from the selection operator so maybe it's
[Audience question about where the stochasticity in stochastic gradient descent comes from.] So why do we call it stochastic gradient descent? It's because of the selection process: we don't compute the gradient over the entire data set, we stochastically select a subset of the data, and that selection is random. You take that stochastic selection of data and then compute the gradient with respect to those data points, whatever they happen to be. The stochasticity comes entirely from the selection step, not from the gradient computation itself.
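In code, the "stochastic" part really is just a random draw of indices before the (deterministic) gradient computation; a bare-bones illustration with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))  # toy dataset, values made up
batch_size = 32

# All of the randomness lives in this selection step...
idx = rng.choice(len(X), size=batch_size, replace=False)
batch = X[idx]

# ...whatever gradient you compute on `batch` afterwards is a deterministic
# function of the points that happened to be selected.
```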
[Audience question: is there a more adaptive way of selecting the data, rather than selecting it purely at random?] Exactly, so the question is whether there is a more adaptive way of doing the selection, as opposed to being truly stochastic, and the answer is yes, definitely. Truly random presentation of data is actually not very realistic either, even though it's the convention: we as humans don't operate like this. We don't just see data at random; we see data sequentially over time, and we see it with meaning and purpose. In tomorrow's lecture you'll see an example of how to do this kind of adaptive selection and the benefits it brings. Great question. Okay, so I'll just very briefly wrap up with regularization. Regularization is a technique that lets you discourage this kind of complex memorization: if you have a very small data set and a big model, you want to discourage the model from simply memorizing that data set.
So how can you discourage that kind of behavior from being learned? As we've seen, this is really critical for the overall performance of the model, because we don't care about the training results; ultimately we care about the test results. The most popular regularization technique is actually a very simple idea that you'll use in almost all of your labs in this course: Dropout. So what is Dropout? Let's revisit this picture of a deep neural network. In Dropout, all we do during training is randomly set some activations of the hidden neurons to zero with some probability. Say we set dropout to 50%: for 50% of our neurons, we drop out their activations, setting them to zero. This forces the network not to rely too heavily on the output of any single neuron. After a neuron gets dropped, the next layer can't depend on or memorize its output, because there is now extra stochasticity inside the forward pass of the model, not just in how the data set is selected. Even if I pick the exact same data point twice and put it through the model twice, Dropout adds another level of stochasticity, so the model can't treat the same data the same way both times. This is an extremely powerful idea, because all it's really doing is lowering the effective capacity of the model: it discourages the model from learning a single pathway through the network and forces it to learn multiple pathways for making a single decision. On every iteration, every time the model sees a new piece of data or does a forward pass, we repeat this process, so each forward pass takes a random path through the model.
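In the frameworks you'll use in the labs this is usually a single layer; here is a minimal PyTorch sketch (layer sizes are arbitrary, just for illustration) showing that the same input gives different outputs under training-time dropout and identical outputs once dropout is switched off for evaluation:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # during training, each activation is zeroed with probability 0.5
    nn.Linear(64, 1),
)

x = torch.randn(4, 10)

model.train()                              # dropout active: a fresh random mask on every forward pass
print(torch.allclose(model(x), model(x)))  # almost certainly False

model.eval()                               # dropout disabled at evaluation time
print(torch.allclose(model(x), model(x)))  # True
```

Note that PyTorch uses "inverted" dropout, scaling the surviving activations by 1/(1-p) during training, so nothing needs to be rescaled at evaluation time.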
The final technique I'll show you is the notion of early stopping. Early stopping basically means that we monitor the gap between our training loss and our test loss. We can get a proxy for the test loss by having a held-out set; it's not a true test loss, but it's another proxy that we do not train on. What we can do is monitor how well the model is doing on both the training set and this held-out set, let's call it a validation set. In the beginning, as we train, both of these curves go down, which makes sense: the model is learning, getting stronger over the course of training. Eventually, though, you'll see the validation loss plateau and then start to increase. If the model has enough capacity, the training loss should keep going down, always getting better and better on the training set, but at some point the model starts to memorize the training data, which causes the validation loss to creep back up. This pattern continues for the rest of training, and here's the point you should really focus on: if you plotted this curve and saved your model at each of these stages, you would only keep the checkpoint from that point. Even though the training loss kept improving afterward, so it looks like you have a better model if you only look at the training set, on the validation set you can see the model has started to memorize pieces of the training data. So you do not take the models on the far right; you take the models in the middle.
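A hedged sketch of how that checkpoint selection is often implemented; this is not the lab code, and the toy data, model, and patience value are made up. The loop keeps the weights from the epoch with the lowest validation loss and stops once the validation loss has failed to improve for a while:

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Toy regression data, split 70/30 into training and validation sets (values made up).
X, y = torch.randn(1000, 10), torch.randn(1000, 1)
train_loader = DataLoader(TensorDataset(X[:700], y[:700]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[700:], y[700:]), batch_size=100)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

best_val, best_state, bad_epochs, patience = float("inf"), None, 0, 5

for epoch in range(100):
    model.train()
    for xb, yb in train_loader:              # ordinary training pass
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                    # monitor the loss on the held-out set
        val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)

    if val < best_val:                       # still improving: remember this checkpoint
        best_val, bad_epochs = val, 0
        best_state = copy.deepcopy(model.state_dict())
    else:                                    # validation loss has stopped improving
        bad_epochs += 1
        if bad_epochs >= patience:
            break                            # stop early rather than keep memorizing

model.load_state_dict(best_state)            # keep the model "from the middle", not the last one
```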
[Audience question: do you evaluate on the validation set at every training iteration?] Not at every iteration, because that adds unnecessary compute. What people typically do is run a validation pass once every so many iterations, and you don't need to run it over your entire held-out set; you can do it stochastically, in a batch, as well. For example, every thousand iterations you might evaluate on a batch of, say, 100 held-out data points just to get an approximate estimate. [Audience question about whether dropped-out neurons receive gradient updates.] No, the dropped nodes will not get gradients, because no information flows through them, but all of the other nodes will get an update. [Audience question: do the training and validation sets need to be separate?] Yes, exactly; for this to work they should be separate. That's a key assumption. Ideally you take your training data and cut it at some ratio: for example, you use 70% of it for actual training and hold out the other 30% for validation.
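A tiny sketch of that kind of split, assuming you hold the data in arrays; the 70/30 ratio and the shapes are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend we have 1,000 labeled examples (shapes are made up for this sketch).
X = rng.normal(size=(1000, 10))
y = rng.normal(size=(1000, 1))

# Shuffle once, then cut 70/30 into a training set and a held-out validation set.
perm = rng.permutation(len(X))
split = int(0.7 * len(X))
train_idx, val_idx = perm[:split], perm[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]  # never used for gradient updates, only for monitoring
```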
Okay, last question. [Audience question: is there an ideal difference in loss between the test and training sets?] Great question. There's no ideal; ideally there would be no difference at all. In practice, there are situations where the difference is very small. Let me give an example: if your training set is so massive that it's impossible for the model to memorize it, then you'll see the training and test curves stay very close to each other. A good example of this is language modeling: even massive language models still have trouble memorizing their entire training corpus, simply because language data is so vast, so their training and test curves look very similar. That's also why we have to do other types of validation for them; language models don't really have the classical overfitting problems that other kinds of deep learning models have. They have other problems, which we'll talk about later. Okay, awesome. I'll conclude now by summarizing the three points we talked about in this lecture before we jump into lecture number two. First, we talked about building neural networks and the architectures of neural networks: the base operation, the base architecture, is called a perceptron, a single neuron. We learned how we can stack those single neurons together to form complex hierarchical networks, and how we can mathematically optimize those networks using data. Finally, we addressed a lot of the practical implications, everything from batch gradient descent to overfitting, regularization, and optimization of these models. In the next lecture we're going to hear from Ava on deep sequence modeling, which is the backbone of large language models; it's a really exciting lecture, so hopefully everyone enjoys it. I think what we'll do is take a five-minute break so Ava and I can switch laptops, and then we will continue with the lecture. After the lecture we have the software labs, followed by a reception and food. Okay, thanks everyone.