Gradient descent is like trying to find your way down a dark mountain. You can't see where you're going, so you have to feel your way around: you take small steps in the direction that feels the most downhill, and eventually, if you keep going, you'll find your way to the bottom. That's gradient descent. Let's get into it. So gradient descent is a common optimization algorithm used to train machine learning models and neural networks. By training on data, these models can learn over time, and because they're learning over time, they can improve their accuracy.
Now, you see, a neural network consists of connected neurons, and those neurons are arranged in layers. Those layers have weights and biases, which describe how we navigate through the network. We provide the neural network with labeled training data to determine what we should set those weights and biases to in order to figure something out. So, for example, I could input a shape, let's say a squiggle like that, and then the neural network can learn that this squiggle as our input represents the output of the number three.
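To make the weights-and-biases idea a bit more concrete, here's a minimal sketch of a forward pass through a tiny network in Python. The layer sizes, the random weights, and the digit-classification framing are assumptions chosen just for illustration, not the exact network from the squiggle example.

```python
import numpy as np

# A minimal sketch of a forward pass through a tiny two-layer network.
# The layer sizes, weights, and biases here are made up for illustration.
rng = np.random.default_rng(0)

x = rng.random(4)                            # input features (e.g., pixels of a tiny "squiggle")
W1, b1 = rng.random((8, 4)), np.zeros(8)     # first layer: weights and biases
W2, b2 = rng.random((10, 8)), np.zeros(10)   # second layer: one output score per digit 0-9

hidden = np.maximum(0, W1 @ x + b1)          # ReLU activation in the hidden layer
logits = W2 @ hidden + b2                    # raw scores for each digit
prediction = int(np.argmax(logits))          # the digit this (untrained) network picks

print(prediction)
```

Training is then a matter of adjusting W1, b1, W2, and b2 so that the predictions match the labels.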
After we train the neural network, we can provide it with more labeled data, like this squiggle, and then we can see if it can also correctly resolve that squiggle to the number six. If it gets some of these squiggles wrong, the weights and biases can be adjusted, and then we just try again. Now, how can gradient descent help us here? Well, gradient descent is used to find the minimum of something called a cost function. So what is a cost function? Well, it's a function that tells us how far off our predictions are from the actual values.
The idea is that we want to minimize this cost function to get the best predictions. To do this, we take small steps in the direction that reduces the cost function the most. If we think about this on a graph, we start here and we keep going downhill, reducing our cost function as we go. The size of the steps that we take, so the size of the steps from here to here and to here, is called the learning rate.
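As a rough sketch of what "stepping downhill" looks like in code, here is gradient descent on a simple made-up cost function; the cost function, starting point, step count, and learning rate are assumptions chosen only to illustrate the update rule.

```python
# A rough sketch of gradient descent on a simple, made-up cost function.
# cost(w) = (w - 3)^2 has its minimum at w = 3; its gradient is 2 * (w - 3).

def cost(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = 10.0             # start somewhere up the "mountain"
learning_rate = 0.1  # the size of each step downhill

for step in range(50):
    w = w - learning_rate * gradient(w)  # step in the direction that reduces the cost

print(w, cost(w))    # w ends up close to 3, where the cost is near its minimum
```

If the learning rate is too small the descent is slow; if it's too large the steps can overshoot the bottom.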
Let's think about another example. Let's consider a neural network that, instead of dealing with squiggles, predicts how much a house will sell for. So first we train the network on a labeled data set. Let's say that data has some information like the location of a house, the size of the house, and then how much it sold for. With that, we can then use our model on new labeled data. So here's another example: we've got a house, its location given by ZIP code, and its size, let's say 3,000 square feet. We input that into our neural network: how much does this house sell for?
Well, now our neural network will make a forecast. It says: we think this house sold for $300,000. We compare that forecast to the actual sale price, which was $450,000. Not a good guess; we have a large cost. The weights and biases now need to be adjusted, and then the model can try again. Did it do any better over the entire labeled data set, or did it do worse? That's what gradient descent can help us with.
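Here's a minimal sketch of one common cost function, mean squared error, applied to this kind of comparison. The first forecast/actual pair comes from the example above; the other pairs and the choice of mean squared error itself are assumptions for illustration.

```python
import numpy as np

# A minimal sketch of one common cost function: mean squared error (MSE).
# The first pair comes from the example above ($300k forecast vs. $450k actual);
# the remaining pairs are made up to show the cost averaged over a data set.
predictions = np.array([300_000, 510_000, 240_000])
actuals     = np.array([450_000, 500_000, 250_000])

cost = np.mean((predictions - actuals) ** 2)  # average of the squared errors
print(cost)
```

The worse the predictions, the larger this number, and gradient descent's job is to push it down.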
Now, there are three types of gradient descent learning algorithms, so let's take a look at those. First of all, we've got batch gradient descent. This sums the error for each point in the training set, updating the model only after all the training examples have been evaluated, hence the term batch. Now, how well does this do? Computationally it's efficient, so you can give it a high rating there, because we're doing things in one big batch. But what about processing time? Well, we can end up with long processing times using batch gradient descent, because with large training data sets it needs to store all of that data in memory and process it. So that's batch. Another option is stochastic gradient descent, and this evaluates each training example one at a time instead of in a batch. Since you only need to hold one training example, it's easy to store in memory, and you get individual updates much faster. So in terms of speed, that's fast, but in terms of computational efficiency, that's lower. Now, there is a happy medium, and that's called mini-batch. Mini-batch gradient descent splits the training data set into small batches and performs an update on each of those batches, which is a nice balance of computational efficiency and speed.
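To make the difference concrete, here's a minimal sketch of the three update schemes on a tiny linear model. The data, the model, the learning rate, and the batch size are all made-up assumptions; the point is only how often each scheme updates the weights.

```python
import numpy as np

# A minimal sketch of batch, stochastic, and mini-batch gradient descent on a
# tiny linear model with squared-error cost. The data, learning rate, and batch
# size are made up purely to show how often each scheme updates the weights.
rng = np.random.default_rng(0)
X = rng.random((100, 3))                               # 100 training examples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(100)

def grad(w, Xb, yb):
    # gradient of mean squared error over the examples in (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

lr, epochs = 0.1, 20

# Batch: one update per pass, computed over the whole training set.
w = np.zeros(3)
for _ in range(epochs):
    w -= lr * grad(w, X, y)

# Stochastic: one update per training example.
w = np.zeros(3)
for _ in range(epochs):
    for i in range(len(y)):
        w -= lr * grad(w, X[i:i+1], y[i:i+1])

# Mini-batch: one update per small batch (here, 10 examples at a time).
w = np.zeros(3)
for _ in range(epochs):
    for start in range(0, len(y), 10):
        w -= lr * grad(w, X[start:start+10], y[start:start+10])

print(w)  # weights from the mini-batch run, close to the values used to make the data
```

Same gradient, same data; the only thing that changes is how many examples feed each update.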
Now, gradient descent does come with its own challenges. For example, it can struggle to find the global minimum in non-convex problems. Earlier we had a nice convex problem with a clearly defined bottom, so when the slope of the cost function is close to zero, or at zero, the model stops learning. But if we don't have that convex shape, if we have something like this shape instead, that's known as a saddle point, and it can mislead gradient descent into thinking it's at the bottom before it really is, when it could actually keep going down further. It's called a saddle shape because it kind of looks like a horse's saddle, I guess.
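A small sketch of why a saddle point is misleading: the textbook example f(x, y) = x² − y² has a gradient of zero at the origin even though it isn't a minimum. This particular function is just a standard illustration, not one from the example above.

```python
import numpy as np

# A small sketch of a saddle point: f(x, y) = x**2 - y**2 is the classic example.
# At (0, 0) the gradient is zero, so gradient descent "thinks" it's at a bottom,
# but moving along the y-axis shows the cost can still keep going down.

def f(x, y):
    return x**2 - y**2

def grad_f(x, y):
    return np.array([2 * x, -2 * y])

print(grad_f(0.0, 0.0))          # both components are zero: the slope vanishes here
print(f(0.0, 0.0), f(0.0, 1.0))  # 0.0 vs -1.0: stepping away in y still lowers the cost
```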
Another challenge is that in deeper neural networks, gradient descent can suffer from vanishing gradients or exploding gradients. Vanishing gradients are when the gradient is too small, and the earlier layers in the network learn more slowly than the later layers as we go back through the network. Exploding gradients, on the other hand, are when the gradient is too large, and that can create an unstable model.
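Here's a tiny sketch of how gradients can shrink layer by layer: backpropagation multiplies one derivative factor per layer, and the sigmoid's derivative is at most 0.25, so the product gets small fast. The ten-layer depth and the sigmoid choice are assumptions for illustration only.

```python
import numpy as np

# A tiny sketch of why gradients can vanish in deep networks: backpropagation
# multiplies one derivative factor per layer, and the sigmoid's derivative is
# at most 0.25, so the product shrinks quickly as layers are added.

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

gradient = 1.0
for layer in range(10):                   # pretend we backpropagate through 10 layers
    gradient *= sigmoid_derivative(0.0)   # 0.25 at z = 0, the largest it can ever be
    print(f"after layer {layer + 1}: gradient factor = {gradient:.8f}")

# The factor is already below 1e-6 after ten layers, so the earliest layers
# barely learn: that's the vanishing gradient problem. Exploding gradients are
# the mirror image, with per-layer factors greater than one.
```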
But look, despite those challenges, gradient descent is a powerful optimization algorithm, and it is commonly used to train machine learning models and neural networks today. It's a clever way to get you back down that mountain safely. If you have any questions, please drop us a line below, and if you want to see more videos like this in the future, please like and subscribe. Thanks for watching.