Machine Learning for Everybody – Full Course

freeCodeCamp.org
Learn Machine Learning in a way that is accessible to absolute beginners. You will learn the basics ...
Video Transcript:
Kylie Ying has worked at many interesting places such as MIT, CERN, and freeCodeCamp. She's a physicist, engineer, and basically a genius. And now she's going to teach you about machine learning in a way that is accessible to absolute beginners. What's up you guys? So welcome to Machine Learning for Everyone. If you are someone who is interested in machine learning and you think you count as everyone, then this video is for you. In this video, we'll talk about supervised and unsupervised learning models, we'll go through maybe a little bit of the logic or
math behind them, and then we'll also see how we can program it on Google Colab. If there are certain things that I have done, and you know, you're somebody with more experience than me, please feel free to correct me in the comments and we can all as a community learn from this together. So with that, let's just dive right in. Without wasting any time, let's just dive straight into the code and I will be teaching you guys concepts as we go. So this here is the UCI Machine Learning Repository. And basically, they just have a
ton of data sets that we can access. And I found this really cool one called the MAGIC Gamma Telescope data set. So in this data set, if you want to read all this information, to summarize what I think is going on, there's this gamma telescope, and we have all these high energy particles hitting the telescope. Now there's a camera, there's a detector that actually records certain patterns of, you know, how this light hits the camera. And we can use properties of those patterns in order to predict what type of particle caused that
radiation. So whether it was a gamma particle, or some other particle, like a hadron. Down here, these are all of the attributes of those patterns that we collect in the camera. So you can see that there's, you know, some length, width, size, asymmetry, etc. Now we're going to use all these properties to help us discriminate the patterns and whether or not they came from a gamma particle or a hadron. So in order to do this, we're going to come up here, go to the data folder. And you're going to click this magic04.data, and we're
going to download that. Now over here, I have a Colab notebook open. So you go to colab.research.google.com, you start a new notebook. And I'm just going to call this the magic data set. So actually, I'm going to call this freeCodeCamp MAGIC example. Okay. So with that, I'm going to first start with some imports. So I will import, you know, I always import NumPy, I always import pandas. And I always import matplotlib. And then we'll import other things as we go. So yeah, we run that. In order to run the
cell, you can either click this play button here, or on my computer, it's just shift enter and that will run the cell. And here, I'm just going to, you know, let you guys know, okay, this is where I found the data set. So I've copied and pasted this actually, but this is just where I found the data set. And in order to import that downloaded file that we got from the computer, we're going to go over here to this folder thing. And I am literally just going
to drag and drop that file into here. Okay. So in order to take a look at, you know, what does this file consist of, do we have the labels? Do we not? I mean, we could open it on our computer, but we can also just do pandas read_csv. And we can pass in the name of this file. And let's see what it returns. So it doesn't seem like we have the labels. So let's go back to here. I'm just going to make the columns, the column labels, all of these attribute names over here. So
I'm just going to take these values and make that the column names. All right, how do I do that? So basically, I will come back here, and I will create a list called cols. And I will type in all of those things: fLength, fWidth, fSize, fConc. And we also have fConc1. We have fAsym, fM3Long, fM3Trans, fAlpha. Let's see, we have fDist and class. Okay, great. Now in order to label those as these columns down here in our data frame. So basically, this command here
just reads some CSV file that you pass in (CSV stands for comma-separated values) and turns that into a pandas data frame object. So now if I pass in names here, then it basically assigns these labels to the columns of this data set. So I'm going to set this data frame equal to df. And then if we call head, that's just like, give me the first five things. Now you'll see that we have labels for all of these. Okay. All right, great. So one thing that you
might notice is that over here, the class labels, we have G and H. So if I actually go down here, and I do df["class"].unique(), you'll see that I have either G's or H's, and these stand for gammas or hadrons. And our computer is not so good at understanding letters, right? Our computer is really good at understanding numbers. So what we're going to do is we're going to convert this to one for G and zero for H. So here, I'm going to set this equal to this, whether or not that equals G. And
then I'm just going to say astype(int). So what this should do is convert this entire column: if it equals G, then this is true. So I guess that would be one. And then if it's H, it would be false. So that would be zero, but I'm just converting G and H to one and zero, it doesn't really matter, like, if G is one and H is zero or vice versa. Let me just take a step back right now and talk about this data set. So here I have some data frame, and I have
all of these different values for each entry. Now, you know, each of these is one sample, it's one example, it's one item in our data set, it's one data point. All of these things are kind of the same thing when I mention, oh, this is one example, or this is one sample or whatever. Now, each of these samples, they have, you know, one quality for each or one value for each of these labels up here, and then it has the class. Now what we're going to do in this specific example is try
to predict for future, you know, samples, whether the class is G for gamma or H for hadron. And that is something known as classification. Now, all of these up here, these are known as our features, and features are just things that we're going to pass into our model in order to help us predict the label, which in this case is the class column. So for you know, sample zero, I have 10 different features. So I have 10 different values that I can pass into some model. And I can spit out, you know, the class the
label, and I know the true label here is G. So this is this is actually supervised learning. All right. So before I move on, let me just give you a quick little crash course on what I just said. This is machine learning for everyone. Well, the first question is, what is machine learning? Well, machine learning is a sub domain of computer science that focuses on certain algorithms, which might help a computer learn from data, without a programmer being there telling the computer exactly what to do. That's what we call explicit programming. So you might have
heard of AI and ML and data science, what is the difference between all of these. So AI is artificial intelligence. And that's an area of computer science, where the goal is to enable computers and machines to perform human like tasks and simulate human behavior. Now machine learning is a subset of AI that tries to solve one specific problem and make predictions using certain data. And data science is a field that attempts to find patterns and draw insights from data. And that might mean we're using machine learning. So all of these fields kind of overlap, and
all of them might use machine learning. So there are a few types of machine learning. The first one is supervised learning. And in supervised learning, we're using labeled inputs. So this means whatever input we get, we have a corresponding output label, in order to train models and to learn outputs of different new inputs that we might feed our model. So for example, I might have these pictures. Okay, to a computer, all these pictures are pixels, they're pixels with a certain color. Now in supervised learning, all of these inputs have a label associated with them,
this is the output that we might want the computer to be able to predict. So for example, over here, this picture is a cat, this picture is a dog, and this picture is a lizard. Now there's also unsupervised learning. And in unsupervised learning, we use unlabeled data to learn about patterns in the data. So here are my input data points. Again, they're just images, they're just pixels. Well, okay, let's say I have a bunch of these different pictures. And what I can do is I can feed all these to my computer. And I
might not, you know, my computer is not going to be able to say, Oh, this is a cat, dog and lizard in terms of, you know, the output. But it might be able to cluster all these pictures, it might say, Hey, all of these have something in common. All of these have something in common. And then these down here have something in common, that's finding some sort of structure in our unlabeled data. And finally, we have reinforcement learning. And in reinforcement learning, well, usually there's an agent that is learning in some sort of interactive environment,
based on rewards and penalties. So let's think of a dog, we can train our dog, but there's not necessarily, you know, any wrong or right output at any given moment, right? Well, let's pretend that dog is a computer. Essentially, what we're doing is we're giving rewards to our computer, and tell your computer, Hey, this is probably something good that you want to keep doing. Well, computer agent terminology. But in this class today, we'll be focusing on supervised learning and unsupervised learning and learning different models for each of those. Alright, so let's talk about supervised learning
first. So this is kind of what a machine learning model looks like you have a bunch of inputs that are going into some model. And then the model is spitting out an output, which is our prediction. So all these inputs, this is what we call the feature vector. Now there are different types of features that we can have, we might have qualitative features. And qualitative means categorical data, there's either a finite number of categories or groups. So one example of a qualitative feature might be gender. And in this case, there's only two here, it's for
the sake of the example, I know this might be a little bit outdated. Here we have a girl and a boy, there are two genders, there are two different categories. That's a piece of qualitative data. Another example might be okay, we have, you know, a bunch of different nationalities, maybe a nationality or a nation or a location, that might also be an example of categorical data. Now, in both of these, there's no inherent order. It's not like, you know, we can rate the US one and France two, Japan three, etc. Right? There's not really any inherent
order built into either of these categorical data sets. That's why we call this nominal data. Now, for nominal data, the way that we want to feed it into our computer is using something called one hot encoding. So let's say that, you know, I have a data set, some of the items in our data, some of the inputs might be from the US, some might be from India, then Canada, then France. Now, how do we get our computer to recognize that? We have to do something called one hot encoding. And basically, one hot encoding is saying,
okay, well, if it matches some category, make that a one. And if it doesn't, just make that a zero. So for example, if your input were from the US, you might have 1, 0, 0, 0. India, you know, 0, 1, 0, 0. Canada, okay, well, the item representing Canada is one, and then France, the item representing France is one. And then you can see that the rest are zeros. That's one hot encoding. Now, there's also a different type of qualitative feature. So here on the left, there are different age groups, there's babies, toddlers, teenagers, young adults, adults, and
so on, right. And on the right hand side, we might have different ratings. So maybe bad, not so good, mediocre, good, and then like, great. Now, these are known as ordinal pieces of data, because they have some sort of inherent order, right? Like, being a toddler is a lot closer to being a baby than being an elderly person, right? Or good is closer to great than it is to really bad. So these have some sort of inherent ordering system. And so for these types of data sets, we can actually just mark them from, you know,
one to five, or we can just say, hey, for each of these, let's give it a number. And this makes sense. Because, like, for example, the thing that I just said, how good is closer to great than good is close to not good at all. Well, four is closer to five than four is close to one. So this actually kind of makes sense. And it'll make sense for the computer as well. Alright, there are also quantitative pieces of data and quantitative pieces of data are numerical valued pieces of data. So this could be discrete, which
means, you know, they might be integers, or it could be continuous, which means all real numbers. So for example, the length of something is a quantitative piece of data, it's a quantitative feature, the temperature of something is a quantitative feature. And then maybe how many Easter eggs I collected in my basket, this Easter egg hunt, that is an example of a discrete quantitative feature. Okay, so these are continuous. And this over here is discrete. So those are the things that go into our feature vector, those are our features that we're feeding this model, because our
computers are really, really good at understanding math, right, at understanding numbers, they're not so good at understanding things that humans might be able to understand. Well, what are the types of predictions that our model can output? So in supervised learning, there are some different tasks. There's one, classification, and basically classification is just saying, okay, predict discrete classes. And that might mean, you know, this is a hot dog, this is a pizza, and this is ice cream. Okay, so there are three distinct classes and any other pictures of hot dogs, pizza or ice cream, I can put
under these labels. Hot dog, pizza, ice cream. Hot dog, pizza, ice cream. This is something known as multi class classification. But there's also binary classification. And in binary classification, you might have hot dog, or not hot dog. So there's only two categories that you're working with, something that is something and something that isn't. That's binary classification. Okay, so yeah, other examples. So if something has positive or negative sentiment, that's binary classification. Maybe you're predicting whether your pictures are of cats or dogs. That's binary classification. Maybe, you know, you are writing an email filter, and you're trying to figure
out if an email is spam or not spam. So that's also binary classification. Now for multi class classification, you might have, you know, cat, dog, lizard, dolphin, shark, rabbit, etc. We might have different types of fruits like orange, apple, pear, etc. And then maybe different plant species. But multi class classification just means more than two. Okay, and binary means we're predicting between two things. There's also something called regression when we talk about supervised learning. And this just means we're trying to predict continuous values. So instead of just trying to predict different categories, we're trying to come
up with a number that, you know, is on some sort of scale. So some examples might be the price of Ethereum tomorrow, or it might be okay, what is going to be the temperature? Or it might be what is the price of this house? Right? So these things don't really fit into discrete classes. We're trying to predict a number that's as close to the true value as possible using different features of our data set. So that's exactly what our model looks like in supervised learning. Now let's talk about the model itself. How
do we make this model learn? Or how can we tell whether or not it's even learning? So before we talk about the models, let's talk about how can we actually like evaluate these models? Or how can we tell whether something is a good model or bad model? So let's take a look at this data set. So this data set is the Pima Indians diabetes data set. And here we have different number of pregnancies, different glucose levels, blood pressure, skin thickness, insulin, BMI, age, and then the outcome whether or not they
have diabetes one for they do zero for they don't. So here, all of these are quantitative features, right, because they're all on some scale. So each row is a different sample in the data. So it's a different example, it's one person's data, and each row represents one person in this data set. Now this column, each column represents a different feature. So this one here is some measure of blood pressure levels. And this one over here, as we mentioned is the output label. So this one is whether or not they have diabetes. And as I mentioned,
this is what we would call a feature vector, because these are all of our features in one sample. And this is what's known as the target, or the output for that feature vector. That's what we're trying to predict. And all of these together is our features matrix x. And over here, this is our labels or targets vector y. So I've condensed this to a chocolate bar to kind of talk about some of the other concepts in machine learning. So over here, we have our x, our features matrix, and over here, this is our label y.
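To make the features matrix and the labels vector a bit more concrete, here is a minimal sketch. The rows and their values are made up for illustration (they are not taken from the real diabetes data set), and it uses plain Python lists so it runs without any libraries; in a notebook you would more likely do the same thing with pandas.

```python
# Hypothetical rows in the spirit of the diabetes table; the values are made up.
# Column order: pregnancies, glucose, blood_pressure, skin_thickness,
#               insulin, bmi, age, outcome
rows = [
    [2, 120, 70, 30, 80, 28.5, 35, 0],
    [5, 165, 82, 35, 150, 36.1, 52, 1],
    [0, 98, 64, 22, 60, 24.3, 27, 0],
]

# Features matrix X: every column except the last one, one row per sample.
X = [row[:-1] for row in rows]

# Labels (targets) vector y: the last column only.
y = [row[-1] for row in rows]

print(X[0])  # the feature vector for the first sample
print(y[0])  # the target for that same sample
```

Each inner list of X is one feature vector, and the entry of y at the same position is the target for that vector.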
So each row of this will be fed into our model, right. And our model will make some sort of prediction. And what we do is we compare that prediction to the actual value of y that we have in our label data set, because that's the whole point of supervised learning is we can compare what our model is outputting to, oh, what is the truth, actually, and then we can go back and we can adjust some things. So the next iteration, we get closer to what the true value is. So that whole process here, the tinkering
that, okay, what's the difference? Where did we go wrong? That's what's known as training the model. Alright, so take this whole, you know, chunk right here, do we want to really put our entire chocolate bar into the model to train our model? Not really, right? Because if we did that, then how do we know that our model can do well on new data that we haven't seen? Like, if I were to create a model to predict whether or not someone has diabetes, let's say that I just train on all my data, and I see that all
my training data does well, I go to some hospital, I'm like, here's my model. I think you can use this to predict if somebody has diabetes. Do we think that would be effective or not? Probably not, right? Because we haven't assessed how well our model can generalize. Okay, it might do well after, you know, our model has seen this data over and over and over again. But what about new data? Can our model handle new data? Well, how do we get our model to assess that? So we actually break up our whole
data set that we have into three different types of data sets, we call it the training data set, the validation data set and the testing data set. And you know, you might have 60% here 20% and 20% or 80 10 and 10. It really depends on how many statistics you have, I think either of those would be acceptable. So what we do is then we feed the training data set into our model, we come up with, you know, this might be a vector of predictions corresponding with each sample that we put into our model, we
figure out, okay, what's the difference between our prediction and the true values. This is something known as loss. Loss is, you know, what's the difference here, in some numerical quantity, of course. And then we make adjustments, and that's what we call training. Okay. So then, once, you know, we've made a bunch of adjustments, we can put our validation set through this model. And the validation set is kind of used as a reality check during or after training to ensure that the model can handle unseen data still. So every single time after we train one iteration, we
might stick the validation set in and see, hey, what's the loss there. And then after our training is over, we can assess the validation set and ask, hey, what's the loss there. But one key difference here is that we don't have that training step, this loss never gets fed back into the model, right, that feedback loop is not closed. Alright, so let's talk about loss really quickly. So here, I have four different types of models, I have some sort of data that's being fed into the model, and then some output. Okay, so this output here
is pretty far from, you know, this truth that we want. And so this loss is going to be high. In model B, again, this is pretty far from what we want. So this loss is also going to be high, let's give it 1.5. Now this one here, it's pretty close, I mean, maybe not almost, but pretty close to this one. So that might have a loss of 0.5. And then this one here is maybe further than this, but still better than these two. So that loss might be 0.9. Okay, so which of these models performs
the best? Well, model C has the smallest loss, so it's probably model C. Okay, now let's take model C. After, you know, we've come up with these, all these models, and we've seen, okay, model C is probably the best model. We take model C, and we run our test set through this model. And this test set is used as a final check to see how generalizable that chosen model is. So if I, you know, finish training on my diabetes data set, then I could run it through some chunk of the data and I can say, oh,
like, this is how we perform on data that it's never seen before at any point during the training process. Okay. And that loss, that's the final reported performance of my test set, or this would be the final reported performance of my model. Okay. So let's talk about this thing called loss, because I think I kind of just glossed over it, right? So loss is the difference between your prediction and the actual, like, label. So this would give a slightly higher loss than this. And this would even give a higher loss, because it's even more off.
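Stepping back to the comparison of models A through D for a second, here is a rough sketch of that selection step. It assumes the losses quoted above are validation losses, which is my reading of the walkthrough; the numbers for B, C and D are the ones mentioned, while the value for A is an assumption, since it was only described as high.

```python
# Losses for the four candidate models. B, C and D use the numbers quoted in
# the walkthrough; the value for A is an assumption (it was only called "high").
validation_loss = {"A": 2.0, "B": 1.5, "C": 0.5, "D": 0.9}

# Pick the model whose loss is the smallest.
best_model = min(validation_loss, key=validation_loss.get)

print(best_model, validation_loss[best_model])

# Only the chosen model is then run on the test set, once, and that test
# result is what gets reported as the final performance.
```

The point of the design is that the test set plays no part in choosing the model, so the number it produces is a fair estimate of how the model handles data it has never seen.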
In computer science, we like formulas, right? We like formulaic ways of describing things. So here are some examples of loss functions and how we can actually come up with numbers. This here is known as L1 loss. And basically, L1 loss just takes whatever your, you know, real value is, whatever the real output label is, subtracts the predicted value, and takes the absolute value of that. Okay. So the absolute value is a function that looks something like this. So the further off you are, the greater your loss is, right, in either
direction. So if your real value is off from your predicted value by 10, then your loss for that point would be 10. And then this sum here just means, hey, we're taking all the points in our data set. And we're trying to figure out the sum of how far everything is. Now, we also have something called L2 loss. So this loss function is quadratic, which means that if it's close, the penalty is very minimal. And if it's off by a lot, then the penalty is much, much higher. Okay. And for this, instead of the absolute
value, we just square the difference between the two. Now, there's also something called binary cross entropy loss. It looks something like this. And this is for binary classification, this might be the loss that we use. So this loss, you know, I'm not going to really go through it too much. But you just need to know that loss decreases as the performance gets better. So there are some other measures of accuracy or performance as well. So for example, accuracy, what is accuracy? So let's say that these are pictures that I'm feeding my model, okay.
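Before the accuracy example continues, here is a rough sketch of the three loss functions just described. The exact formulas shown on screen are not in the transcript, so these are the standard textbook definitions (sums for L1 and L2, a mean for binary cross entropy), and the function names are my own.

```python
import math


def l1_loss(y_true, y_pred):
    """Sum of absolute differences between the real and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred))


def l2_loss(y_true, y_pred):
    """Sum of squared differences; big misses are penalized much more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy for 0/1 labels and predicted P(class = 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so we never take log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


# A single point that is off by 10: L1 gives 10, L2 gives 100.
print(l1_loss([50], [40]))
print(l2_loss([50], [40]))

# More confident correct probabilities give a lower cross entropy.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.6, 0.4])
print(good < bad)
```

In all three cases the number shrinks as the predictions get closer to the labels, which is the one property you really need to remember.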
And these predictions might be apple, orange, orange, apple, okay, but the actual is apple, orange, apple, apple. So three of them were correct. And one of them was incorrect. So the accuracy of this model is three quarters or 75%. Alright, coming back to our Colab notebook, I'm going to close this a little bit. Again, we've imported stuff up here. And we've already created our data frame right here. And this is all of our data. This is what we're going to use to train our models. So down here, again, if we now take a
look at our data set, you'll see that our classes are now zeros and ones. So now this is all numerical, which is good, because our computer can now understand that. Okay. And you know, it would probably be a good idea to maybe kind of plot, hey, do these things have anything to do with the class. So here, I'm going to go through all the labels. So for label in the columns of this data frame. So this just gets me the list. Actually, we have the list, right? It's called cols, so let's just use that, it might be
less confusing. So everything up to the last thing, which is the class. So I'm going to take all these 10 different features. And I'm going to plot them as a histogram. So basically, if I take that data frame, and I say, okay, for everything where the class is equal to one, so these are all of our gammas, remember, now, for that portion of the data frame, if I look at this label, so now these, okay, what this part here is saying is, inside the data
frame, get me everything where the class is equal to one. So all of these would fit into that category, right? And now let's just look at the label column. So the first label would be fLength, which would be this column. So this command here is getting me all the different values that belong to class one for this specific label. And that's exactly what I'm going to put into the histogram. And now I'm just going to tell, you know, matplotlib, make the color blue, label this as, you know, gamma, set alpha,
why do I keep doing that, alpha equal to 0.7. So that's just like the transparency. And then I'm going to set density equal to true, so that when we compare it to the hadrons here, we'll have a baseline for comparing them. Okay, so the density being true just basically normalizes these distributions. So you know, if you have 200 of one type, and then 50 of another type, well, if you drew the histograms, it would be hard to compare because one of them would be a lot bigger than the other, right. But by normalizing them,
we kind of are distributing them over how many samples there are. Alright, and then I'm just going to put a title on here and make that the label. The y label, so because it's density, the y label is probability. And the x label is just going to be the label. What is going on. And I'm going to include a legend, and plt.show() just means okay, display the plot. So if I run that. Oh, this should be up to the last item. So we want a list, right, not just the last item. And now we can
see that we're plotting all of these. So here we have the length. Oh, and I made this gamma. So this should be hadron. Okay, so the gammas in blue, the hadrons are in red. So here we can already see that, you know, maybe if the length is smaller, it's probably more likely to be gamma, right. And we can kind of you know, these all look somewhat similar. But here, okay, clearly, if there's more asymmetry, or if you know, this asymmetry measure is larger, then it's probably hadron. Okay, oh, this one's a good one. So f
alpha seems like hadrons are pretty evenly distributed. Whereas if this is smaller, it looks like there's more gammas in that area. Okay, so this is kind of the data that we're working with, we can kind of see what's going on. Okay, so the next thing that we're going to do here is we are going to create our train, our validation, and our test data sets. I'm going to set train, valid and test to be equal to this. So np.split, I'm just splitting up the data frame. And if I do this sample, where
I'm sampling everything, this will basically shuffle my data. Now, I want to pass in where exactly I'm splitting my data set, so the first split is going to be maybe at 60%. So I'm going to say 0.6 times the length of this data frame. And then cast that to an integer, that's going to be the first place where, you know, I cut it off, and that'll be my training data. Now, if I then go to 0.8, this basically means everything between 60% and 80% of the length of the data set will go towards
validation. And then, like everything from 80 to 100, that's going to be my test data. So I can run that. And now, if we go up here, and we inspect this data, we'll see that these columns seem to have values in like the 100s, whereas this one is 0.03. Right? So the scale of all these numbers is way off. And sometimes that will affect our results. So one thing that we would want to do is scale these so that they
are, you know, so that it's now relative to maybe the mean and the standard deviation of that specific column. I'm going to create a function called scale data set. And I'm going to pass in the data frame. And that's what I'll do for now. Okay, so the x values are going to be, you know, I take the data frame. And let's assume that the columns are going to be, you know, that the label will always be the last thing in the data frame. So what I can do is say data frame, dot columns all the
way up to the last item, and get those values. Now for my y, well, it's the last column. So I can just do this, I can just index into that last column, and then get those values. Now, so I'm actually going to import something known as the StandardScaler from sklearn. So if I come up here, I can go to sklearn.preprocessing. And I'm going to import StandardScaler, I have to run that cell, I'm going to come back down here. And now I'm going to create a scaler and use
that, so StandardScaler. And with the scaler, what I can do is actually just fit and transform x. So here, I can say x is equal to scaler.fit_transform(x). So what that's doing is saying, okay, take x and fit the standard scaler to x, and then transform all those values. And that's going to be our new x. Alright. And then I'm also going to just create, you know, the whole data as one huge 2d NumPy array. And in order to do that, I'm going to
call hstack. So hstack is saying, okay, take an array, and another array and horizontally stack them together. That's what the h stands for. So horizontally stacking them together, just like put them side by side, okay, not on top of each other. So what am I stacking? Well, I have to pass in something so that it can stack x and y. And now, okay, so NumPy is very particular about dimensions, right? So in this specific case, our x is a two dimensional object, but y is only a one dimensional thing, it's only a
vector of values. So in order to now reshape it into a 2d item, we have to call np.reshape. And we can pass in the dimensions of the reshape. So if I pass in negative one comma one, that just means okay, make this a 2d array, where the negative one just means infer what this dimension value would be, which ends up being the length of y, this would be the same as literally doing this. But the negative one is easier because we're making the computer do the hard work. So if I stack that,
I'm going to then return the data x and y. Okay. So one more thing is that if we go into our training data set, okay, again, this is our training data set. And we get the length of the training data set. But where the training data sets class is one, so remember that this is the gammas. And then if we print that, and we do the same thing, but zero, we'll see that, you know, there's around 7000 of the gammas, but only around 4000 of the hadrons. So that might actually become an issue. And instead,
what we want to do is we want to oversample our training data set. So that means that we want to increase the number of these values, so that these kind of match better. And surprise, surprise, there is something that we can import that will help us do that. So I'm going to go to from imblearn.over_sampling, and I'm going to import this RandomOverSampler, run that cell, and come back down here. So I will actually add in this parameter called oversample, and set that to False by default. And if I
do want to oversample, then I'm going to create this ros and set it equal to this RandomOverSampler(). And then for x and y, I'm just going to say, okay, just fit_resample x and y. And what that's doing is saying, okay, take more of the smaller class. So take the smaller class and keep sampling from there to increase the size of our data set of that smaller class so that they now match. So if I do this,
and I call scale_dataset, and I pass in the training data set where oversample is True. So let's say this is train, and then x train, y train. Oops, what's going on? These should be columns. So basically, what I'm doing now is I'm just saying, okay, what is the length of y train? Okay, now it's 14,800, whatever. And now let's take a look at how many of these are type one. So actually, we can just sum that up. And then we'll also see that if we instead switch the label and ask how many of
them are the other type, it's the same value. So now these have been evenly, you know, rebalanced. Okay. So here, I'm just going to make this the validation data set. And then the next one, I'm going to make this the test data set. Alright, and we're actually going to switch oversample here to False. Now, the reason why I'm switching that to False is because my validation and my test sets are for the purpose of, you know, if I have data that I haven't seen yet, how does my model perform on those? And I
don't want to oversample for that right now. Like, I don't care about balancing those. I want to know, if I have a random set of data that's unlabeled, can I trust my model, right? So that's why I'm not oversampling. I run that. And again, what is going on? Oh, it's because we already have this train. So I have to come up here and split that data frame again. And now let's run these. Okay. So now we have our data properly formatted. And we're going to move on to different models now. And I'm going
to tell you guys a little bit about each of these models. And then I'm going to show you how we can do that in our code. So the first model that we're going to learn about is KNN or K nearest neighbors. Okay, so here, I've already drawn a plot. On the y axis, I have the number of kids that a family might have. And then on the x axis, I have their income in terms of thousands per year. So, you know, if someone's making 40,000 a year, that's where this would be. And if somebody
is making 320, that's where that would be. If somebody has zero kids, they'd be somewhere along this axis. If somebody has five, they'd be somewhere over here. Okay. And now I have these plus signs and these minus signs on here. So what I'm going to represent here is the plus sign means that they own a car. And the minus sign is going to represent no car. Okay. So your initial thought should be, okay, I think this is binary classification, because all of our points, all of our samples, have labels. So this is a sample with the plus label.
And this here is another sample with the minus label. This is an abbreviation for "with" that I'll use. Alright, so we have this entire data set. And maybe around half the people own a car and maybe around half the people don't own a car. Okay, well, what if I had some new point, let me choose a different color, I'll use this nice green. Well, what if I have a new point over here? So let's say that somebody makes 40,000 a year and has two kids. What do we think that would be? Well, just logically
looking at this plot, you might think, okay, it seems like they wouldn't have a car, right? Because that kind of matches the pattern of everybody else around them. So that's a whole concept of this nearest neighbors is you look at, okay, what's around you. And then you're basically like, okay, I'm going to take the label of the majority that's around me. So the first thing that we have to do is we have to define a distance function. And a lot of times in, you know, 2d plots like this, our distance function is something known as
Euclidean distance. And Euclidean distance is basically just this straight line distance like this. Okay. So this would be the Euclidean distance, it seems like there's this point, there's this point, there's that point, etc. So the length of this line, this green line that I just drew, that is what's known as Euclidean distance. If we want to get technical with that, this exact formula is the distance here, let me zoom in. The distance is equal to the square root of one point's x minus the other point's x, squared, plus, extend that square root, the same thing
for y. So y one of one point minus y two of the other, squared. Okay, so we're basically trying to find the difference in x and the difference in y, and then square each of those, sum them up, and take the square root. Okay, so I'm going to erase this so it doesn't clutter my drawing. But anyways, now going back to this plot, so here in the K nearest neighbors algorithm, we see that there is a K, right? And this K is basically telling us, okay, how many neighbors do we use in order to judge what
the label is? So usually, we use a K of maybe, you know, three or five, depends on how big our data set is. But here, I would say, maybe a logical number would be three or five. So let's say that we take K to be equal to three. Okay, well, of this data point that I drew over here, let me use green to highlight this. Okay, so of this data point that I drew over here, it looks like the three closest points are definitely this one, this one. And then this one has a length of
four. And this one seems like it'd be a little bit further than four. So actually, this would be these would be our three points. Well, all those points are blue. So chances are, my prediction for this point is going to be blue, it's going to be probably don't have a car. All right, now what if my point is somewhere? What if my point is somewhere over here, let's say that a couple has four kids, and they make 240,000 a year. All right, well, now my closest points are this one, probably a little bit over that
one. And then this one, right? Okay, still all pluses. Well, this one is more than likely to be plus. Right? Now, let me get rid of some of these just so that it looks a little bit more clear. All right, let's go through one more. What about a point that might be right here? Okay, let's see. Well, definitely this is the closest, right? This one's also closest. And then it's really close between the two of these. But if we actually do the mathematics, it seems like if we zoom in, this one is right here. And
this one is in between these two. So this one here is actually shorter than this one. And that means that that top one is the one that we're going to take. Now, what is the majority of the points that are close by? Well, we have one plus here, we have one plus here, and we have one minus here, which means that the pluses are the majority. And that means that this label is probably somebody with a car. Okay. So this is how K nearest neighbors would work. It's that simple. And this can be extrapolated to
higher dimensions. You know, here, we have two different features, we have the income, and then we have the number of kids. But let's say we have 10 different features, we can expand our distance function so that it includes all 10 of those dimensions, we take the square root of the sum of all those squared differences, and then we figure out which ones are the closest to the point that we desire to classify. Okay. So that's K nearest neighbors. So now we've learned about K nearest neighbors. Let's see how we would be able to do that
within our code. So here, I'm going to label the section K nearest neighbors. And we're actually going to use a package from sklearn. So the reason why we, you know, use these packages is so that we don't have to manually code all these things ourselves, because it would be really difficult. And chances are the way that we would code it either would have bugs, or it'd be really slow, or, I don't know, a whole bunch of issues. So what we're going to do is hand it off to the pros. From here, I can say,
okay, from sklearn, which is this package, dot neighbors, I'm going to import KNeighborsClassifier, because we're classifying. Okay, so I run that. And our KNN model is going to be this KNeighborsClassifier. And we can pass in a parameter of how many neighbors, you know, we want to use. So first, let's see what happens if we just use one. So now if I do knn_model.fit, I can pass in my x training set and my y train data. Okay. So that effectively fits this model. And let's get
all the predictions. So y KNN, or I guess, yeah, let's do y predictions. And my y predictions are going to be knn_model.predict. So let's use the test set, x test. Okay. Alright, so if I call y pred, you'll see that we have those. But if I get my truth values for that test set, you'll see that this is what we actually have. So just looking at this, we got five out of six of them. Okay, great. So let's actually take a look at something called the classification report that's offered by sklearn.
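The nearest-neighbors idea described above (measure the Euclidean distance to every training point, keep the k closest, take the majority label) can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `euclidean_distance` and `knn_predict` and the toy income/kids data are made up for this example, and it is not how sklearn's KNeighborsClassifier is implemented internally.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = [
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance and keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (income in thousands per year, number of kids).
# 1 means "owns a car", 0 means "no car". These values are invented.
train_points = [(40, 1), (50, 3), (60, 2), (200, 4), (240, 3), (280, 5)]
train_labels = [0, 0, 0, 1, 1, 1]

low_income_family = knn_predict(train_points, train_labels, (45, 2), k=3)
high_income_family = knn_predict(train_points, train_labels, (240, 4), k=3)
print(low_income_family, high_income_family)
```

Notice that in this toy data the income values are much larger than the kid counts, so income dominates the distance. That is one reason the notebook standardizes the features with the scaler before fitting any model.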
So if I go to from sklearn.metrics import classification_report, what I can actually do is say, hey, print out this classification report for me. And let's check, you know, I'm giving you the y test and the y prediction. We run this and we see we get this whole entire chart. So I'm going to tell you guys a few things on this chart. Alright, this accuracy is 82%, which is actually pretty good. That's just saying, hey, if we just look at, you know, what each of these new points is closest to, then
we actually get an 82% accuracy, which means how many do we get right versus how many total are there. Now, precision, okay, you might see that we have it for class zero and class one. What precision is saying is, well, let's go to this Wikipedia diagram over here, because I actually kind of like this diagram. So here, this is our entire data set. And on the left over here, we have everything that we know is positive. So everything that is actually truly positive, that we've labeled positive in our original data set.
And over here, this is everything that's truly negative. Now in the circle, we have things that are positive that were labeled positive by our model. On the left here, we have things that are truly positive, because you know, this side is the positive side and the side is the negative side. So these are truly positive. Whereas all these ones out here, well, they should have been positive, but they are labeled as negative. And in here, these are the ones that we've labeled positive, but they're actually negative. And out here, these are truly negative. So precision
is saying, okay, out of all the ones we've labeled as positive, how many of them are true positives? And recall is saying, okay, out of all the ones that we know are truly positive, how many do we actually get right? Okay, so going back to this over here, our precision score, so again, precision, out of all the ones that we've labeled as the specific class, how many of them are actually that class, it's 77% and 84%. Now, recall, so out of all the ones that are actually this class, how many of those did we get, this is
68% and 89%. Alright, so not too shabby. We can clearly see that the recall and precision for class zero are worse than for class one. Right? So that means it's worse for hadrons than for our gammas. This f1 score over here is kind of a combination of the precision and recall scores. So we're actually going to mostly look at this one because we have an unbalanced test data set. So here we have a measure of 72 and 87, or 0.72 and 0.87, which is not too shabby. All
right. Well, what if we, you know, made this three. So we actually see that, okay, so what was it originally with one? We see that our f1 score, you know, is now it was point seven two and then point eight seven. And then our accuracy was 82%. So if I change that to three. Alright, so we've kind of increased zero at the cost of one and then our overall accuracy is 81. So let's actually just make this five. Alright, so you know, again, very similar numbers, we have 82% accuracy, which is pretty decent for a
model that's relatively simple. Okay, the next type of model that we're going to talk about is something known as naive Bayes. Now, in order to understand the concepts behind naive Bayes, we have to be able to understand conditional probability and Bayes rule. So let's say I have some sort of data set that's shown in this table right here. People who have COVID are over here in this red row. And people who do not have COVID are down here in this green row. Now, what about the COVID test? Well, people who have tested positive are over
here in this column. And people who have tested negative are over here in this column. Okay. Yeah, so basically, our categories are people who have COVID and test positive, people who don't have COVID but test positive, so a false positive, people who have COVID and test negative, which is a false negative, and people who don't have COVID and test negative, which, good, means you don't have COVID. Okay, so let's make this slightly more legible. And here, in the margins, I've written down the sums of whatever it's referring to. So this here is the sum
of this entire row. And this here might be the sum of this column over here. Okay. So the first question that I have is, what is the probability of having COVID given that you have a positive test? And in probability, we write that out like this. So the probability of COVID given, so this line, that vertical line means given that, you know, some condition, so given a positive test, okay, so what is the probability of having COVID given a positive test? So what this is asking is saying, okay, let's go into this condition. So the
condition of having a positive test, that is this slice of the data, right? That means if you're in this slice of data, you have a positive test. So given that we have a positive test, given in this condition, in this circumstance, we have a positive test. So what's the probability that we have COVID? Well, if we're just using this data, the number of people that have COVID is 531. So I'm gonna say that there's 531 people that have COVID. And then now we divide that by the total number of people that have a positive test,
which is 551. Okay, so that's the probability and doing a quick division, we get that this is equal to around 96.4%. So according to this data set, which is data that I made up off the top of my head, so it's not actually real COVID data. But according to this data, the probability of having COVID given that you tested positive is 96.4%. Alright, now with that, let's talk about Bayes rule, which is this section here. Let's ignore this bottom part for now. So Bayes rule is asking, okay, what is the probability of some event A
happening, given that B happened. So this, we already know has happened. This is our condition, right? Well, what if we don't have data for that, right? Like, what if we don't know what the probability of A given B is? Well, Bayes rule is saying, okay, well, you can actually go and calculate it, as long as you have a probability of B given A, the probability of A and the probability of B. Okay. And this is just a mathematical formula for that. Alright, so here we have Bayes rule. And let's actually see Bayes rule in action.
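Written out in symbols, the two ideas so far look like the following. This is just a compact restatement of what was said above, using the made-up counts from the table (531 people with COVID among the 551 positive tests) and the generic events A and B from Bayes rule.

```latex
% Conditional probability read directly off the (made-up) table
P(\text{COVID} \mid \text{positive test}) = \frac{531}{551} \approx 0.964

% Bayes rule: the probability of A given B, computed from
% P(B given A), P(A), and P(B)
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```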
Let's use it on an example. So here, let's say that we have some disease statistics, okay. So not COVID different disease. And we know that the probability of obtaining a false positive is 0.05 probability of obtaining a false negative is 0.01. And the probability of the disease is 0.1. Okay, what is the probability of the disease given that we got a positive test? Hmm, how do we even go about solving this? So what what do I mean by false positive? What's a different way to rewrite that? A false positive is when you test positive, but
you don't actually have the disease. So this here is a probability that you have a positive test given no disease, right? And similarly for the false negative, it's a probability that you test negative given that you actually have the disease. So if I put that into a chart, for example, and this might be my positive and negative tests, and this might be my diseases, disease and no disease. Well, the probability that I test positive, but actually have no disease, okay, that's 0.05 over here. And then the false negatives up here for 0.01. So I'm testing
negative, but I actually do have the disease. So the probability that you test positive given that you don't have the disease, plus the probability that you test negative given that you don't have the disease, that should sum up to one. Okay, because if you don't have the disease, then you should have some probability that you're testing positive and some probability that you're testing negative. But that probability, in total, should be one. So that means that the probability of negative given no disease, this should be the complement, this should be the opposite. So it should be 0.95
because it's one minus whatever this probability is. And then similarly, oops, up here, this should be 0.99 because the probability that we, you know, test negative given that we have the disease plus the probability that we test positive given that we have the disease should equal one. So this is our probability chart. And now, this probability of disease being 0.1 just means I have a 10% probability of actually having the disease, right? Like, in the general population, the probability that I have the disease is 0.1. Okay, so what is the probability that I have the disease given that
I got a positive test? Well, remember that we can write this out in terms of Bayes rule, right? So if I use this rule up here, this is the probability of a positive test given that I have the disease times the probability of the disease divided by the probability of the evidence, which is my positive test. Alright, now let's plug in some numbers for that. The probability of having a positive test given that I have the disease is 0.99. And then the probability that I have the disease is this value over here 0.1. Okay. And
then the probability that I have a positive test at all should be, okay, what is the probability that I have a positive test given that I actually have the disease, times the probability of having the disease. And then the other case, the probability of me having a positive test given no disease, times the probability of not actually having the disease. Okay, so I can expand that probability of having a positive test out into these two different cases, I have the disease, and then I don't. And then what's the probability of
having positive tests in either one of those cases. So that expression would become 0.99 times 0.1 plus 0.05, so that's the probability that I'm testing positive but don't have the disease, times the probability that I don't actually have the disease. So that's one minus 0.1, the probability that the population doesn't have the disease is 90%, so 0.9. And let's do that math: the top is 0.99 times 0.1, which is 0.099, and the bottom is 0.099 plus 0.045, which is 0.144. And I get an answer of 0.6875 or 68.75%. Okay. All right, so we can actually expand Bayes rule and apply it to classification. And this is what we call
naive Bayes. So first, a little terminology. So the posterior is this over here, because it's asking, hey, what is the probability of some class C k? So by C k, I just mean, you know, the different categories, so C for category or class or whatever. So category one might be cats, category two, dogs, category three, lizards, all the way, we have K categories, K is just some number. Okay. So what is the probability of this specific sample x, so this is our feature vector of this one sample, what is the probability of x fitting into
category one, two, three, or whatever, right? So that's what this is asking, what is the probability that, you know, it's actually from this class, given all this evidence that we see, the x's. So the likelihood is this quantity over here. It's saying, okay, well, assume that this class is class C k, okay, assume that this is a category. Well, what is the likelihood of actually seeing x, all these different features, from that category? And then this here is the prior. So like in the entire population of things, what are
the probabilities? What is the probability of this class in general? Like if I have, you know, in my entire data set, what is the percentage? What is the chance that this image is a cat? How many cats do I have? Right. And then this down here is called the evidence because what we're trying to do is we're changing our prior, we're creating this new posterior probability built upon the prior by using some sort of evidence, right? And that evidence is a probability of x. So that's some vocab. And this here is a rule for naive
Bayes. Whoa, okay, let's digest that a little bit. Okay. So what is let me use a different color. What is this side of the equation asking? It's asking, what is the probability that we are in some class K, CK, given that, you know, this is my first input, this is my second input, this is, you know, my third, fourth, this is my nth input. So let's say that our classification is, do we play soccer today or not? Okay, and let's say our x's are, okay, is it how much wind is there? How much rain is
there? And what day of the week is it? So let's say that it's raining, it's not windy, but it's Wednesday, do we play soccer? Do we not? So let's use Bayes rule on this. So this here is equal to the probability of x one, x two, all these joint probabilities, given class K, times the probability of that class, all over the probability of this evidence. Okay. So what is this fancy symbol over here? This means proportional to. So just like how our equal sign means it's equal to, this little squiggly sign means that this
is proportional to. Okay, and this denominator over here, you might notice that it has no impact on the class, like, that number doesn't depend on the class, right? So this is going to be constant for all of our different classes. So what I'm going to do is make things simpler. So I'm just going to say that this probability of class K given x one, x two, all the way to x n, this is going to be proportional to the numerator. I don't care about the denominator, because it's the same for every single class. So this is proportional to
x one, x two, x n given class K times the probability of that class. Okay. All right. So in naive Bayes, the point of it being naive is that, for this joint probability, we're just assuming that all of these different things are independent. So in my soccer example, you know, given that we're playing soccer, the probability that it's windy, and that it's rainy, and that it's Wednesday, all these things are independent, we're assuming that they're independent. So that means that I can actually write this part of the equation here as
this. So each term in here, I can just multiply all of them together. So the probability of the first feature, given that it's class K, times the probability of the second feature, given that it's class K, all the way up until, you know, the nth feature, given that it's class K. So this expands to all of this. All right, which means that this here is now proportional to the thing that we just expanded times this. So I'm going to write that out. So the probability of that class. And
I'm actually going to use this symbol. So what this means is it's a huge multiplication, it means multiply everything to the right of this. So this probability of x i, given some class K, but do it for all the i's. So i, what is i? Okay, we're going to go from the first x i all the way to the nth. So that means for every single i, we're just multiplying these probabilities together. And that's where this up here comes from. So to wrap this up, oops, this should be a line, to wrap this up in
plain English. Basically, what this is saying is a probability that you know, we're in some category, given that we have all these different features is proportional to the probability of that class in general, times the probability of each of those features, given that we're in this one class that we're testing. So the probability of it, you know, of us playing soccer today, given that it's rainy, not windy, and and it's Wednesday, is proportional to Okay, well, what is what is the probability that we play soccer anyways, and then times the probability that it's rainy, given
that we're playing soccer, times the probability that it's not windy, given that we're playing soccer, so how often is it not windy when we're playing soccer, you know, and then what's the probability that it's Wednesday, given that we're playing soccer. Okay. So how do we use this in order to make a classification? So that's where this comes in. Our y hat, our predicted y, is going to be equal to something called the argmax of this expression over here. So
okay, if I write out this, again, this means the probability of being in some class C k given all of our evidence. Well, we're going to take the k that maximizes this expression on the right. That's what argmax means. So if k is in, oops, one through K, so that's how many categories there are, we're going to go through each k. And we're going to solve this expression over here and find the k that makes that the largest. Okay. And remember that instead of writing this, we have now a formula, thanks to Bayes rule
for helping us approximate that, right, in terms of something that maybe we have the evidence for, that we have the answers for based on our training set. So this principle of going through each of these and finding whatever class, whatever category maximizes this expression on the right, this is something known as MAP for short, or maximum a posteriori. Pick the hypothesis, so pick the k, that is the most probable so that we minimize the probability of misclassification. Right. So that is MAP. That is naive Bayes. Back to the notebook. So just
like how I imported KNeighborsClassifier up here, for naive Bayes, I can go to sklearn.naive_bayes, and I can import GaussianNB. Right. And here I'm going to say my naive Bayes model is equal to GaussianNB(). This is very similar to what we had above. And I'm just going to say with this model, we are going to fit x train and y train. All right, just like above. So this, I might actually, so I'm going to set that. And exactly, just like above, I'm going to make my prediction. So here,
I'm going to instead use my naive Bayes model. And of course, I'm going to run the classification report again. So I'm actually just going to put these in the same cell. But here we have the y the new y prediction and then y test is still our original test data set. So if I run this, you'll see that. Okay, what's going on here, we get worse scores, right? Our precision, for all of them, they look slightly worse. And our, you know, for our precision, our recall, our f1 score, they look slightly worse for all the
different categories. And our total accuracy, I mean, it's still 72%, which is not too shabby. But it's still 72%. Okay. Which, you know, is not that great. Okay, so let's move on to logistic regression. Here, I've drawn a plot. I have y, so this is my label, on one axis. And then this is maybe one of my features. So let's just say I only have one feature in this case, x zero, right? Well, we see that, you know, I have a few of one class type down here. And we know it's one class type
because it's zero. And then we have our other class type, one, up here. And then we have our y. Okay. So many of you guys are familiar with regression. So let's start there. If I were to draw a regression line through this, it might look something like this. Right? Well, this doesn't seem to be a very good model. Like, why would we use this specific line to predict y? Right? It's iffy. Okay. For example, we might say, okay, well, it seems like, you know, everything from here downwards would be one class type, and
here, upwards would be another class type. But when you look at this, you can just visually tell, okay, like, that line doesn't make sense. Those dots are not along that line. And the reason is because we are doing classification, not regression. Okay. Well, first of all, let's start here, we know that this model, if we just use this line, it equals m x, so whatever this, let's just say it's x, plus b, which is the y intercept, right? And m is the slope. But when we use logistic regression, is
it actually y hat? No, it's not, right? So when we're working with logistic regression, what we're actually estimating in our model is a probability, a probability between zero and one, that it is class zero or class one. So here, let's rewrite this as p equals m x plus b. Okay, well, m x plus b, that can range, you know, from negative infinity to infinity, right? For any value of x, it goes from negative infinity to infinity. But one of the rules of probability is that probability has to stay between
zero and one. So how do we fix this? Well, maybe instead of just setting the probability equal to that, we can set the odds equal to this. So by that, I mean, okay, let's do probability divided by one minus the probability. Okay, so now it becomes this ratio. Now this ratio is allowed to go all the way up to infinity. But there's still one issue here. Let me move this over a bit. The one issue here is that m x plus b, that can still be negative, right? Like if, you know, I have a negative slope, if I have
a negative b, if I have some negative x's in there, I don't know, but that can be that's allowed to be negative. So how do we fix that? We do that by actually taking the log of the odds. Okay. So now I have the log of you know, some probability divided by one minus the probability. And now that is on a range of negative infinity to infinity, which is good because the range of log should be negative infinity to infinity. Now how do I solve for P the probability? Well, the first thing I can do
is take, you know, I can remove the log by taking e to the whatever is on both sides. So that gives me the probability over one minus the probability is now equal to e to the m x plus b. Okay. So let's multiply that out. So the probability is equal to one minus the probability, times e to the m x plus b. So p is equal to e to the m x plus b minus p times e to the m x plus b. And now we can move like terms to one
side. So if I do p, so basically, I'm moving this over, so I'm adding p times e to the m x plus b to both sides. So now p times one plus e to the m x plus b is equal to e to the m x plus b, and let me change this parenthesis, make it a little bigger. So now my probability can be e to the m x plus b divided by one plus e to the m x plus b. Okay, well, let me just rewrite this really quickly, I want a numerator of one on top. Okay, so what I'm going to do is I'm going
to multiply the top by e to the negative of m x plus b, and then also the bottom by e to the negative of m x plus b, and I'm allowed to do that because this over this is one. So now my probability is equal to one over one plus e to the negative of m x plus b. And now why did I rewrite it like that? It's because this is actually a form of a special function, which is called the sigmoid function. And for the sigmoid function, it looks something like this. So S of x, the sigmoid of some x, is equal to
one over one plus e to the negative x. So essentially, what I just did up here is rewrite this in some sigmoid function, where the x value is actually m x plus b. So maybe I'll change this to y just to make that a bit more clear, it doesn't matter what the variable name is. But this is our sigmoid function. And visually, what our sigmoid function looks like is it goes from zero. So this here is zero to one. And it looks something like this curved s, which I didn't draw too well. Let me try
that again. It's hard to draw. Let's see if I can draw this right. Like that. Okay, so it goes in between zero and one. And you might notice that this form fits our shape up here. Oops, let's draw it sharper. But it fits our shape up there a lot better, right? Alright, so that is what we call logistic regression: we're basically trying to fit our data to the sigmoid function. Okay. And when we only have one feature x, that's what we call simple logistic regression. But
then if we have, you know, so that's only x zero, but then if we have x zero, x one, all the way to x n, we call this multiple logistic regression, because there are multiple features that we're considering when we're building our model. Logistic regression. So I'm going to put that here. And again, from sklearn.linear_model, we can import LogisticRegression. All right. And just like how we did above, we can repeat all of this. So here, instead of NB, I'm going to call this log model, or LG, for logistic regression. I'm going
to change this to LogisticRegression. So I'm just going to use the default logistic regression. But actually, if you look here, you see that you can use different penalties. So right now we're using an L2 penalty. L2 is quadratic in the model's weights. Okay, so that means that it would really penalize large weights. For all these other things, you can toggle these different parameters, and you might get slightly different results. If I were building a production level logistic regression model, then I would want to go and figure
out what the best parameters to pass in here are, based on my validation data. But for now, we'll just use this out of the box. So again, I'm going to fit the X train and the Y train. And I'm just going to predict again, so I can just call this again. And instead of NB, I'm going to use LG. So here, this is decent: precision 65%, recall 71, F1 of
68 or 82, total accuracy of 77. Okay, so it performs slightly better than naive Bayes, but it's still not as good as kNN. Alright, so the last model for classification that I wanted to talk about is something called support vector machines, or SVMs for short. So what exactly is an SVM model? I have two different features x zero and x one on the axes. And then I've told you if it's, you know, class zero or class one based on the blue and red labels. My goal is to find some sort of line between
these two labels that best divides the data. Alright, so this line is our SVM model. So I call it a line here because in 2D, it's a line, but in 3D, it would be a plane, and then you can also have more and more dimensions. So the proper term is actually: I want to find the hyperplane that best differentiates these two classes. Let's see a few examples. Okay, so first, between these three lines, let's say A, B, and C, which one is the best divider of the data, which one has, you know, all
the data on one side or the other, or at least if it doesn't, which one divides it the most, right, like which one has the most defined boundary between the two different groups. So this question should be pretty straightforward. It should be A, right? Because A has a clear, distinct line where, you know, everything on this side of A is one label, it's negative, and everything on this side of A is the other label, it's positive. So what if I have A, but then what if I had drawn my B like this,
and my C, maybe like this, sorry, the labels are kind of close together. But now which one is the best? So I would argue that it's still A, right? And why is it still A? Because in these other two, look at how close this is to these points. Right? So if I had some new point that I wanted to estimate, okay, say I didn't have A or B. So let's say we're just working with C. Let's say I have some new point that's right
here. Or maybe a new point that's right there. Well, it seems like just logically looking at this. I mean, without the boundary, that would probably go under the positives, right? I mean, it's pretty close to that other positive. So one thing that we care about in SVM is something known as the margin. Okay, so not only do we want to separate the two classes really well, we also care about the boundary in between where the points in those classes in our data set are, and the line that we're drawing. So in a line like this,
the closest values to this line might be like here. And I'm trying to draw these perpendicular. Right? And so this effectively, if I switch over to these dotted lines, if I can draw this right. So these effectively are what's known as the margins. Okay, so these both here, these are our margins in our SVMs. And our goal is to maximize those margins. So not only do we want the line that best separates the two different classes, we want the line that has the largest margin. And the data points that lie on the margin lines, the
data. So basically, these are the data points that are helping us define our divider. These are what we call support vectors. Hence the name support vector machines. Okay, so the issue with SVMs sometimes is that they're not so robust to outliers. Right? So for example, if I had one outlier, like this up here, that would totally change where I want my support vector to be, even though that might be my only outlier. Okay. So that's just something to keep in mind when you're working with SVMs: it might not be the best model
if there are outliers in your data set. Okay, so another example of SVMs might be, let's say that we have data like this, I'm just going to use a one dimensional data set for this example. Let's say we have a data set that looks like this. Well, our, you know, separators should be perpendicular to this line. But it should be somewhere along this line. So it could be anywhere like this. You might argue, okay, well, there's one here. And then you could also just draw another one over here, right? And then maybe you can have
two SVMs. But that's not really how SVMs work. But one thing that we can do is we can create some sort of projection. So I realize here that one thing I forgot to do was to label where zero was. So let's just say zero is here. Now, what I'm going to do is I'm going to say, okay, I'm going to have x, and then I'm going to have x, sorry, x zero and x one. So x zero is just going to be my original x. But I'm going to make x one equal to let's say,
x squared. So whatever this is, squared, right? So now, my negatives would be, you know, maybe somewhere here, here, just pretend that it's somewhere up here. Right. And now my pluses might be something like that. And I'm going to run out of space over here. So I'm just going to draw these together, use your imagination. But once I draw it like this, well, it's a lot easier to apply a boundary, right? Now our SVM could be maybe something like this. And now you see that we've divided our data set. Now it's separable where one
class is this way. And the other class is that way. Okay, so that's known as SVMs. I do highly suggest that, you know, any of these models that we just mentioned, if you're interested in them, do go more in depth mathematically into them. Like how do we how do we find this hyperplane? Right? I'm not going to go over that in this specific course, because you're just learning what an SVM is. But it's a good idea to know, oh, okay, this is the technique behind finding, you know, what exactly are the are the how do
you define the hyperplane that we're going to use. So anyways, this transformation that we did down here, this is known as the kernel trick. So when we go from x to some coordinate x, and then x squared, what we're doing is we are applying a kernel. So that's why it's called the kernel trick. So SVMs are actually really powerful. And you'll see that here. So from sklearn.svm, we are going to import SVC. And SVC is our support vector classifier. So with our SVM model, we are going to, you know, create an SVC
model. And we are going to, again, fit this to X train. I could have just copied and pasted this, I should have probably done that. Okay, it's taking a bit longer. All right. Let's predict using our SVM model. And here, let's see if I can hover over this. Right. So again, you see a lot of these different parameters here that you can go back and change if you were creating a production
level model. Okay, but in this specific case, we'll just use it out of the box again. So if I make predictions, you'll note that, wow, the accuracy actually jumps to 87% with the SVM. And even with class zero, there's nothing less than, you know, point eight, which is great. And for class one, I mean, everything's at 0.9, which is higher than anything that we had seen to this point. So far, we've gone over four different classification models: we've done SVM, logistic regression, naive Bayes and kNN. And these are just simple ways on how to
implement them. Each of these they have different, you know, they have different hyper parameters that you can go and you can toggle. And you can try to see if that helps later on or not. But for the most part, they perform, they give us around 70 to 80% accuracy. Okay, with SVM being the best. Now, let's see if we can actually beat that using a neural net. Now the final type of model that I wanted to talk about is known as a neural net or neural network. And neural nets look something like this. So you
have an input layer, this is where all your features would go. And they have all these arrows pointing to some sort of hidden layer. And then all these arrows point to some sort of output layer. So what is what is all this mean? Each of these layers in here, this is something known as a neuron. Okay, so that's a neuron. In a neural net. These are all of our features that we're inputting into the neural net. So that might be x zero x one all the way through x n. Right. And these are the features
that we talked about there, they might be you know, the pregnancy, the BMI, the age, etc. Now all of these get weighted by some value. So they are multiplied by some w number that applies to that one specific category that one specific feature. So these two get multiplied. And the sum of all of these goes into that neuron. Okay, so basically, I'm taking w zero times x zero. And then I'm adding x one times w one and then I'm adding you know, x two times w two, etc, all the way to x n times w
n. And that's getting input into the neuron. Now I'm also adding this bias term, which just means okay, I might want to shift this by a little bit. So I might add five or I might add 0.1 or I might subtract 100, I don't know. But we're going to add this bias term. And the output of all these things. So the sum of this, this, this and this, go into something known as an activation function, okay. And then after applying this activation function, we get an output. And this is what a neuron would look like.
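A single neuron like the one just described can be sketched in a few lines of plain Python. This is only an illustration: the function names, the example numbers, and the choice of sigmoid as the activation function are assumptions made for the sketch, not something taken from the course notebook.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    # Weighted sum w0*x0 + w1*x1 + ... + wn*xn, shifted by the bias term.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function is applied to that sum to produce the output.
    return activation(z)

# Illustrative values only (hypothetical features, weights, and bias).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.1
output = neuron(x, w, b)
print(output)
```

Swapping `activation` for a different function changes the neuron's output without touching the weighted-sum step.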
Now a whole network of them would look something like this. So I kind of glossed over this activation function. What exactly is that? This is what a neural net looks like if we have all our inputs here. And let's say all of these arrows represent some sort of addition, right? Then what's going on is we're just adding a bunch of times, right? We're adding some sort of weight times these inputs a bunch of times. And then if we were to go back and factor that all out, then this entire neural net is just
a linear combination of these input layers, which I don't know about you, but that just seems kind of useless, right? Because we could literally just write that out in a formula, why would we need to set up this entire neural network, we wouldn't. So the activation function is introduced, right? So without an activation function, this just becomes a linear model. An activation function might look something like this. And as you can tell, these are not linear. And the reason why we introduce these is so that our entire model doesn't collapse on itself and become a
linear model. So over here, this is something known as a sigmoid function, it runs between zero and one, tanh runs between negative one all the way to one. And this is ReLU, which anything less than zero is zero, and then anything greater than zero is linear. So with these activation functions, every single output of a neuron is no longer just the linear combination of these, it's some sort of altered linear state, which means that the input into the next neuron is, you know, it doesn't it doesn't collapse on itself, it doesn't become linear, because we've
introduced all these nonlinearities. So this is a training set, the model, the loss, right? And then we do this thing called training, where we have to feed the loss back into the model, and make certain adjustments to the model to improve this predicted output. Let's talk a little bit about the training, what exactly goes on during that step. Let's go back and take a look at our L2 loss function. This is what our L2 loss function looks like it's a quadratic formula, right? Well, up here, the error is really, really, really, really large. And our
goal is to get somewhere down here, where the loss is decreased, right? Because that means that our predicted value is closer to our true value. So that means that we want to go this way. Okay. And thanks to a lot of properties of math, something that we can do is called gradient descent, in order to follow this slope down this way. This quadratic is, it has different different slopes with respect to some value. Okay, so the loss with respect to some weight w zero, versus w one versus w n, they might all be different. Right?
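To make the idea of a different slope for each weight a bit more concrete, here is a minimal Python sketch. It assumes a plain linear model and an L2 loss averaged over the data points; the function names and the tiny data set are made up for illustration and are not from the course.

```python
def predict(weights, bias, x):
    # Linear model: w0*x0 + w1*x1 + ... + bias.
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def l2_loss(weights, bias, data):
    # Average of the squared differences between predicted and true values.
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in data) / len(data)

def loss_gradient(weights, bias, data):
    # Slope of the L2 loss with respect to each weight separately.
    n = len(data)
    grads = [0.0] * len(weights)
    for x, y in data:
        error = predict(weights, bias, x) - y
        for j, xj in enumerate(x):
            grads[j] += 2 * error * xj / n
    return grads

# Hypothetical data generated from y = 1*x0 + 2*x1.
data = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([3.0, 3.0], 9.0)]

print(l2_loss([0.0, 0.0], 0.0, data))
print(loss_gradient([0.0, 0.0], 0.0, data))  # one slope per weight, and they differ
print(loss_gradient([1.0, 2.0], 0.0, data))  # at the true weights the slopes vanish
```

Each entry of the returned list is the slope of the loss with respect to one weight, which is the quantity the step direction is based on.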
So some way that I kind of think about it is, to what extent is this value contributing to our loss. And we can actually figure that out through some calculus, which we're not going to touch up on in this specific course. But if you want to learn more about neural nets, you should probably also learn some calculus and figure out what exactly back propagation is doing, in order to actually calculate, you know, how much do we have to backstep by. So the thing is here, you might notice that this follows this curve at all of
these different points. And the closer we get to the bottom, the smaller this step becomes. Now stick with me here. So my new value, this is what we call a weight update, I'm going to take w zero, and I'm going to set some new value for w zero. And what I'm going to set for that is the old value of w zero, plus some factor, which I'll just call alpha for now, times whatever this arrow is. So that's basically saying, okay, take our old w zero, our old weight, and just decrease it this way. So
I guess increase it in this direction, right, like take a step in this direction. But this alpha here is telling us, okay, don't don't take a huge step, right, just in case we're wrong, take a small step, take a small step in that direction, see if we get any closer. And for those of you who, you know, do want to look more into the mathematics of things, the reason why I use a plus here is because this here is the negative gradient, right, if this were just the if you were to use the actual gradient,
this should be a minus. Now this alpha is something that we call the learning rate. Okay, and that adjusts how quickly we're taking steps. And that will ultimately control how long it takes for our neural net to converge. Or sometimes if you set it too high, it might even diverge. But with all of these weights, so here I have w zero, w one, and then w n. We make the same kind of update to all of them after we calculate the gradient of the loss with respect to that
weight. So that's how back propagation works. And that is everything that's going on here. After we calculate the loss, we're calculating gradients, making adjustments in the model. So we're setting all the weights to something adjusted slightly. And then we're saying, okay, let's take the training set and run it through the model again, and go through this loop all over again. So for machine learning, we already have seen some libraries that we use, right, we've already seen sklearn. But when we start going into neural networks,
this is kind of what we're trying to program. And it's not very fun to try to do this from scratch, because not only will we probably have a lot of bugs, but also probably not going to be fast enough, right? Wouldn't it be great if there are just some, you know, full time professionals that are dedicated to solving this problem, and they could literally just give us their code that's already running really fast? Well, the answer is, yes, that exists. And that's why we use TensorFlow. So TensorFlow makes it really easy to define these models.
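As a rough illustration of the loop just described (calculate the loss, calculate the gradients, adjust the weights, repeat), here is a from-scratch sketch for the simplest possible case, a line y = w x + b with an averaged L2 loss. The function name, the data, and the learning rates are assumptions chosen for the example. The update uses a minus sign because it works with the actual gradient rather than the negative gradient.

```python
def train(data, lr=0.05, epochs=2000):
    # Start from arbitrary weights and improve them step by step.
    w, b = 0.0, 0.0
    n = len(data)
    losses = []
    for _ in range(epochs):
        # Run the training set through the model and measure the error.
        errors = [(w * x + b) - y for x, y in data]
        losses.append(sum(e * e for e in errors) / n)
        # Gradient of the L2 loss with respect to each parameter.
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, data)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Weight update: new value = old value - learning rate * gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

# Hypothetical data lying exactly on y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

w, b, losses = train(data)
print(w, b, losses[0], losses[-1])

# With a learning rate that is too high for this data, the loss grows instead.
_, _, diverging = train(data, lr=0.2, epochs=50)
print(diverging[0], diverging[-1])
```

Even for two parameters this takes some care to get right, which is part of why a library is used for real networks.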
But we also have enough control over what exactly we're feeding into this model. So for example, this line here is basically saying, okay, let's create a sequential neural net. So sequential is just, you know, what we've seen here, it just goes one layer to the next. And a dense layer means that all of them are interconnected. So here, this is interconnected with all of these nodes, and this one's all these, and then this one gets connected to all of the next ones, and so on. So we're going to create 16
dense nodes with relu activation functions. And then we're going to create another layer of 16 dense nodes with relu activation. And then our output layer is going to be just one node. Okay. And that's how easy it is to define something in TensorFlow. So TensorFlow is an open source library that helps you develop and train your ML models. Let's implement this for a neural net. So we're using a neural net for classification. Now, so our neural net model, we are going to use TensorFlow, and I don't think I imported that up here. So we are
going to import that down here. So I'm going to import tensorflow as tf. And enter. Cool. So my neural net model is going to be, I'm going to use this. So essentially, this is saying layer all these things that I'm about to pass in. So yeah, a linear stack of layers, layer them as a model. And what that means, nope, not that. So what that means is I can pass in some sort of layer, and I'm just going to use a dense layer. Oops, dot Dense. And let's say we have 32 units. Okay, I
will also set the activation as relu. And at first we have to specify the input shape. So here we have 10, and comma. Alright. Alright, so that's our first layer. Now our next layer, I'm just going to have another dense layer of 32 units all using relu. And that's it. So for the final layer, this is just going to be my output layer, it's going to just be one node. And the activation is going to be sigmoid. So if you recall from our logistic regression, what happened there was when we had a sigmoid, it looks
something like this, right? So by creating a sigmoid activation on our last layer, we're essentially projecting our predictions to be between zero and one, just like in logistic regression. And that's going to help us, you know, we can just round to zero or one and classify that way. Okay. So this is my neural net model. And I'm going to compile this. So in TensorFlow, we have to compile it. It's really cool, because I can just literally pass in what type of optimizer I want, and it'll do it. So here, if I go to optimizers, I'm actually
going to use Adam. And you'll see that, you know, the learning rate is 0.001. So I'm just going to use that default. So 0.001. And my loss is going to be binary cross entropy. And the metrics that I'm also going to include on here, so it already will consider loss, but I'm also going to tack on accuracy. So we can actually see that in a plot later on. Alright, so I'm going to run this. And one thing that I'm going to also do is I'm going to define these plot definitions. So I'm actually copying
and pasting this, I got these from TensorFlow. So if you go on to some TensorFlow tutorial, they actually have these defined. And that's exactly what I'm doing here. So I'm actually going to move this cell up, run that. So we're basically plotting the loss over all the different epochs. Epochs means, like, training cycles. And we're going to plot the accuracy over all the epochs. Alright, so we have our model. And now all that's left is, let's train it. Okay. So I'm going to
say history. So TensorFlow is great, because it keeps track of the history of the training, which is why we can go and plot it later on. Now I'm going to set that equal to this neural net model. And fit that with x train, y train, I'm going to make the number of epochs equal to let's say just let's just use 100 for now. And the batch size, I'm going to set equal to, let's say 32. Alright. And the validation split. So what the validation split does, if it's down here somewhere. Okay, so yeah, this validation
split is just the fraction of the training data to be used as validation data. So essentially, every single epoch, what's going on is TensorFlow saying, leave certain if this is point two, then leave 20% out. And we're going to test how the model performs on that 20% that we've left out. Okay, so it's basically like our validation data set. But TensorFlow does it on our training data set during the training. So we have now a measure outside of just our validation data set to see, you know, what's going on. So validation split, I'm going to
make that 0.2. And we can run this. So if I run that, all right, and I'm actually going to set verbose equal to zero, which means, okay, don't print anything, because printing something for 100 epochs might get kind of annoying. So I'm just going to let it run, let it train, and then we'll see what happens. Cool, so it finished training. And now what I can do is because you know, I've already defined these two functions, I can go ahead and I can plot the loss, oops, loss of that history. And I can also plot
the accuracy throughout the training. So this is a little bit ish what we're looking for. We definitely are looking for a steadily decreasing loss and an increasing accuracy. So here we do see that, you know, our validation accuracy improves from around point seven, seven or something all the way up to somewhere around point, maybe eight one. And our loss is decreasing. So this is good. It is expected that the validation loss and accuracy is performing worse than the training loss or accuracy. And that's because our model is training on that data. So it's adapting to
that data. Whereas the validation stuff is, you know, stuff that it hasn't seen yet. So, so that's why. So in machine learning, as we saw above, we could change a bunch of the parameters, right? Like I could change this to 64. So now it'd be a row of 64 nodes, and then 32, and then one. So I can change some of these parameters. And a lot of machine learning is trying to find, hey, what do we set these hyper parameters to? So what I'm actually going to do is I'm going to rewrite this so that
we can do something what's known as a grid search. So we can search through an entire space of hey, what happens if, you know, we have 64 nodes and 64 nodes, or 16 nodes and 16 nodes, and so on. And then on top of all that, we can, you know, we can change this learning rate, we can change how many epochs we can change, you know, the batch size, all these things might affect our training. And just for kicks, I'm also going to add what's known as a dropout layer in here. And what dropout is
doing is saying, hey, randomly choose, at this rate, certain nodes, and don't train them in a certain iteration. So this helps prevent overfitting. Okay, so I'm actually going to define this as a function called train_model. We're going to pass in x train, y train, the number of nodes, the dropout probability that we just talked about, the learning rate, so I'm actually going to say lr, the batch size, and we can also pass in the number of epochs, right? I mentioned that as a parameter. So indent this, so it goes under here.
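The dropout idea itself can be sketched in plain Python. This is a simplified illustration with a hypothetical function name, not the TensorFlow implementation: each node's output is switched off with probability `dropout_prob` during a training step, and the surviving outputs are scaled up by 1 / (1 - dropout_prob) so the expected total stays the same, which is the scaling the Keras Dropout layer also applies during training.

```python
import random

def apply_dropout(activations, dropout_prob, rng):
    # dropout_prob must be less than 1, otherwise every node would be dropped.
    kept = []
    for a in activations:
        if rng.random() < dropout_prob:
            # This node is turned off for the current training step.
            kept.append(0.0)
        else:
            # Survivors are scaled so the expected sum is unchanged.
            kept.append(a / (1.0 - dropout_prob))
    return kept

rng = random.Random(0)  # seeded only so the sketch is repeatable
layer_output = [1.0] * 10
print(apply_dropout(layer_output, 0.2, rng))
```

A different random subset is dropped on every call, so no single node can be relied on too heavily during training.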
And with these two, I'm going to set this equal to number of nodes. And now with the two dropout layers, I'm going to set dropout prob. So now you know, the probability of turning off a node during the training is equal to dropout prob. And I'm going to keep the output layer the same. Now I'm compiling it, but this here is now going to be my learning rate. And I still want binary cross entropy and accuracy. We are actually going to train our model inside of this function. But here we can do the epochs equal
epochs, and this is equal to whatever, you know, we're passing in. x train, y train belong right here. Okay, so those are getting passed in as well. And finally, at the end, I'm going to return this model and the history of that model. Okay. So now what I'll do is let's just go through all of these. So let's keep epochs at 100. And now what I can do is I can say, hey, for a number of nodes in, let's say, let's do 16, 32 and 64, to see what happens for the different dropout probabilities.
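The overall shape of a grid search like this can be sketched without any machine learning library. In the sketch below, `grid_search` and `toy_validation_loss` are hypothetical names, and the toy scoring function is an arbitrary stand-in for actually training a model and measuring its validation loss; it does not reflect any real result. The grid values are illustrative.

```python
import itertools

def grid_search(train_and_validate, grid):
    # Start at infinity so that any real score beats it.
    least_val_loss = float("inf")
    best_params = None
    names = list(grid)
    # Try every combination of the listed values.
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        val_loss = train_and_validate(**params)
        if val_loss < least_val_loss:
            least_val_loss = val_loss
            best_params = params
    return best_params, least_val_loss

grid = {
    "num_nodes": [16, 32, 64],
    "dropout_prob": [0, 0.2],
    "lr": [0.01, 0.005, 0.001],
    "batch_size": [32, 64, 128],
}

def toy_validation_loss(num_nodes, dropout_prob, lr, batch_size):
    # Arbitrary stand-in score, NOT real training: smallest at one fixed combination.
    return (abs(num_nodes - 32) / 32 + abs(dropout_prob - 0.2)
            + abs(lr - 0.005) * 100 + abs(batch_size - 64) / 64)

best_params, best_loss = grid_search(toy_validation_loss, grid)
print(best_params, best_loss)
```

In a real run, the stand-in would be replaced by a function that trains a model with the given settings and returns its validation loss, and the number of combinations (54 here) multiplies the training time accordingly.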
And I mean, zero would be nothing. Let's use 0.2 also, to see what happens. You know, for the learning rate in 0.005, 0.001. And you know, maybe we want to throw 0.1 in there as well. And then for the batch size, let's do 16, 32, 64 as well. Actually, let's also throw in 128. Actually, let's get rid of 16. Sorry, so 128 in there. That should be 0.01. I'm going to record the model and history using this train_model here. So we're going to do x train, y train, the number of nodes is going
to be, you know, the number of nodes that we've defined here, dropout, prob, LR, batch size, and epochs. Okay. And then now we have both the model and the history. And what I'm going to do is again, I want to plot the loss for the history. I'm also going to plot the accuracy. Probably should have done them side by side, that probably would have been easier. Okay, so what I'm going to do is split up, split this up. And that will be the subplots. So now this is just saying, okay, I want one row and
two columns in that row for my plots. Okay, so I'm going to plot on my axis one, the loss. I don't actually know this is going to work. Okay, we don't care about the grid. Yeah, let's let's keep the grid. And then now my other. So now on here, I'm going to plot all the accuracies on the second plot. I might have to debug this a bit. We should be able to get rid of that. If we run this, we already have history saved as a variable in here. So if I just run it on
this, okay, it has no attribute x label. Oh, I think it's because it's like set x label or something. Okay, yeah, so it's, it's set instead of just x label, y label. So let's see if that works. All right, cool. Um, and let's actually make this a bit larger. Okay, so we can actually change the figure size that I'm gonna set. Let's see what happens if I set that to. Oh, that's not the way I wanted it. Okay, so that looks reasonable. And that's just going to be my plot history function. So now I can
plot them side by side. Here, I'm going to plot the history. And what I'm actually going to do is, first, I'm going to print out all these parameters. So I'm going to use an f-string to print out all of this stuff. So here, I'm printing out how many nodes, the dropout probability, the learning rate. And we already know how many epochs, so I'm not even going to bother with that. So once we plot
this, let's actually also figure out what the validation loss is on our validation set that we created all the way back up here. Alright, so remember, we created three data sets. Let's have our model evaluate what the validation data set's loss would be. And I actually want to record, let's say I want to record whatever model has the least validation loss. So first, I'm going to initialize that to infinity so that, you know, any model will beat that score. So if I do float infinity,
that will set that to infinity. And maybe I'll keep track of the parameters. Actually, it doesn't really matter. I'm just going to keep track of the model. And I'm gonna set that to None. So now down here, if the validation loss is ever less than the least validation loss, then I am going to simply come down here and say, hey, the least validation loss is now equal to the validation loss. And the least loss model is whatever this model is that just earned that validation loss. Okay. So we are actually just going
to let this run for a while. And then we're going to get our least loss model after that. So let's just run. All right, and now we wait. All right, so we've finally finished training. And you'll notice that okay, down here, the loss actually gets to like 0.29. The accuracy is around 88%, which is pretty good. So you might be wondering, okay, why is this accuracy different from this one? Like, these are both the validation. So this accuracy here is on the validation data set that we've defined at the beginning, right? And this one here, this is
actually taking 20% of our training set every time during the training, and saying, okay, how much of it do I get right now? You know, after this one step where I didn't train with any of that. So they're slightly different. And actually, I realized later on that probably what I should have done is over here, when we were defining the model fit, instead of the validation split, you can define the validation data. And you can pass in the validation data, I don't know if this is the proper syntax. But
that's probably what I should have done. But instead, you know, we'll just stick with what we have here. So you'll see at the end, you know, with the 64 nodes, it seems like this is our best performance 64 nodes with a dropout of 0.2, a learning rate of 0.001, and a batch size of 64. And it does seem like yes, the validation, you know, the fake validation, but the validation loss is decreasing, and then the accuracy is increasing, which is a good sign. Okay, so finally, what I'm going to do is I'm actually just going
to predict. So I'm going to take this model, which we've called our least loss model, I'm going to take this model, and I'm going to predict x test on that. And you'll see that it gives me some values that are really close to zero and some that are really close to one. And that's because we have a sigmoid output. So if I do this, and what I can do is I can cast them. So I'm going to say anything that's greater than 0.5, set that to one. So if I actually, I think what happens if
I do this? Oh, okay, so I have to cast that using astype. And so now you'll see that it's ones and zeros. And I'm actually going to transform this into a column as well. So here I'm going to, oh, oops, I didn't mean to do that. Okay, no, I wanted to just reshape it to that. So now it's one dimensional. Okay. And using that we can actually just rerun the classification report based on this neural net output. And you'll see that, okay, the F1s, or the accuracy, gives us 87%. So
it seems like what happened here is the precision on class zero, so the hadrons, has increased a bit, but the recall decreased. But the F1 score is still at a good 0.81. And for the other class, it looked like the precision decreased a bit and the recall increased, for an overall F1 score that's also increased. I think I interpreted that properly. I mean, we went through all this work and we got a model that performs actually very, very similarly to the SVM model that we had earlier. And the whole point of
this exercise was to demonstrate, okay, these are how you can define your models. But it's also to say, hey, maybe, you know, neural nets are very, very powerful, as you can tell. But sometimes, you know, an SVM or some other model might actually be more appropriate. But in this case, I guess it didn't really matter which one we use at the end. An 87% accuracy score is still pretty good. So yeah, let's now move on to regression. We just saw a bunch of different classification models. Now let's shift gears into regression, the other type of
supervised learning. If we look at this plot over here, we see a bunch of scattered data points. And here we have our x value for those data points. And then we have the corresponding y value, which is now our label. And when we look at this plot, well, our goal in regression is to find the line of best fit that best models this data. Essentially, we're trying to let's say we're given some new value of x that we don't have in our sample, we're trying to say, okay, what would my prediction for y be for
that given x value. So that, you know, might be somewhere around there. I don't know. But remember, in regression, given certain features, we're trying to predict some continuous numerical value for y. In linear regression, we want to take our data and fit a linear model to this data. So in this case, our linear model might look something along the lines of here. Right. So this here would be considered as maybe our line of best fit. And this line is modeled by the equation, I'm going to write it down here, y = b0 + b1 x.
Now b0 just means it's this y intercept. So if we extend this line down here, this value here is b0, and then b1 defines the slope of this line. Okay. All right. So that's the formula for linear regression. And how exactly do we come up with that formula? What are we trying to do with this linear regression? You know, we could just eyeball where the line should be, but humans are not very good at eyeballing certain things like that. I mean,
we can get close, but a computer is better at giving us a precise value for b0 and b1. Well, let's introduce the concept of something known as a residual. Okay, so residual, you might also hear this being called the error. And what that means is, let's take some data point in our data set. And we're going to evaluate how far off is our prediction from a data point that we already have. So this here is our y, let's say, this is 1, 2, 3, 4, 5, 6, 7, 8. So this is y8, let's call it. You'll see that I
use this y i in order to represent, hey, just one of these points. Okay. So this here is y, and this here would be the prediction. Oops, this here would be the prediction for y8, which I've labeled with this hat. Okay, if it has a hat on it, that means, hey, this is my guess, this is my prediction for, you know, this specific value of x. Okay. Now the residual would be this distance here between y8 and y hat 8. So y8 minus y hat 8.
All right, because that would give us this here. And I'm just going to take the absolute value of this. Because what if it's below the line, right, then you would get a negative value, but distance can't be negative. So we're just going to put a little absolute value around this quantity. And that gives us the residual or the error. So let me rewrite that. And you know, to generalize to all the points, I'm going to say the residual can be calculated as the absolute value of y i minus y hat of
i. Okay. So this just means the distance between some given point, and its prediction, its corresponding prediction on the line. So now, with this residual, this line of best fit is generally trying to decrease these residuals as much as possible. So now that we have some value for the error, our line of best fit is trying to decrease the error as much as possible for all of the different data points. And that might mean, you know, minimizing the sum of all the residuals. So this here, this is the sum symbol. And if I just stick
the residual calculation in there, it looks something like that, right. And I'm just going to say, okay, for all of the i's in our data set, so for all the different points, we're going to sum up all the residuals. And I'm going to try to decrease that with my line of best fit. So I'm going to find the b0 and b1 which give me the lowest value of this. Okay. Now, sometimes in different circumstances, we might attach a squared to that. So we're trying to decrease the sum of the squared residuals.
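As a rough sketch of the two error measures just described (this is not code from the video), the plain-Python example below computes the sum of absolute residuals and the sum of squared residuals for a candidate line y = b0 + b1 x. The data points, the function names, and the hand-picked "eyeballed" line are all made up for illustration. The closed-form fit shown here minimizes the squared version only; minimizing the absolute version generally needs a different procedure.

```python
# Hypothetical toy data, not taken from the course data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def predict(b0, b1, x):
    """Prediction y_hat for one x on the line y = b0 + b1 * x."""
    return b0 + b1 * x


def sum_abs_residuals(b0, b1, xs, ys):
    """Sum over all points of |y_i - y_hat_i|."""
    return sum(abs(y - predict(b0, b1, x)) for x, y in zip(xs, ys))


def sum_squared_residuals(b0, b1, xs, ys):
    """Sum over all points of (y_i - y_hat_i)^2."""
    return sum((y - predict(b0, b1, x)) ** 2 for x, y in zip(xs, ys))


def fit_least_squares(xs, ys):
    """Closed-form b0 and b1 that minimize the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


b0, b1 = fit_least_squares(xs, ys)
eyeballed = (0.0, 2.5)  # a hand-picked guess to compare against

print("fitted b0, b1:", b0, b1)
print("squared error, fitted line:", sum_squared_residuals(b0, b1, xs, ys))
print("squared error, eyeballed line:", sum_squared_residuals(*eyeballed, xs, ys))
print("absolute error, fitted line:", sum_abs_residuals(b0, b1, xs, ys))
```

Comparing the fitted line against the eyeballed one shows the point made in the transcript: a computer can pick b0 and b1 more precisely than a guess by eye.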
And what that does is it just, you know, it adds a higher penalty for how far off we are from, you know, points that are further off. So that is linear regression, we're trying to find this equation, some line of best fit that will help us decrease this measure of error with respect to all the data points that we have in our data set, and try to come up with the best prediction for all of them. This is known as simple linear regression. And basically, that means, you know, our equation looks something like this. Now,
there's also multiple linear regression, which just means that hey, if we have more than one value for x, so like think of our feature vectors, we have multiple values in our x vector, then our predictor might look something more like this. Actually, I'm just going to say etc, plus b n, x n. So now I'm coming up with some coefficient for all of the different x values that I have in my vector. Now you guys might have noticed that I have some assumptions over here. And you might be asking, okay, Kylie, what in the world
do these assumptions mean? So let's go over them. The first one is linearity. And what that means is, let's say I have a data set. Okay. Linearity just means, okay, does my data follow a linear pattern? Does y increase as x increases? Or does y decrease as x increases? So if y increases or decreases at a constant rate as x increases, then you're probably looking at something linear. So what's the example of a nonlinear data set? Let's say I had data that might look something like that. Okay.
So now just visually judging this, you might say, okay, seems like the line of best fit might actually be some curve like this. Right. And in this case, we don't satisfy that linearity assumption anymore. So with linearity, we basically just want our data set to follow some sort of linear trajectory. And independence, our second assumption just means this point over here, it should have no influence on this point over here, or this point over here, or this point over here. So in other words, all the points, all the samples in our data set should be
independent. Okay, they should not rely on one another, they should not affect one another. Okay, now, normality and homoscedasticity, those are concepts which use this residual. Okay. So if I have a plot that looks something like this, and I have a plot that looks like this. Okay, something like this. And my line of best fit is somewhere here, maybe it's something like that. In order to look at these normality and homoscedasticity assumptions, let's look at the residual plot. Okay. And what that means is I'm going to keep my same x axis. But instead of plotting
now where they are relative to this y, I'm going to plot these errors. So now I'm going to plot y minus y hat like this. Okay. And now you know, this one is slightly positive, so it might be here, this one down here is negative, it might be here. So our residual plot, it's literally just a plot of how you know, the values are distributed around our line of best fit. So it looks like it might, you know, look something like this. Okay. So this might be our residual plot. And what normality means, so our
assumptions are normality and homoscedasticity, I might have butchered that spelling, I don't really know. But what normality is saying is, okay, these residuals should be normally distributed. Okay, around this line of best fit, it should follow a normal distribution. And now what homoscedasticity says is, okay, the variance of these points should remain constant throughout. So this spread here should be approximately the same as this spread over here. Now, what's an example of where, you know, homoscedasticity is not held? Well, let's say that our original plot actually looks something like this. Okay, so now if we
looked at the residuals for that, it might look something like that. And now if we look at this spread of the points, it decreases, right? So now the spread is not constant, which means that homoscedasticity, this assumption would not be fulfilled, and it might not be appropriate to use linear regression. So that's just linear regression. Basically, we have a bunch of data points, we want to predict some y value for those. And we're trying to come up with this line of best fit that best describes, hey, given some value x, what would be my best
guess of what y is. So let's move on to how do we evaluate a linear regression model. So the first measure that I'm going to talk about is known as mean absolute error, or MAE for short, okay. And mean absolute error is basically saying, all right, let's take all the errors. So all these residuals that we talked about, let's sum up the distance for all of them, and then take the average. And then that can describe, you know, how far off are we. So the mathematical formula for that would be, okay, let's take all the
residuals. Alright, so this is the distance. Actually, let me redraw a plot down here. So suppose I have a data set, look like this. And here are all my data points, right. And now let's say my line looks something like that. So my mean absolute error would be summing up all of these values. This was a mistake. So summing up all of these, and then dividing by how many data points I have. So what would be all the residuals, it would be y i, right, so every single point, minus y hat i, so the prediction
for that on here. And then we're going to sum over all of the different i's in our data set. Right, so i, and then we divide by the number of points we have. So actually, I'm going to rewrite this to make it a little clearer. So i is equal to whatever the first data point is all the way through the nth data point. And then we divide it by n, which is how many points there are. Okay, so this is our measure of MAE. And this is basically telling us, okay, on average,
this is the distance between our predicted value and the actual value in our training set. Okay. And MAE is good because it allows us to, you know, when we get this value here, we can literally directly compare it to whatever units the y value is in. So let's say y is, you know, the prediction of the price of a house, right, in dollars. Once we calculate the MAE, we can literally say, oh, the average amount we're off by is literally this many dollars. Okay. So
that's the mean absolute error. An evaluation technique that's also closely related to that is called the mean squared error. And this is MSE for short. Okay. Now, if I take this plot again, and I duplicate it and move it down here, well, the gist of mean squared error is kind of the same, but instead of the absolute value, we're going to square. So now the MSE is something along the lines of, okay, let's sum up something, right, so we're going to sum up all of our errors. So now I'm going to do y i minus y
hat i. But instead of absolute valuing them, I'm going to square them all. And then I'm going to divide by n in order to find the mean. So basically, now I'm taking all of these different values, and I'm squaring them first before I add them to one another. And then I divide by n. And the reason why we like using mean squared error is that it helps us punish large errors in the prediction. And later on, MSE might be important because of differentiability, right? So a quadratic equation is differentiable, you know, if you're familiar with
calculus, a quadratic equation is differentiable, whereas the absolute value function is not totally differentiable everywhere. But if you don't understand that, don't worry about it, you won't really need it right now. And now one downside of mean squared error is that once I calculate the mean squared error over here, and I go back over to y, and I want to compare the values. Well, it gets a little bit trickier to do that because now my mean squared error is in terms of y squared, right? It's this is now squared. So instead of just dollars, how,
you know, how many dollars off am I, I'm talking how many dollars squared off am I. And that, you know, to humans, it doesn't really make that much sense. Which is why we have created something known as the root mean squared error. And I'm just going to copy this diagram over here because it's very, very similar to mean squared error. Except now we take a big square root. Okay, so this is our MSE, and we take the square root of that mean squared error. And so now the term in which, you know, we're defining our
error is now in terms of that dollar sign symbol again. So that's a pro of root mean squared error is that now we can say, okay, our error according to this metric is this many dollar signs off from our predictor. Okay, so it's in the same unit, which is one of the pros of root mean squared error. And now finally, there is the coefficient of determination, or r squared. And this is a formula for r squared. So r squared is equal to one minus RSS over TSS. Okay, so what does that mean? Basically, RSS stands
for the sum of the squared residuals. So maybe it should be SSR instead, but RSS sum of the squared residuals, and this is equal to if I take the sum of all the values, and I take y i minus y hat, i, and square that, that is my RSS, right, it's a sum of the squared residuals. Now TSS, let me actually use a different color for that. So TSS is the total sum of squares. And what that means is that instead of being with respect to this prediction, we are instead going to take each y
value and just subtract the mean of all the y values, and square that. Okay, so if I drew this out, and if this were my actually, let's use a different color. Let's use green. If this were my predictor, so RSS is giving me this measure here, right? It's giving me some estimate of how far off we are from our regressor that we predicted. Actually, I'm gonna take this one, and I'm gonna take this one, and actually, I'm going to use red for that. Well, TSS, on the other hand, is saying, okay, how far off are
these values from the mean. So if we literally didn't do any calculations for the line of best fit, if we just took all the y values and average all of them, and said, hey, this is the average value for every single x value, I'm just going to predict that average value instead, then it's asking, okay, how far off are all these points from that line? Okay, and remember that this square means that we're punishing larger errors, right? So even if they look somewhat close in terms of distance, the further a few data points are, then
the larger our total sum of squares is going to be. Sorry, that was my dog. So the total sum of squares is taking all of these values and saying, okay, what is the sum of squares, if I didn't do any regressor, and I literally just calculated the average of all the y values in my data set, and for every single x value, I'm just going to predict that average, which means that, okay, like, that means that maybe y and x aren't associated with each other at all. Like the best thing that I can
do for any new x value is just predict, hey, this is the average of my data set. And this total sum of squares is saying, okay, well, with respect to that average, what is our error? Right? So up here, the sum of the squared residuals, this is telling us what is our error with respect to this line of best fit. Whereas our total sum of squares is saying what is the error with respect to, you know, just the average y value. And if our line of best fit is a better fit than this
total sum of squares baseline, that means that this numerator is going to be smaller than this denominator, right? And if our errors in our line of best fit are much smaller, then that means that this ratio of the RSS over TSS is going to be very small, which means that R squared is going to go towards one. And now when R squared is towards one, that's usually a sign that we have a good predictor. It's one of the signs, not the only one. So over here, I
also have, you know, that there's this adjusted R squared. And what that does, it just adjusts for the number of terms. So x1, x2, x3, etc. It adjusts for how many extra terms we add, because usually when we, you know, add an extra term, the R squared value will increase because that'll help us predict y some more. But the value for the adjusted R squared only increases if the new term actually improves this model fit more than expected, you know, by chance. So that's what adjusted R squared is. You know, the details are out of the
scope of this one specific course. And now that's linear regression. Basically, I've covered the concept of residuals or errors. And, you know, how do we use that in order to find the line of best fit? And you know, our computer can do all the calculations for us, which is nice. But behind the scenes, it's trying to minimize that error, right? And then we've gone through all the different ways of actually evaluating a linear regression model and the pros and cons of each one. So now let's look at an example. So we're still on supervised learning.
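To tie the evaluation measures together, here is a minimal sketch in plain Python (not code from the video) of MAE, MSE, RMSE, and R squared as they were described above. The true values, the predictions, and the function names are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

# Hypothetical true values and predictions (think of prices in dollars).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]


def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|, in the units of y."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (y_i - y_hat_i)^2, in units of y squared."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, back in the units of y."""
    return math.sqrt(mse(y_true, y_pred))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    y_mean = sum(y_true) / len(y_true)
    rss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # error vs. the predictions
    tss = sum((y - y_mean) ** 2 for y in y_true)             # error vs. the plain average
    return 1 - rss / tss


print("MAE: ", mae(y_true, y_pred))
print("MSE: ", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R^2: ", r_squared(y_true, y_pred))
```

With these numbers the residuals are 1, 0, -1, and -2, so the single residual of size 2 contributes 4 to the squared sum, which is the "punish large errors" effect mentioned for MSE.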
But now we're just going to talk about regression. So what happens when you don't just want to predict, you know, type 1, 2, 3? What happens if you actually want to predict a certain value? So again, I'm on the UCI machine learning repository. And here I found this data set about bike sharing in Seoul, South Korea. So this data set is predicting rental bike count. And here it's the count of bikes rented at each hour. So what we're going to do, again, you're going to go into the data folder, and you're going to download this CSV file.
And we're going to move over to Colab again. And here I'm going to name this FCC bikes and regression. I don't remember what I called the last one. But yeah, FCC bikes regression. Now I'm going to import a bunch of the same things that I did earlier. And, you know, I'm going to also continue to import the oversampler and the standard scaler. And then I'm actually also just going to let you guys know that I have a few more things I wanted to import. So this is a library that lets us copy things. Seaborn is a
wrapper over matplotlib. So it also allows us to plot certain things. And then just letting you know that we're also going to be using TensorFlow. Okay, so one more thing that we're also going to be using, we're going to use the sklearn linear model library. Actually, let me make my screen a little bit bigger. So yeah, awesome. Run this and that'll import all the things that we need. So again, I'm just going to, you know, give some credit to where we got this data set. So let me copy and paste this UCI thing. And
I will also give credit to this here. Okay, cool. All right, cool. So this is our data set. And again, it tells us all the different attributes that we have right here. So I'm actually going to go ahead and paste this in here. Feel free to copy and paste this, or if you want me to read it out loud so you can type it, it's bike count, hour, temp, humidity, wind, visibility, dew point temp, radiation, rain, snow, and functional, whatever that means. Okay, so I'm going to come over here and import my data by dragging and
dropping. All right. Now, one thing that you guys might actually need to do is you might actually have to open up the CSV because there were, at first, a few, like, forbidden characters in mine, at least. So you might have to get rid of, like, I think there was a degree symbol here, but my computer wasn't recognizing it. So I got rid of that. So you might have to go through and get rid of some of those labels that are incorrect. I'm going to do this. Okay. But after we've done that, we've imported in here, I'm
going to create a data frame from that. So, all right, so now what I can do is I can read that CSV file and I can get the data into here. So, like, data dot CSV. Okay, so now if I call data dot head, you'll see that I have all the various labels, right? And then I have the data in there. So from here, I'm actually going to get rid of some of these columns that, you know, I don't really care about. So here, when I
I type this in, I'm going to drop maybe the date, whether or not it's a holiday, and the various seasons. So I'm just not going to care about these things. Access equals one means drop it from the columns. So now you'll see that okay, we still have, I mean, I guess you don't really notice it. But if I set the data frames columns equal to data set calls, and I look at, you know, the first five things, then you'll see that this is now our data set. It's a lot easier to read. So another thing
is, I'm actually going to take df functional. And we're going to convert this. So remember that our computers are not very good at language, we want it to be in zeros and ones. So here, I will convert that. Well, if this is equal to yes, then that gets mapped as one. So then set type integer. All right. Great. Cool. So the thing is, right now, these bike counts are for whatever hour. So to make this example simpler, I'm just going to index on an hour, and I'm gonna say, okay, we're only going to use that
specific hour. So here, let's say, this data frame is only going to be data frame where the hour, let's say it equals 12. Okay, so it's noon. All right. So now you'll see that all the hours are equal to 12. And I'm actually going to now drop that column: hour, axis equals one. Alright, so we run this cell. Okay, so now we got rid of the hour in here. And we just have the bike count, the temperature, humidity, wind, visibility, and
yada, yada, yada. Alright, so what I want to do is I'm going to actually plot all of these. So for i in all the columns, so the range, length of whatever the data frame's columns are. Actually, bike count is my first thing. So what I'm going to do is say for a label in data frame columns, everything after the first thing, so that would give me the temperature and onwards. So these are all my features, right? I'm going to just scatter. So I want to see how that
label, how that specific data, how that affects the bike count. So I'm going to plot the bike count on the y axis. And I'm going to plot, you know, whatever the specific label is on the x axis. And I'm going to title this, whatever the label is. And, you know, make my y label the bike count at noon. And the x label as just the label. Okay, now, I guess we don't even need the legend. So just show that plot. All right. So it seems like functional is not really
doesn't really give us any utility. So then snow, rain. It seems like this radiation, you know, is fairly linear. Dew point temperature, visibility. Wind doesn't really seem like it does much. Humidity, kind of maybe like an inverse relationship. But the temperature definitely looks like there's a relationship between that and the number of bikes, right. So what I'm actually going to do is I'm going to drop some of the ones that don't seem like they really matter. So maybe wind, you know, visibility. Yeah, so I'm going to get rid of wind, visibility and functional. So now
data frame, and I'm going to drop wind, visibility, and functional. All right. And the axis again is the column. So that's one. So if I look at my data set, now, I have just the temperature, the humidity, the dew point temperature, radiation, rain, and snow. So again, what I want to do is I want to split this into my training, my validation and my test data set, just as we talked about before. Here, we can use the exact same thing that we just did. And we can say numpy dot split, and sample, you know, the
whole sample, and then create our splits of the data frame. And we're going to do that. But now set this to eight. Okay. So I don't really care about, you know, the full grid, the full array. So I'm just going to use an underscore for that variable. But I will get my training x and y's. And actually, I don't have a function for getting the x and y's. So here, I'm going to write a function, def get_xy. And I'm going to pass in the data frame. And I'm actually going to pass in
what the name of the y label is, and what specific x labels I want to look at. So here, if that's none, then I'm going to get everything from the data set that's not the y label. So here, I'm actually going to make first a deep copy of my data frame. And that basically means I'm just copying everything over. If x labels is none, so if not x labels, then all I'm going to do is say, all right, x is going to be whatever
this data frame is. And I'm just going to take all the columns. So c for c in data frame dot columns, if c does not equal the y label, right, and I'm going to get the values from that. But if there are x labels, well, okay, so in order to index only one thing, so like, let's say I pass in only one thing in here, then my data frame is, so let me make a case for that. So if the length of x labels is equal to one, then what I'm going to do is
just say that this is going to be the data frame at just that one x label, dot values, and I actually need to reshape to make this 2D. So I'm going to pass in negative one comma one there. Now, otherwise, if I have like a list of specific x labels that I want to use, then I'm actually just going to say x is equal to data frame of those x labels, dot values. And that should suffice. Alright, so now that's just me extracting x. And in order to get my y, I'm going to do y equals data
frame, and then pass in the y label. And at the very end, I'm going to say data equals np dot hstack. So I'm stacking them horizontally, one next to each other. And I'll take x and y, and return that. Oh, but this needs to be values. And I'm actually going to reshape this to make it 2D as well so that we can do this hstack. And I will return data, x, y. So now I should be able to say, okay, get x, y, and take that data frame. And the y label, so my y
label is bike count. And actually, for the x label, let's just do like one dimension right now. And earlier, I got rid of the plots, but we had seen that maybe, you know, the temperature dimension does really well. And we might be able to use that to predict y. So I'm going to label this also that, you know, it's just using the temperature. And I am also going to do this again for, oh, this should be train. And this should be validation. And this should be a test. Because oh, that's
Val. Right. But here, it should be Val. And this should be test. Alright, so we run this and now we have our training, validation and test data sets for just the temperature. So if I look at x train temp, it's literally just the temperature. Okay, and I'm doing this first to show you simple linear regression. Alright, so right now I can create a regressor. So I can say the temp regressor here. And then I'm going to, you know, make a linear regression model. And just like before, I can simply fit my x train temp,
y train temp in order to train this linear regression model. Alright, and then I can also print this regressor's coefficients and the intercept. So if I do that, okay, this is the coefficient for whatever the temperature is, and then the x intercept, okay, or the y intercept, sorry. All right. And I can, you know, score, so I can get the R squared score. So I can score x test and y test. All right, so it's an R squared of around 0.38, which is better than zero, which would
mean, hey, there's absolutely no association. But it's also not, you know, like, good, it depends on the context. But, you know, the higher that number, it means the higher that the two variables would be correlated, right? Which here, it's all right. It just means there's maybe some association between the two. But the reason why I want to do this one D was to show you, you know, if we plotted this, this is what it would look like. So if I create a scatterplot, and let's take the training. So this is our data. And then let's
make it blue. And then if I also plotted, so something that I can do is say, you know, the x range I'm going to plot is linspace, and this goes from negative 20 to 40, this piece of data. So I'm going to just say, let's take 100 things from there. So I'm going to plot x, and I'm going to take this temp regressor, and predict x with that. Okay, and this label, I'm going to label that the fit. And this color, let's make this red. And let's actually set the line width,
so I can, I can change how thick that value is. Okay. Now at the very end, let's create a legend. And let's, all right, let's also create, you know, title, all these things that matter, in some sense. So here, let's just say, this would be the bikes, versus the temperature, right? And the y label would be number of bikes. And the x label would be the temperature. So I actually think that this might cause an error. Yeah. So it's expecting a 2d array. So we actually have to reshape this. Okay, there we go. So I
just had to make this an array and then reshape it. So it was 2D. Now, we see that, all right, this increases. But again, remember those assumptions that we had about linear regression, like this, I don't really know if this fits those assumptions, right? I just wanted to show you guys though, that like, all right, this is what a line of best fit through this data would look like. Okay. Now, we can do multiple linear regression, right. So I'm going to go ahead and do that as well. Now, if I take my data set, and
instead of the labels, actually, what's my current data set right now? Alright, so let's just use all of these except for the bike count, right. So I'm going to just say for the x labels, let's just take the data frame's columns and just remove the bike count. So does that work? So this part should be, if x labels is none. And then this should work now. Oops, sorry. Okay, so I have, oh, but this here, because it's not just the temperature anymore, we should actually call this, let's say, all, right. So I'm just
going to quickly rerun this piece here so that we have our temperature only data set. And now we have our all data set. Okay. And this regressor, I can do the same thing. So I can do the all regressor. And I'm going to make this the linear regression. And I'm going to fit this to x train all and y train all. Okay. Alright, so let's go ahead and also score this regressor. And let's see how the R squared performs now. So if I test this on the test data set, what happens? Alright, so our R
squared seems to improve, it went from 0.4 to 0.52, which is a good sign. Okay. And I can't necessarily plot, you know, every single dimension. But this is just to say, okay, this is improved, right? Alright, so one cool thing that you can do with tensorflow is you can actually do regression, but with the neural net. So here, we already have our training data for just the temperature and just, you know, for all the different columns. So I'm not going to bother with splitting up
the data again, I'm just going to go ahead and start building the model. So in this linear regression model, typically, you know, it does help if we normalize the inputs. So that's very easy to do with TensorFlow: I can just create a normalizer layer. So I'm going to do tf.keras.layers and get the Normalization layer. And the input shape for that will just be one, because let's just do it again on just the temperature, and the axis I will make None. Now for this temp normalizer, and I should have had an equal sign there, I'm
going to adapt this to x train temp, and reshape this to just a single vector. So that should work great. Now with this model, so temp neural net model, what I can do is I can do, you know, tf.keras.Sequential. And I'm going to pass in this normalizer layer. And then I'm going to say, hey, just give me one single dense layer with one single unit. And what that's doing is saying, all right, well, one single node just means that it's linear. And if you don't add any sort of activation function to it, the
output is also linear. So here, I'm going to have tf.keras.layers.Dense. And I'm just going to have one unit. And that's going to be my model. Okay. So with this model, let's compile. And for our optimizer, let's use Adam again, so dot Adam, and we have to pass in the learning rate. So our learning rate, let's do 0.01. And now, the loss... actually, let's make this one 0.1. And the loss, I'm going to do mean squared error. Okay, so we run that, we've compiled it, okay, great.
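As a rough illustration of what that compiled model computes, here is a minimal NumPy sketch rather than the Keras code from the notebook: a normalization step, one unit with no activation (so just w times x plus b), and a mean squared error loss. The temperature and bike numbers are made up, and the function names are hypothetical.

```python
import numpy as np

def normalize(x, mean, std):
    """Shift and scale inputs, as a normalization step would."""
    return (x - mean) / std

def single_unit(x_norm, w, b):
    """One unit with no activation: the output is linear in the input."""
    return w * x_norm + b

def mse(y_pred, y_true):
    """Mean squared error between predictions and targets."""
    return np.mean(np.square(y_pred - y_true))

# Made-up data lying exactly on the line bikes = 20 * temp + 100.
temps = np.array([0.0, 10.0, 20.0, 30.0])
bikes = np.array([100.0, 300.0, 500.0, 700.0])

x_norm = normalize(temps, temps.mean(), temps.std())

# A flat guess (w = 0) predicts the mean bike count everywhere.
loss_flat = mse(single_unit(x_norm, 0.0, 400.0), bikes)

# The weight and bias that reproduce the line exactly in normalized units.
loss_exact = mse(single_unit(x_norm, 20.0 * temps.std(), 400.0), bikes)

print(loss_flat, loss_exact)
```

Training is then just a search for the w and b that push this loss down; the optimizer and learning rate only control how that search moves.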
And just like before, we can call this history. And I'm going to fit this model. So here, if I call fit, I'm going to take the x train with the temperature, but reshape it, and y train for the temperature. And I'm going to set verbose equal to zero so that it doesn't, you know, display stuff. I'm actually going to set epochs equal to, let's do 1000. And for the validation data, let's pass in the validation data set here as a tuple. And I know I spelled that wrong. So let's just
run this. And up here, I've copied and pasted the plot loss function from before but changed the y label to MSE, because now we're dealing with mean squared error. And I'm going to plot the loss of this history after it's done. So let's just wait for this to finish training and then to plot. Okay, so this actually looks pretty good. We see that the values are converging. So now what I can do is I'm going to go back up
and take this plot. And we are going to just run that plot again. So here, instead of this temperature regressor, I'm going to use the neural net regressor, this neural net model. And if I run that, I can see that, you know, this also gives me a linear regressor. You'll notice that this fit is not entirely the same as the one up here. And that's due to the training process of, you know, this neural net. So these are just two different ways to try to find the best linear regressor. Okay, but here we're
using backpropagation to train a neural net node, whereas in the other one, they probably are not doing that. Okay, they're probably just trying to actually compute the line of best fit. So, okay, given this, well, we can repeat the exact same exercise with our multiple linear regression. Okay, but I'm actually going to skip that part. I will leave that as an exercise to the viewer. Okay, so now what would happen if we use a neural net, a real neural net, instead of just, you know, one single node in order to predict this?
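To make the contrast between those two fitting procedures a bit more concrete, here is a small NumPy sketch on synthetic data (not the bike data set). It assumes the scikit-learn regressor behaves like an ordinary least-squares solve, which np.linalg.lstsq stands in for, and it uses plain gradient descent as a simplified stand-in for Adam and backpropagation on a single node. The helper name fit_gd is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 2 plus a little noise (made-up numbers).
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, size=200)

# Way 1: compute the line of best fit directly with least squares.
A = np.column_stack([x, np.ones_like(x)])  # columns: x and a constant
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
w_ls, b_ls = coeffs

# Way 2: start from zero and follow the gradient of the MSE.
def fit_gd(x, y, lr, epochs):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        err = w * x + b - y                # prediction error per sample
        w -= lr * 2.0 * np.mean(err * x)   # d(MSE)/dw
        b -= lr * 2.0 * np.mean(err)       # d(MSE)/db
    return w, b

w_short, b_short = fit_gd(x, y, lr=0.1, epochs=20)    # stopped early
w_long, b_long = fit_gd(x, y, lr=0.1, epochs=2000)    # run to convergence

print("least squares:", w_ls, b_ls)
print("gd, 20 epochs:", w_short, b_short)
print("gd, 2000 epochs:", w_long, b_long)
```

With only a few epochs the gradient-descent line is close to, but not the same as, the least-squares line, which is the kind of small mismatch seen between the two plots; with many more epochs the two agree.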
So let's start on that code. We already have our normalizer, so I'm actually going to take the same setup here. But instead of, you know, this one dense layer, I'm going to set this equal to 32 units. And for my activation, I'm going to use ReLU. And now let's duplicate that. And for the final output, I just want one answer. So I just want one cell. And this activation is also going to be ReLU, because I can't ever have less than zero bikes. So I'm just going to set that as ReLU. I'm just going to
name this the neural net model. Okay. And at the bottom, I'm going to have this neural net model, and I'm going to compile it. And I will actually use the same compile settings here. But instead of a learning rate of 0.01, I'll use 0.001. Okay. And I'm going to train this here. So the history is this neural net model, and I'm going to fit that against x train temp, y train temp, and for the validation data, I'm going to set this again equal to x val temp and y val
temp. Now, for verbose, I'm going to say equal to zero. Epochs, let's do 100. And here for the batch size, actually, let's just not do a batch size right now. Let's just try it. Let's see what happens here. And again, we can plot the loss of this history after it's done training. So let's just run this. And that's not what we're supposed to get. So what is going on? Here in Sequential, we have our temperature normalizer, which I'm wondering now if we have to redo that. Let's do that. Okay, so we do see this decline,
it's an interesting curve, but we do see it eventually. So this is our loss, which, all right, it's decreasing, that's a good sign. And actually, what's interesting is, let's just plot this model again. So here, instead of that. And you'll see that we actually have this, like, curve that looks something like this. So actually, what if I got rid of this activation? Let's train this again and see what happens. Alright, so even when I got rid of that ReLU at the end, it kind of knows, hey, you know, if it's not
the best model, if we had maybe one more layer in here, these are just things that you have to play around with. When you're, you know, working with machine learning, it's like, you don't really know what the best model is going to be. For example, this also is not brilliant. But I guess it's okay. So my point is, though, that with a neural net, I mean, this is not brilliant, but also there's like no data down here, right? So it's kind of hard for our model to predict. In fact, we probably should have started the
prediction somewhere around here. My point, though, is that with this neural net model, you can see that this is no longer a linear predictor, but yet we still get an estimate of the value, right? And we can repeat this exact same exercise with the multiple inputs. So let's do that. So here, if I now pass in all of the data, so this is my all normalizer, and I should just be able to pass in that. So let's move this to the next cell. Here, I'm going
to pass in my all normalizer. And let's compile it. Yeah, those parameters look good. Great. So here with the history, when we're trying to fit this model, instead of temp, we're going to use our larger data set with all the features. And let's just train that. And of course, we want to plot the loss. Okay, so that's what our loss looks like. So an interesting curve, but it's decreasing. So before, we saw that our R squared score was around 0.52. Well, we don't really have that with a neural net anymore. But one thing
that we can measure is, hey, what is the mean squared error, right? So if I come down here, and I compare the two mean squared errors, so I can predict on x test all, right? So these are my predictions using that linear regressor, well, the multiple linear regressor. So these are my y predictions for linear regression. Okay. I'm actually going to do that at the bottom. So let me just copy and paste that cell and bring it down here. So now I'm going to calculate the mean squared error for both the linear regressor and the
neural net. Okay, so this is my linear and this is my neural net. So if I do my neural net model, and I predict on x test all, I get my two, you know, different y predictions. And I can calculate the mean squared error, right? So if I want to get the mean squared error, and I have y prediction and y real, I can do numpy dot square, and then I would need the y prediction minus, you know, the real. So this is basically squaring every difference. And this should be a vector. So if I just
take this entire thing and take the mean of that, that should give me the MSE. So let's just try that out. And the y real is y test all, right? So that's my mean squared error for the linear regressor. And this is my mean squared error for the neural net. So that's interesting. I will debug this live, I guess. So my guess is that it's probably coming from this normalization layer, because this input shape is probably just six. And okay, so that works now. And the reason why is because, like, my inputs are only for
every vector, it's only a one-dimensional vector of length six. So I should have just had six, comma, from the start, which is not a tuple of size six; it's a tuple containing one element, which is a six. Okay, so it's actually interesting that my neural net results seem like they have a larger mean squared error than my linear regressor. One thing that we can look at is, we can actually plot the actual results versus what the predictions are. So if I say, some
axes, and I use plt dot axes, and make these equal, then I can scatter the y, you know, the test. So what the actual values are on the x axis, and then what the predictions are on the y axis. Okay. And I can label this as the linear regression predictions. Okay, so then let me just label my axes. So the x axis, I'm going to say is the true values. The y axis is going to be my linear regression predictions. Or actually, let's just make this predictions. And then at
the end, I'm going to plot. Oh, let's set some limits. Because I think that's like approximately the max number of bikes. So I'm going to set my x limit to this and my y limit to this. So here, I'm going to pass that in here too. And all right, this is what we actually get for our linear regressor. You see that actually, they align quite well, I mean, to some extent. So 2000 is probably too much 2500. I mean, looks like maybe like 1800 would be enough here for our limits. And I'm actually going to
label something else, the neural net predictions. Let's add a legend. So you can see that our neural net, for the larger values, seems like it's a little bit more spread out. And it seems like we tend to underestimate a little bit down here in this area. Okay. And for some reason, these are way off as well. But yeah, so we've basically used a linear regressor and a neural net. Honestly, there are times where a neural net is more appropriate and times where a linear regressor is more appropriate. I think that it just comes with time and
trying to figure out, you know, and just literally seeing like, hey, what works better, like here, a linear, a multiple linear regressor might actually work better than a neural net. But for example, with the one dimensional case, a linear regressor would never be able to see this curve. Okay. I mean, I'm not saying this is a great model either, but I'm just saying like, hey, you know, sometimes it might be more appropriate to use something that's not linear. So yeah, I will leave regression at that. Okay, so we just talked about supervised learning. And in
supervised learning, we have data, we have a bunch of features for a bunch of different samples. But each of those samples has some sort of label on it, whether that's a number, a category, a class, etc. Right, we were able to use that label in order to try to predict new labels of other points that we haven't seen yet. Well, now let's move on to unsupervised learning. So with unsupervised learning, we have a bunch of unlabeled data. And what can
we do with that? You know, can we learn anything from this data? So the first algorithm that we're going to discuss is known as k-means clustering. What k-means clustering is trying to do is it's trying to compute k clusters from the data. So in this example below, I have a bunch of scattered points. And you'll see that this is x zero and x one on the two axes, which means I'm actually plotting two different features, right, of each point, but we don't know what the y label is for those points. And now, just
looking at these scattered points, we can kind of see how there are different clusters in the data set, right. So depending on what we pick for k, we might have different clusters. Let's say k equals two, right, then we might pick, okay, this seems like it could be one cluster, but this here is also another cluster. So those might be our two different clusters. If we have k equals three, for example, then okay, this seems like it could be a cluster. This seems like it could be a cluster. And maybe this could be a cluster,
right. So we could have three different clusters in the data set. Now, this k here is predefined, if I can spell that correctly, by the person who's running the model. So that would be you. All right. And let's discuss how, you know, the computer actually goes through and computes the k clusters. So I'm going to write those steps down here. Now, the first step that happens is we actually choose, well, the computer chooses three random points on this plot to be the centroids. And by centroids, I just mean the centers of the clusters. Okay. So
three random points, let's say we're doing k equals three, so we're choosing three random points to be the centroids of the three clusters. If it were two, we'd be choosing two random points. Okay. So maybe the three random points I'm choosing might be here. Here, here, and here. All right. So we have three different points. And the second thing that we do is we actually calculate the distance for each point to those centroids. So between all the points and the centroid. So basically, I'm saying, all right, this is this distance, this distance, this distance, all
of these distances, I'm computing between, oops, not those two, between the points, not the centroids themselves. So I'm computing the distances for all of these points to each of the centroids. Okay. And that comes with also assigning those points to the closest centroid. What do I mean by that? So let's take this point here, for example. So I'm computing this distance, this distance, and this distance. And I'm saying, okay, it seems like the red one is the closest. So I'm actually going to put this into the red cluster. So if I do that for all
of these points, this one seems slightly closer to red, and this one seems slightly closer to red, right? Now for the blue, I actually wouldn't put any blue ones in here, but we would probably, actually, that first one is closer to red. And now it seems like the rest of them are probably closer to green. So let's just put all of these into green here, like that. And cool. So now we have, you know, our two, well, three, technically, clusters. So there's this group here, there's this group here. And then blue is kind of just this group
here, it hasn't really touched any of the points yet. So the next step, step three, is we actually go and we recalculate the centroids. So we compute new centroids based on the points that we have in each of the clusters. And by that, I just mean, okay, well, let's take the average of all these points. And where is that new centroid? That's probably going to be somewhere around here, right? The blue one, we don't have any points in there, so we won't touch it. And then the green one, we can put that probably somewhere over
here, oops, somewhere over here. Right. So now if I erase all of the previously computed centroids, I can go and I can actually redo step two over here, this calculation. Alright, so I'm going to go back and I'm going to iterate through everything again, and I'm going to recompute my three centroids. So let's see, we're going to take this red point, these are definitely all red, right? This one still looks a bit red. Now, this part, we actually start getting closer to the blues. So this one still seems closer to a blue than a green,
this one as well. And I think the rest would belong to green. Okay, so now our three clusters would be this, this, and then this, right? Those are our three clusters. So now we go back and we compute the three new centroids. So I'm going to get rid of this, this and this. And now where would this red be centered? Probably closer, you know, to this point here. This blue might be closer to
up here. And then this green would probably be somewhere... it's pretty similar to what we had before, but it seems like it'd be pulled down a bit. So probably somewhere around there for green. All right. And now, again, we go back and we compute the distance between all the points and the centroids. And then we assign them to the closest centroid. Okay. So the reds are all here, it's very clear. Actually, let me just circle that. And it actually seems like this point is closer to this blue
now. So the blues, maybe this point looks like it'd be blue. So all these look like they would be blue now. And the greens would probably be this cluster right here. So we go back, we compute the centroids, bam. This one probably, like, almost here, bam. And then the green looks like it would be probably here-ish. Okay. And now we go back and we compute the clusters again. So red is still this. Blue, I would argue, is now this cluster here. And green is this cluster here. Okay,
so we go and we recompute the centroids, bam, bam. And, you know, bam. And now if I were to go and assign all the points to clusters again, I would get the exact same thing. Right. And so that's when we know that we can stop iterating between steps two and three is when we've converged on some solution when we've reached some stable point. And so now because none of these points are really changing out of their clusters anymore, we can go back to the user and say, Hey, these are our three clusters. Okay. And this
process is something known as expectation maximization. This part where we're assigning the points to the closest centroid, this is our expectation step. And this part where we're computing the new centroids, this is our maximization step. Okay, so that's expectation maximization. And we use this in order to compute the centroids, assign all the points to clusters according to those centroids, and then recompute all of that over again, until we reach some stable point where nothing is changing anymore. Alright, so that's our first example of unsupervised learning. And basically, what this is doing is trying
to find some structure, some pattern in the data. So if I came up with another point, you know, it might be somewhere here, I can say, oh, if these are A, B, C, it looks like that's closest to cluster B. And so I would probably put it in cluster B. Okay, so we can find some structure in the data based on just how the points are scattered relative to one another. Now, the second unsupervised learning technique that I'm going to discuss with you guys is something known as principal component analysis. And
the point of principal component analysis is, very often it's used as a dimensionality reduction technique. So let me write that down. It's used for dimensionality reduction. And what I mean by dimensionality reduction is, if I have a bunch of features like x1, x2, x3, x4, etc., can I just reduce that down to one dimension that gives me the most information about how all these points are spread relative to one another? And that's what PCA is for. So PCA, principal component analysis. Let's say I have some points in the x zero and x one feature
space. Okay, so these points might be spread, you know, something like this. Okay. So for example, if this were something to do with housing prices, right, this here might be x zero might be hey, years since built, right, since the house was built, and x one might be square footage of the house. Alright, so like years since built, I mean, like right now it's been, you know, 22 years since a house in 2000 was built. Now principal component analysis is just saying, alright, let's say we want to build a model, or let's say we want
to, you know, display something about our data, but we don't have two axes to show it on. How do we display, you know, how do we demonstrate that this point is further away from this point than this point? And we can do that using principal component analysis. So take what you know about linear regression and just forget about it for a second. Otherwise, you might get confused. PCA is a way of trying to find the direction in the space with the largest variance. So this principal component, what that means is
basically the component, so some direction in this space, with the largest variance. Okay, it tells us the most about our data set without the two different dimensions. Like, let's say we have these two different dimensions, and somebody's telling us, hey, you only get one dimension in order to show your data set. What dimension do you want to show us? Okay, so what do we do? We want to project our data onto a single dimension. Alright, so that in this case might be a dimension
that looks something like this. And you might say, okay, we're not going to talk about linear regression, okay. We don't have a y value. So in linear regression, this would be y. This is not y, okay, we don't have a label for that. Instead, what we're doing is we're taking the right angle projection. So all of these, that's not very visible, but take this right angle projection onto this line. And what PCA is doing is saying, okay, map all of these points onto this one-dimensional space. So the transformed data set would be here. This
one's already on the line, so we just put that there. But now this would be our new one-dimensional data set. Okay, it's not our prediction or anything. This is our new data set. If somebody came to us and said, you only get one dimension, you only get one number to represent each of these 2D points, what number would you give us? This would be
the number that we gave. Okay, this in this direction, this is where our points are the most spread out. Right? If I took this plot, and let me actually duplicate this so I don't have to rewrite anything. Or so I don't have to erase and then redraw anything. Let me get rid of some of this stuff. And I just got rid of a point there too. So let me draw that back. Alright, so if this were my original data point, what if I had taken, you know, this to be the PCA dimension? Okay, well, I
then would have points that, let me actually do that in a different color. So if I were to draw a right angle to this for every point, my points would look something like this. And so just intuitively looking at these two different plots, this top one and this one, we can see that the points are squished a little bit closer together, right? Which means that that's not the direction with the largest variance. The thing about the largest variance is that this will give us the most discrimination between all of these points. The larger the
variance, the further spread out these points will likely be. And so that's the dimension that we should project it on. A different way to actually look at that, like, what is the dimension with the largest variance: it also happens to be the dimension that minimizes the residuals. So if we take all the points, and we take the residual from that, the x-y residual... So in linear regression, we were looking only at this residual, the differences between the predictions, right, between
y and y hat. It's not that here. In principal component analysis, we're taking the difference between our current point in two-dimensional space and its projected point. Okay, so we're taking that dimension, and we're saying, alright, how much distance is there in that projection residual, and we're trying to minimize that for all of these points. So that actually equates to this largest variance dimension, this dimension here. The PCA dimension, you can either look at it as minimizing, let me get rid of this, the projection residuals. So that's the stuff
in orange. Or as maximizing the variance between the points. Okay. And we're not really going to talk about, you know, the method that we need in order to calculate the principal components, or what that projection would be, because you will need to understand linear algebra for that, especially eigenvectors and eigenvalues, which I'm not going to cover in this class. But that's how you would find the principal components. Okay, now, with this two-dimensional data set here, sorry, this one-dimensional data set, we started from a 2D data set, and we now boil it
down to one dimension. Well, we can go and take that dimension, and we can do other things with it. Right, we can, like if there were a y label, then we can now show x versus y, rather than x zero and x one in different plots with that y. Now we can just say, oh, this is a principal component. And we're going to plot that with the y. Or for example, if there were 100 different dimensions, and you only wanted to take five of them, well, you could go and you could find the top five
PCA dimensions. And that might be a lot more useful to you than 100 different feature vector values. Right. So that's principal component analysis. Again, we're taking, you know, certain data that's unlabeled, and we're trying to make some sort of estimation, like some guess about its structure from that original data set, if we wanted to take, you know, a 3d thing, so like a sphere, but we only have a 2d surface to draw it on. Well, what's the best approximation that we can make? Oh, it's a circle. Right PCA is kind of the same thing. It's
saying if we have something with all these different dimensions, but we can't show all of them, how do we boil it down to just one dimension? How do we extract the most information from those multiple dimensions? And that is exactly it: either you minimize the projection residuals, or you maximize the variance. And that is PCA. So we'll go through an example of that. Now, finally, let's move on to implementing the unsupervised learning part of this class. Here, again, I'm on the UCI machine learning repository. And I have a seeds data set where, you know, I have
a bunch of kernels that belong to three different types of wheat. So there's Kama, Rosa and Canadian. And the different features that we have access to are, you know, geometric parameters of those wheat kernels. So the area, perimeter, compactness, length, width, asymmetry, and the length of the kernel groove. Okay, so all of these are real values, which is easy to work with. And what we're going to do is we're going to try to predict, or I guess we're going to try to cluster, the different varieties of the wheat. So let's get started. I have
a Colab notebook open again. Oh, you're gonna have to, you know, go to the data folder and download this. And let's get started. So the first thing to do is to import our seeds data set into our Colab notebook. So I've done that here. Okay, and then we're going to import all the classics again, so pandas. And then I'm also going to import seaborn because I'm going to want that for this specific class. Okay. Great. So now our columns that we have in our seed
data set are the area, the perimeter, the compactness, the length, width, asymmetry, groove length, I mean, I'm just going to call it groove. And then the class, right, the wheat kernel's class. So now we have to import this. I'm going to do that using pandas read CSV. And it's called seeds data dot csv. So I'm going to turn that into a data frame. And the names are equal to the columns over here. So what happens if I just do that? Oops, what did I call this? seeds_dataset.txt. Alright, so if we actually look at our
data frame right now, you'll notice something funky. Okay. And here, you know, we have all the stuff under area. And these are all our numbers with some backslash t. So the reason is because we haven't actually told pandas what the separator is, which we can do like this. And this backslash t, that's just a tab. So in order to ensure that, like, all whitespace gets recognized as a separator, we can actually, this is for, like, a space. So any spaces are going to get recognized as data separators. So if I run that, now this, you
know, this is a lot better. Okay. Okay. So now let's actually go and like visualize this data. So what I'm actually going to do is plot each of these against one another. So in this case, pretend that we don't have access to the class, right? Pretend that so this class here, I'm just going to show you in this example, that like, hey, we can predict our classes using unsupervised learning. But for this example, in unsupervised learning, we don't actually have access to the class. So I'm going to just try to plot these against one another
and see what happens. So for some i in range of, you know, the number of columns minus one, because the class is in the columns. And I'm just going to say for j in range, so take everything from i onwards, you know, so, like, the next thing after i until the end of this. So this will give us basically a grid of all the different, like, combinations. And our x label is going to be columns i, our y label is going to be columns j. So those are our labels up here. And I'm going to use seaborn
this time. And I'm going to say scatter my data. So our x is going to be our x label. Our y is going to be our y label. And our data is going to be the data frame that we're passing in. So what's interesting here is that we can say hue. And what this will do is, like, if I give this class, it's going to separate the three different classes into three different hues. So now what we're doing is we're basically comparing the area and the perimeter, or the area and the compactness. But we're
going to visualize, you know, what classes they're in. So let's go ahead and I might have to show. So great. So basically, we can see perimeter and area we give we get these three groups. The area compactness, we get these three groups, and so on. So these all kind of look honestly like somewhat similar. Right, so Wow, look at this one. So this one, we have the compactness and the asymmetry. And it looks like there's not really I mean, it just looks like they're blobs, right? Sure, maybe class three is over here more, but one
and two kind of look like they're on top of each other. Okay. I mean, there are some that might look slightly better in terms of clustering. But let's go through some of the clustering examples that we talked about, and try to implement those. The first thing that we're going to do is just straight up clustering. So what we learned about was k-means clustering. So from sklearn.cluster, I'm going to import KMeans. Okay. And just for the sake of being able to run, you know, any x and any y, I'm just
going to say, hey, let's use some x. What's a good one, maybe. I mean, perimeter asymmetry could be a good one. So x could be perimeter, y could be asymmetry. Okay. And for this, the x values, I'm going to just extract those specific values. Alright, well, let's make a k means algorithm, or let's, you know, define this. So k means, and in this specific case, we know that the number of clusters is three. So let's just use that. And I'm going to fit this against this x that I've just defined right here. Right. So, you
know, if I create this clusters variable, one cool thing is I can actually set it to kmeans.labels_. And it'll give me, if I can type correctly, what its predictions for all the clusters are. And our actual, oops, not that. If we go to the data frame, and we get the class, and the values from that, we can actually compare these two and say, hey, like, you know, in general, most of the zeros that it's predicted are the ones, right. And in
general, the twos are the twos here. And then this third class, one, okay, that corresponds to three. Now remember, these are separate classes. So the labels, what we actually call them, don't really matter. We can say, map zero to one, map two to two, and map one to three. Okay, and our, you know, our mapping would do fairly well. But we can actually visualize this. And in order to do that, I'm going to create this cluster data frame. So I'm going to create a data frame. And I'm going to pass in a horizontally
stacked array with X, so my values for x and y. And then the clusters that I have here, but I'm going to reshape them so it's 2D. Okay. And the columns, the labels for that, are going to be x, y, and class. Okay. So I'm going to go ahead and do that same seaborn scatter plot. Again, where x is x, y is y. And now, the hue is again the class. And the data is now this cluster data frame. Alright, so this here is my k-means, like, I guess, classes. So k-means
kind of looks like this. If I come down here and I plot, you know, my original data frame, these are my original classes with respect to this specific x and y. And you'll see that, honestly, like, it doesn't do too poorly. Yeah, I mean, the colors are different, but that's fine. For the most part, it gets the information of the clusters right. And now we can do that with higher dimensions. So with the higher dimensions, if we make X equal to, you know, all the columns, except for the last one, which is our class, we
can do the exact same thing. So here, and we can predict this. But now, our columns are equal to our data frame columns all the way up to the last one, and then class. So actually, we can literally just say the data frame's columns. And we can fit all of this. And now, if I want to plot the k-means classes. Alright, so that's my clustered and my original. So actually, let me see if I can get these on the same page. So yeah, I mean,
pretty similar to what we just saw. But what's actually really cool is even something like, you know, if we change. So what's one of them where they were, like, on top of each other? Okay, so compactness and asymmetry, this one's messy. Right. So if I come down here, and I say compactness and asymmetry, and I'm trying to do this in 2D, this is my scatter plot. So this is what, you know, my k-means is telling me for these two dimensions, for compactness and asymmetry. If we just look at those two, these are our three
classes, right? And we know that the original looks something like this. And are these two remotely alike? No. Okay, so now if I come back down here, and I rerun this higher dimensions one, but actually, this clusters, I need to get the labels of the k-means again. Okay, so if I rerun this with higher dimensions, well, if we zoom out, and we take a look at these two, sure, the colors are mixed up. But in general, the three groups are there, right? This does a much better job at assessing, okay, what group
is what. So, for example, we could relabel the one in the original class to two. And then we could make, sorry, okay, this is kind of confusing. But for example, if this light pink were projected onto this darker pink here, and then this dark one was actually the light pink, and this light one was this dark one, then you kind of see, like, these correspond to one another, right? Like even these two up here are the same class as all the other ones over here, which are in the same color. So you don't
want to compare the colors between the plots, you want to compare which points are in what colors in each of the plots. So that's one cool application. So this is how k-means functions: it's basically taking all the data and saying, all right, where are my clusters given these pieces of data? And then the next thing that we talked about is PCA. So with PCA, we're reducing the dimension, but we're mapping all these, like, you know, seven dimensions. I don't know if there are seven, I made that number up, but we're mapping multiple dimensions
into a lower dimension number. Right. And so let's see how that works. So from sklearn.decomposition, I can import PCA, and that will be my PCA model. So if I do PCA with n_components, this is how many dimensions you want to map it into. And you know, for this exercise, let's do two. Okay, so now I'm taking the top two dimensions. And my transformed x is going to be pca.fit_transform on the same X that I had up here. Okay, so all the other
all the values basically: area, perimeter, compactness, length, width, asymmetry, groove. Okay. So let's run that. And we've transformed it. So let's look at what the shape of X used to be. So there, okay. So seven was right: I had 210 samples, each seven features long, basically. And now my transformed x is 210 samples, but only of length two, which means that I only have two dimensions now that I'm plotting. And we can actually even take a look at, you know, the first five things. Okay, so now we see each one is a two
dimensional point, each sample is now a two dimensional point in our new dimensions. So what's cool is I can actually scatter these, column zero and column one of transformed x. So I actually have to take the columns here. And if I show that, basically, we've just taken this, like, seven dimensional thing, and we've made it into a two dimensional representation. So that's the point of PCA. And actually, let's go ahead and do the same clustering exercise as we did up here. If I take the k-means, this PCA data frame,
I can, let's construct a data frame out of that. And the data is going to be an hstack. I'm going to take this transformed x and the clusters, reshaped. So actually, instead of clusters, I'm going to use kmeans.labels_. And I need to reshape this so it's 2D, so we can do the hstack. And for the columns, I'm going to set this to PCA1, PCA2, and class. All right. So now if I take this, I can also do the same for the truth. But instead of the k-means
labels, I want the original classes from the data frame. And I'm just going to take the values from that. And so now I have a data frame for the k-means with PCA, and then a data frame for the truth, also with the PCA. And I can now plot these similarly to how I plotted these up here. So let me actually take these two. Instead of the cluster data frame, I want the k-means PCA data frame. This is still going to be class, but now x and y are going to be
the two PCA dimensions. Okay. So these are my two PCA dimensions. And you can see that, you know, they're pretty spread out. And then here, I'm going to go to my truth classes. Again, it's PCA1 and PCA2, but instead of the k-means one, this should be the truth PCA data frame. So you can see that, like, in the truth data frame along these two dimensions, we actually are doing fairly well
in terms of separation, right? It does seem like this is slightly more separable than the other, like, dimensions that we had been looking at up here. So that's a good sign. And up here, you can see that, hey, some of these correspond to one another. I mean, for the most part, our algorithm, our unsupervised clustering algorithm, is able to spit out, you know, what the proper labels are. I mean, if you map these specific labels to the different types of kernels. But for example, this one might all be the
Kama kernels, and same here. And then these might all be the Canadian kernels. So it does struggle a little bit with, you know, where they overlap. But for the most part, our algorithm is able to find the three different categories, and do a fairly good job at predicting them without any information from us. We haven't given our algorithm any labels. So that's the gist of unsupervised learning. I hope you guys enjoyed this course. I hope, you know, a lot of these examples made sense. If there
are certain things that I have done, and you know, you're somebody with more experience than me, please let me know in the comments and we can all as a community learn from this together. So thank you all for watching.
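For anyone following along without the notebook open, the k-means and PCA steps described above can be sketched roughly as below. This is a minimal sketch, not the exact code from the video: it assumes a synthetic stand-in for the seeds data generated with make_blobs (210 samples, 7 features, 3 groups), and the names X, y_true, kmeans_pca_df, truth_pca_df and contingency are illustrative. Plotting is left out; the two data frames could be passed to a seaborn scatter plot with hue set to class, as in the video.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Assumption: synthetic stand-in for the seeds data (210 samples, 7 features, 3 classes).
X, y_true = make_blobs(n_samples=210, n_features=7, centers=3, random_state=0)

# Fit k-means on all seven features, using the known number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
clusters = kmeans.labels_  # one cluster id per sample

# Map the seven features down to two principal components.
pca = PCA(n_components=2)
transformed_x = pca.fit_transform(X)

# Horizontally stack the 2D points with a label column; the labels are
# reshaped to a 2D column so that hstack accepts them.
columns = ["pca1", "pca2", "class"]
kmeans_pca_df = pd.DataFrame(
    np.hstack((transformed_x, clusters.reshape(-1, 1))), columns=columns
)
truth_pca_df = pd.DataFrame(
    np.hstack((transformed_x, y_true.reshape(-1, 1))), columns=columns
)

# Cluster ids are arbitrary, so compare which points are grouped together
# (a contingency table) rather than comparing the label values themselves.
contingency = pd.crosstab(truth_pca_df["class"], kmeans_pca_df["class"])

print(X.shape, transformed_x.shape)
print(contingency)
```

The contingency table plays the same role as comparing the two scatter plots by eye: each row is a true class, each column is a k-means cluster, and a good clustering puts most of each row into a single column, whatever that column happens to be called.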