(Singing) Sailing on a boat, headed towards StatQuest! Join me on this boat, let's go to StatQuest, it's super cool! Hello, and welcome to StatQuest. StatQuest is brought to you by the friendly folks in the genetics department at the University of North Carolina at Chapel Hill. Today we're going to be talking about linear regression, aka general linear models, part one. There are a lot of parts to linear models, but it's a really cool and powerful concept, so let's get right down to it. I promise you, I have lots and lots of slides that talk about all the nitty-gritty details behind linear regression, but first let's talk about the main ideas behind it. The first thing you do in linear regression is use least squares to fit a line to the data. The second thing you do is calculate R-squared. Lastly, you calculate a p-value for R-squared. There are lots of other little things that come up along the way, but these are the three most important concepts behind linear regression. In the StatQuest "Fitting a Line to Data" we talked about fitting a line to data, duh! But let's do a quick review.
I'm going to introduce some new terminology in this part of the video, so it's worth watching even if you've already seen the earlier StatQuest. That said, if you need more details, check that StatQuest out. For this review, we're going to be talking about a data set where we took a bunch of mice and measured their size and their weight. Our goal is to use mouse weight as a way to predict mouse size. First, draw a line through the data. Second, measure the distance from the line to the data, square each distance, and then add them up. Terminology alert! The distance from the line to a data point is called a residual. Third, rotate the line a little bit. With the new line, measure the residuals, square them, and then sum the squares. Now rotate the line a little bit more and sum up the squared residuals. Etc., etc., etc. We rotate and then sum the squared residuals, rotate and sum the squared residuals... just keep doing that. After a bunch of rotations, you can plot the sum of squared residuals and the corresponding rotations. In this graph we have the sum of squared residuals on the y-axis and the different rotations on the x-axis. Lastly, you find the rotation that has the least sum of squares. More details about how this is actually done in practice are provided in the StatQuest on fitting a line to data. So we see that this rotation is the one with the least squares, so it will be the one we fit to the data. Here is our least squares rotation superimposed on the original data. Bam! Now we know why the method for fitting a line is called "least squares". Now we have fit a line to the data. This is awesome! Here's the equation for the line. Least squares estimated two parameters: a y-axis intercept and a slope. Since the slope is not zero, it means that knowing a mouse's weight will help us make a guess about that mouse's size. How good is that guess?
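To make the "rotate the line and sum up the squared residuals" idea concrete, here is a minimal Python sketch. The mouse measurements are made up for illustration, and `np.polyfit` does the least squares fit; the helper shows that nudging the fitted slope in either direction only makes the sum of squared residuals bigger.

```python
import numpy as np

# Made-up mouse data for illustration: weight (x) and size (y).
weight = np.array([0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3])
size   = np.array([1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3])

def sum_of_squared_residuals(slope, intercept):
    """Measure the distance from the line to each point, square it, add them up."""
    residuals = size - (intercept + slope * weight)
    return np.sum(residuals ** 2)

# Least squares finds the slope and intercept with the smallest possible SSR.
slope, intercept = np.polyfit(weight, size, deg=1)
print(f"least squares line: size = {intercept:.2f} + {slope:.2f} * weight")

# "Rotating" the line away from the least squares slope only increases the SSR.
for s in (slope - 0.3, slope, slope + 0.3):
    print(f"slope {s:.2f} -> SSR = {sum_of_squared_residuals(s, intercept):.2f}")
```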
Calculating R-squared is the first step in determining how good that guess will be. The StatQuest "R-squared explained" talks about... you got it... R-squared! Let's do a quick review.
I'm also going to introduce some additional terminology, so it's worth watching this part of the video even if you've seen the original StatQuest on R-squared. First, calculate the average mouse size. Okay, I've just shifted all the data points to the y-axis to emphasize that, at this point, we are only interested in mouse size. Here I've drawn a black line to show the average mouse size. Bam! Now sum the squared residuals. Just like in least squares, we measure the distance from the mean to each data point, square it, and then add those squares together. Terminology alert! We'll call this SS(mean), for "sum of squares around the mean". Note: the sum of squares around the mean is SS(mean) = sum of (data - mean)^2. The variation around the mean is Var(mean) = sum of (data - mean)^2 / n, where n is the sample size; in this case, n = 9. The shorthand notation is: the variation around the mean equals the sum of squares around the mean divided by n, the sample size. Another way to think about variance is as the average sum of squares per mouse. Now go back to the original plot and sum the squared residuals around our least squares fit. We'll call this SS(fit), for "sum of squares around the least squares fit". The sum of squares around the least squares fit is SS(fit) = sum of (data - line)^2: the distances between the data and the line, squared and summed. Just like with the mean, the variation around the fit is Var(fit) = sum of (data - line)^2 / n, the sample size. The shorthand is: the variation around the fitted line equals the sum of squares around the fitted line divided by n, the sample size. Again, we can think of the variation around the fit as the average of the sum of squares around the fit for each mouse. In general, the variance of something equals the sum of squares divided by the number of those things; in other words, it's an average sum of squares. I mention this because it's going to come in handy in a little bit.
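In code, the two sums of squares and their variances look like this (reusing the made-up mouse data from the earlier sketch, and dividing by n as the video does):

```python
import numpy as np

# Same made-up mouse data as the earlier sketch.
weight = np.array([0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3])
size   = np.array([1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3])
n = len(size)  # sample size, n = 9 just like in the video

slope, intercept = np.polyfit(weight, size, deg=1)

# SS(mean): distances from the data to the mean, squared and summed.
ss_mean = np.sum((size - size.mean()) ** 2)

# SS(fit): distances from the data to the least squares line, squared and summed.
ss_fit = np.sum((size - (intercept + slope * weight)) ** 2)

# Dividing each sum of squares by n turns it into a variance (an average SS per mouse).
print(f"Var(mean) = {ss_mean / n:.2f}")
print(f"Var(fit)  = {ss_fit / n:.2f}")
```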
So keep it in the back of your mind. Okay, let's step back a little bit. This is the raw variation in mouse size, and this is the variation around the least squares line. There is less variation around the line that we fit by least squares; that is to say, the residuals are smaller. As a result, we say that some of the variation in mouse size is "explained" by taking mouse weight into account. In other words, heavier mice are bigger and lighter mice are smaller. R-squared tells us how much of the variation in mouse size can be explained by taking mouse weight into account. This is the formula for R-squared: R^2 = (Var(mean) - Var(fit)) / Var(mean). It's the variation around the mean minus the variation around the fit, divided by the variation around the mean. Let's look at an example. In this example, the variation around the mean equals 11.1 and the variation around the fit equals 4.4, so we plug those numbers into the equation. The result is that R-squared equals 0.6, which is the same thing as saying 60%.
This means there is a 60% reduction in the variance when we take mouse weight into account. Alternatively, we can say that mouse weight "explains" 60% of the variation in mouse size. We can also use the sums of squares to make the same calculation. This is because, when we're talking about variation, everything is divided by n, the sample size. Since everything is scaled by n, we can pull that term out and just use the raw sums of squares: R^2 = (SS(mean) - SS(fit)) / SS(mean). In this case, the sum of squares around the mean equals 100 and the sum of squares around the fit equals 40. Plugging those numbers into the equation gives us the same value we had before: R-squared equals 0.6, which equals 60%. 60% of the sums of squares of mouse size can be explained by mouse weight. Here's another example. We're going to go back to using variation in the calculation, since that's more common. In this case, knowing mouse weight means you can make a perfect prediction of mouse size. The variation around the mean is the same as it was before, 11.1, but now the variation around the fitted line equals 0, because there are no residuals. Plugging the numbers in gives us an R-squared equal to 1, which equals 100%. In this case, mouse weight explains 100% of the variation in mouse size. Okay, one last example. In this case, knowing mouse weight doesn't help us predict mouse size. If someone tells us they have a heavy mouse, well, that mouse could be either small or large with equal probability. Similarly, if someone said they had a light mouse, again, we wouldn't know if it was a big mouse or a small mouse, because each of those options is equally likely. Just like the other two examples, the variation around the mean equals 11.1. However, in this case the variation around the fit also equals 11.1, so we plug those numbers in and we get R-squared equals 0, which equals 0%. In this case, mouse weight doesn't explain any of the variation around the mean. When calculating the sum of squares around the mean, we collapsed the points onto the y-axis just to emphasize the fact that we were ignoring mouse weight, but we could just as easily draw the line y = average mouse size and calculate the sum of squares around the mean around that. In this example, we applied R-squared to a simple equation for a line, y = 0.1 + 0.78x. This gave us an R-squared of 60%, meaning 60% of the variation in mouse size could be explained by mouse weight. But the concept applies to any equation, no matter how complicated. First you measure, square, and sum the distances from the data to the mean. Then you measure, square, and sum the distances from the data to the complicated equation. Once you've got those two sums of squares, just plug them in and you've got R-squared.
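Here's that calculation as a tiny sketch, plugging in the sums of squares and variances from the video's three examples:

```python
def r_squared(around_mean, around_fit):
    """R^2 = (variation around the mean - variation around the fit) / variation around the mean.
    Works with either variances or raw sums of squares, since the 1/n terms cancel."""
    return (around_mean - around_fit) / around_mean

print(r_squared(100, 40))     # 0.6 -> weight explains 60% of the variation
print(r_squared(11.1, 4.4))   # ~0.6, the same answer using the variances
print(r_squared(11.1, 0))     # 1.0 -> the perfect-prediction example
print(r_squared(11.1, 11.1))  # 0.0 -> weight explains nothing
```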
Let's look at a slightly more complicated example. Imagine we wanted to know if mouse weight and tail length do a good job predicting the length of a mouse's body. So we measure a bunch of mice. To plot these data, we need a three-dimensional graph. We want to know how well weight and tail length predict body length. The first mouse we measured had weight = 2.1, tail length = 1.3, and body length = 2.5, so that's how we plot this data point on the 3D graph. Here's all the data in the graph. The larger circles are points that are closer to us and represent mice with shorter tails. The smaller circles are points that are further from us and represent mice with longer tails. Now we do a least squares fit. Since we have an extra term in the equation, representing an extra dimension, we fit a plane instead of a line. Here's the equation for the plane; the y value represents body length. Least squares estimates three different parameters. The first is the y-intercept; that's the body length when both tail length and mouse weight equal zero. The second parameter, 0.7, is for the mouse weight. The last term, 0.5, is for the tail length. If we know a mouse's weight and tail length, we can use the equation to guess the body length. For example, given the weight and tail length for this mouse, the equation predicts this body length. Just like before, we can measure the residuals, square them, and then add them up to calculate R-squared. Now, if tail length (here, the z-axis) is useless and doesn't make SS(fit) any smaller, then least squares will ignore it by making that parameter equal to zero. In that case, plugging the tail length into the equation would have no effect on predicting mouse size. This means that equations with more parameters will never make the sum of squares around the fit worse than equations with fewer parameters. In other words, this equation: mouse size = 0.3 + mouse weight + flip of a coin + favorite color + astrological sign + extra stuff... will never perform worse than this equation: mouse size = 0.3 + mouse weight. This is because least squares will cause any term that makes the sum of squares around the fit worse to be multiplied by 0 and, in a sense, it no longer exists. Now, due to random chance, there is a small probability that the small mice in the data set might get "heads" more frequently than the large mice. If this happened, then we'd get a smaller SS(fit) and a better R-squared. Wah-wah. Here's the frowny face of sad times.
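Here is a minimal sketch of that two-predictor fit. Only the first mouse's measurements come from the video; the rest of the data are invented for illustration, and `np.linalg.lstsq` estimates the plane's three parameters:

```python
import numpy as np

# Columns: mouse weight and tail length. The first row is from the video;
# the remaining rows are invented for illustration.
predictors = np.array([[2.1, 1.3],
                       [2.3, 1.6],
                       [3.0, 1.9],
                       [3.6, 2.4],
                       [4.1, 2.6],
                       [4.8, 3.1]])
body_length = np.array([2.5, 2.9, 3.5, 4.2, 4.6, 5.3])

# A column of ones lets least squares estimate the y-intercept as well.
design = np.column_stack([np.ones(len(predictors)), predictors])

# Fit the plane: body length = intercept + b1 * weight + b2 * tail length.
coefs, *_ = np.linalg.lstsq(design, body_length, rcond=None)
intercept, b_weight, b_tail = coefs

# Residuals work just like before: measure, square, and sum to get SS(fit).
ss_fit = np.sum((body_length - design @ coefs) ** 2)
print(f"body length = {intercept:.2f} + {b_weight:.2f}*weight + {b_tail:.2f}*tail")
print(f"SS(fit) = {ss_fit:.3f}")
```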
The more silly parameters we add to the equation, the more opportunities we have for random events to reduce SS(fit) and result in a better R-squared. Thus, people report an "adjusted R-squared" value that, in essence, scales R-squared by the number of parameters. R-squared is awesome, but it's missing something. What if all we had were two measurements? We'd calculate the sum of squares around the mean; in this case, that would be 10. Then we'd calculate the sum of squares around the fit, which equals 0. The sum of squares around the fit equals 0 because you can always draw a straight line that connects any two points. What this means is, when we calculate R-squared by plugging the numbers in, we're going to get 100%. 100% is a great number; we've explained all of the variation. But any two random points will give us the exact same thing.
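The video doesn't give the details of the adjustment, but the commonly reported formula penalizes R-squared for every extra parameter relative to the sample size. A sketch, where p is the number of predictors (not counting the intercept):

```python
def adjusted_r_squared(r2, n, p):
    """Shrink R^2 as the parameter count p grows relative to the sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same R^2 = 0.6 looks less impressive once silly parameters pile up.
print(adjusted_r_squared(r2=0.6, n=9, p=1))  # ~0.54
print(adjusted_r_squared(r2=0.6, n=9, p=4))  # 0.2
```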
It doesn't actually mean anything. We need a way to determine if the R-squared value is statistically significant. We need a p-value. Before we calculate the p-value, let's review the main concepts behind R-squared one last time. The general equation for R-squared is the variance around the mean minus the variance around the fit, divided by the variance around the mean. In our example, this means the variation in mouse size minus the variation after taking weight into account, divided by the variation in mouse size. In other words, R-squared equals the variation in mouse size explained by weight divided by the variation in mouse size without taking weight into account. In this particular example, R-squared equals 0.6, meaning we saw a 60% reduction in variation once we took mouse weight into account. Now that we have a thorough understanding of the ideas behind R-squared, let's talk about the main ideas behind calculating a p-value for it. The p-value for R-squared comes from something called F. F is equal to the variation in mouse size explained by weight divided by the variation in mouse size not explained by weight. The numerators for R-squared and for F are the same; that is to say, each is the reduction in variance when we take weight into account. The denominator is a little different. These dotted lines, the residuals, represent the variation that remains after fitting the line. This is the variation that is not explained by weight. So, together, we have the variation in mouse size explained by weight divided by the variation in mouse size not explained by weight. Now let's look at the underlying mathematics.
Just as a reminder, here's the equation for R-squared: R^2 = (SS(mean) - SS(fit)) / SS(mean). And this is the general equation that will tell us if R-squared is significant: F = ((SS(mean) - SS(fit)) / (p_fit - p_mean)) / (SS(fit) / (n - p_fit)). The meat of these two equations is very similar; both rely on the same sums of squares. Like we said before, the numerators are the same: in our mouse size and weight example, the numerator is the variation in mouse size explained by weight. And the sum of squares around the fit is just the residuals around the fitted line, squared and summed up, so that's the variation that the fit does not explain. These other numbers, (p_fit - p_mean) and (n - p_fit), are the degrees of freedom. They turn the sums of squares into variances. I'm going to dedicate a whole StatQuest to degrees of freedom, but for now, let's see if we can get an intuitive feel for what they're doing here. Let's start with p_fit. p_fit is the number of parameters in the fit line. Here's the equation for the fit line in a general format:
y = y-intercept + slope × x. We just have the y-intercept plus the slope times x. The y-intercept and the slope are two separate parameters; that means p_fit = 2. p_mean is the number of parameters in the mean line. In general, that equation is y = y-intercept; that's what gives us a horizontal line that cuts through the data. In this case, the y-intercept is the mean value. This equation has just one parameter; thus p_mean = 1. Both equations have a parameter for the y-intercept; however, the fit line has one extra parameter, the slope. In our example, the slope is the relationship between weight and size. In this example, p_fit - p_mean = 2 - 1 = 1. The fit has one extra parameter: mouse weight. Thus, the numerator is the variance explained by the extra parameter; in our example, that's the variance in mouse size explained by mouse weight. If we had used mouse weight and tail length to explain the variation in size, then we would end up with an equation that had three parameters, and p_fit would equal 3. Thus, p_fit - p_mean would equal 3 - 1 = 2. Now the fit has two extra parameters, mouse weight and tail length. With the fancier equation for the fit, the numerator is the variance in mouse size explained by mouse weight and tail length. Now let's talk about the denominator in our equation for F. The denominator is the variation in mouse size not explained by the fit; that is to say, it's the sum of squares of the residuals that remain after we fit the line to the data. Why divide the sum of squares around the fit by n - p_fit instead of just n?
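Putting those pieces together, here is the F calculation as a small sketch. The numbers reuse the sums of squares from the earlier 60% example just to show the mechanics (they are not the values behind the video's F = 6):

```python
def f_statistic(ss_mean, ss_fit, n, p_fit, p_mean=1):
    """F = (variance explained by the extra parameters) / (variance not explained by the fit)."""
    explained   = (ss_mean - ss_fit) / (p_fit - p_mean)  # numerator
    unexplained = ss_fit / (n - p_fit)                   # denominator
    return explained / unexplained

# Simple regression: p_fit = 2 (intercept + slope), p_mean = 1 (intercept only).
print(f_statistic(ss_mean=100, ss_fit=40, n=9, p_fit=2))  # 10.5
```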
Intuitively, the more parameters you have in your equation, the more data you need to estimate them. For example, you only need two points to estimate a line, but you need three points to estimate a plane. If the fit is good, then the variation explained by the extra parameters in the fit will be a large number, and the variation not explained by the extra parameters in the fit will be a small number. That makes F a really large number. Now, the question we've all been dying to know the answer to: how do we turn this number into a p-value? Conceptually, generate a set of random data. Calculate the mean and the sum of squares around the mean. Calculate the fit and the sum of squares around the fit. Now plug all those values into our equation for F, and that will give us a number; in this case, that number is 2. Now plot that number in a histogram. Then generate another set of random data, calculate the mean and the sum of squares around the mean, then calculate the fit and the sum of squares around the fit. Plug those values into our equation for F; in this case we get F = 3, so we plug that value into our histogram. Then we repeat with yet another set of random data; in this case we got F = 1, and that's plotted on our histogram. We just keep generating more and more random data sets, calculating the sums of squares, plugging them into our equation for F, and plotting the results on our histogram. Now imagine we did that hundreds, if not millions, of times. When we're all done with our random data sets, we return to our original data set and plug its numbers into our equation for F; in this case, we got F = 6. The p-value is the number of more extreme values divided by all of the values. So in this case, we take the value at F = 6 and the value at F = 7, and divide by all of the randomizations that we created. If this concept is confusing to you, I have a StatQuest that explains p-values, so check that one out. Bam! You can approximate the histogram with a line. In practice, rather than generating tons of random data sets, people use the line to calculate the p-value. Here's an example of one standard F-distribution that people use to calculate p-values; the degrees of freedom determine its shape. The red line represents another standard F-distribution that people use to calculate p-values. In this case, the sample size used to draw the red line is smaller than the sample size used to draw the blue line. Notice that when n - p_fit = 10, the distribution tapers off faster. This means that the p-value will be smaller when there are more samples relative to the number of parameters in the fit equation. Triple bam! Hooray, we finally got our p-value.
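Here is a sketch of both routes to the p-value: simulating the histogram of F values from random, unrelated data sets, and using the smooth F-distribution curve instead (via `scipy.stats.f`). The simulation settings are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p_fit, p_mean = 9, 2, 1

def f_from_data(x, y):
    """Compute F from the sums of squares around the mean and around the fit."""
    ss_mean = np.sum((y - y.mean()) ** 2)
    slope, intercept = np.polyfit(x, y, deg=1)
    ss_fit = np.sum((y - (intercept + slope * x)) ** 2)
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Route 1: the histogram. Generate lots of random data sets where x and y
# are unrelated, and compute F for each one.
random_fs = np.array([f_from_data(rng.normal(size=n), rng.normal(size=n))
                      for _ in range(10_000)])

f_observed = 6.0  # the F value from the video's original data set
print("simulated p-value:", np.mean(random_fs >= f_observed))

# Route 2: the line that approximates the histogram. The survival function
# gives the area under the F-distribution beyond the observed F.
print("F-distribution p-value:", stats.f.sf(f_observed, dfn=p_fit - p_mean, dfd=n - p_fit))
```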
Now let's review the main ideas. Given some data that you think are related, linear regression quantifies the relationship in the data. This is R-squared, and it needs to be large. It also determines how reliable that relationship is. This is the p-value that we calculated with F, and it needs to be small. You need both to have an interesting result. Hooray! We've made it to the end of another exciting StatQuest. Wow, this was a long one. I hope you had a good time. If you liked this StatQuest and want to see more like it, please subscribe to my channel. It's real easy: just click the red button. And if you have any ideas for StatQuests you'd like me to create, just put them in the comments below. That's all there is to it. Alright!