hey gang welcome to the third video on statistical inference this one being on hypothesis testing my name's Justin Zeltser and I'll put all the videos in the series up on zstatistics.com or if you're watching via university you'll be able to see them in each of the module pages so this will be module 3 now this is quite an extensive look at hypothesis testing so if you're new to it see if you can get through the video because it really is quite a complete picture of hypothesis testing and I'm gonna start with a bit
of a look into the intuition behind hypothesis testing it won't involve any numbers but I think it's quite a good way to start to get our minds in the right frame for it we're then going to look at an example and that's going to lead us through the rest of the video so we'll use that example in looking at all the other subtopics and after we get familiar with this example we're going to look at these first three subtopics which are all somewhat related the first one is about
the null hypothesis we'll be defining what that is and looking at the alternate hypothesis as well we'll be dealing with this concept of level of significance which is a very important concept we'll then look at this idea of a test statistic so in any given hypothesis test you generate one of these and in doing so you can also generate an accompanying p-value the much-maligned p-value in statistical circles at least we'll be looking at all three of these concepts which as I said are all related and what I mean by that is that they have the same
perspective which is the null hypothesis we'll see what that means when we get there but they all essentially assume the null hypothesis is true anyway after that we'll go on to discussing what a confidence interval is all about we've kind of loosely touched on this in previous videos in the series but the reason why I'm bringing it up again here is that it's kind of like an equal and opposite measure of the p-value they actually work in tandem but the difference being confidence intervals have the perspective of the sample whereas your p-value has
the perspective of your null hypothesis so look if these terms don't mean anything to you just hold on a couple of seconds and we'll get into it but if you've been to your lectures and you've sort of heard these terms thrown around I'm just giving you a bit of a picture as to where we're going to be dealing with them we're then going to look at significant treatment difference power and sample size as well and this is actually my favorite bit so if you can stick around to the power and sample size section I think
if I can say so I do a pretty good job of trying to give an intuitive look at power and sample size as well so with all that done we'll finally have a look at an example here a different one and it'll give you a chance to put what you have learned into practice and there's quite a lot of algebra here as well so there's a lot to get through but it should be a fun little journey and hopefully by the end you're going to be tip-top around
your hypothesis tests so let us begin with the intuition behind hypothesis testing so say that you think one dollar coins are tail biased and what I mean by that is that maybe they flip more tails than they do heads for whatever reason maybe there's subtly more coin on the head side of the coin so it weighs it down or something like that I don't know it's a silly example but at least it gives us something we can visualize the question is how you would test this hypothesis scientifically now you might tell me that firstly you
got to flip a bunch of coins obviously you got to take some kind of sample or do an experiment that's true you then have to get out a pen and paper and note the proportion of tails you have in your sample but then what what happens then how do you know whether yes this coin is now biased or no this coin is not biased well that's where hypothesis testing is going to come in and help you out it's going to help you solve this riddle of yours that you've constructed for yourself and I've put together
a visualization to help us out here this is the number of tails from 100 tosses of a fair coin now the important part of this sentence is the word fair there so if the coin was fair as in not biased you might expect this being the probability distribution of possible outcomes from those 100 tosses of the coin it's not as if you have to get 50 tails from your 100 tosses of the coin right you might get 51 or 52 or it's even possible to get say 60 or 61 tails out of a hundred
and have it still come from a fair coin it just so happened you had a very extreme sample now before we interrogate this further I'm gonna change this from the number of tails from 100 tosses to the proportion of tails from 100 tosses of a fair coin nothing else changes 50 now becomes 0.5 but this just allows us to generalize beyond this particular number of tosses of a coin so yes looking at this distribution you might notice that it's centered around 0.5 and that indeed is what we might expect to get from our
sample that's our expectation but say we had a sample that we got 52 tails in is that necessarily evidence of bias in this coin well no it seems quite reasonable that from a fair coin we could have just got 52 tails out of 100 but what then if we get a sample of 62 tails does that cast any doubt on the idea that the true probability is 0.5 well if you can see that this sample outcome casts more doubt on the coin's fairness than this sample outcome you've effectively got the right
mind state for hypothesis testing and that's in fact all we're doing we're assessing our sample for how extreme it is that's all a hypothesis test is it looks at a sample and asks how extreme is that sample and what will happen is we'll create this barrier or critical value beyond which we're happy to say yeah it's now becoming too extreme for us to really maintain this coin's fairness but if the sample lies on the left side of this so say we got this 52 tails outcome we might say yeah look it's greater than 50 but it's
not extreme enough it's well within the realms of possibility from a fair coin that my friends is a hypothesis test where your null hypothesis your sort of starting value might be that theta the true proportion of tails is 0.5 that's that black line there and we're gonna see if we have enough evidence we're gonna see if our sample is extreme enough to suggest that the true proportion is greater than 0.5 in other words this coin is biased so we're gonna use this sample of a hundred coin tosses to make an inference about the
true population probability theta now as I said this value here might be our critical value and beyond this point we're going to start saying yep that is now too extreme for us to realistically hold on to this null hypothesis of theta being 0.5 and that's actually called the rejection region that's the region in which we will be comfortable rejecting our null hypothesis now if you're a bit of a smartypants or you've paid attention in your lecture you might realize that there are two possible ways of looking at our alternate hypothesis you might note
we have a one-tailed alternate hypothesis here so we're only going to be rejecting this null hypothesis in one direction in other words if our tails vastly exceed our heads but it's also possible to do a two-tailed test as well and the difference might be instead of having a strict one-sided inequality in the alternate hypothesis this one actually goes both ways so we'll have two different critical values what this means is that if we have a sample that is extreme in either direction we're happy to reject our null hypothesis so it really depends on
what kind of question you're asking are you asking specifically about the coin being tail biased as in more tails than heads or are you asking whether the coin is just biased in general and if it's the latter we might have two rejection regions one on the extreme positive side and one on the extreme negative side but again all we're doing is looking at the sample value and saying how extreme is it is it too extreme for us to maintain this null hypothesis okay so I feel like I've massaged your skulls into the right shape so that
they are now perfectly absorbent for the actual nuts and bolts of all this so let's have a look at the example I provided here and these two examples I give in this particular lecture are both to do with the same surgical concern which is to operate or not to operate the first example deals with proximal humerus fractures in the elderly that's something in the shoulder I'm told can you tell I'm a statistician and not someone medically trained nonetheless we can do the stats on this we have two particular treatment options here we have
an operative treatment and we also have physio only so for the operative patients or the patients that were treated operatively 62 had positive outcomes out of a hundred after six months whereas those that were given physio only had 48 out of a hundred with an improved outcome after six months so this is a very simplified example obviously with nice round numbers so we don't have to get too numerically intense the question is can we conduct a full hypothesis test to answer the following questions is there evidence of a difference in outcomes between the two treatments
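Before going further, it may help to pin those numbers down; here's a quick sketch in Python (the variable names are my own, not from the video):

```python
# Raw counts from the example: positive outcomes at six months
operative_success, operative_n = 62, 100   # operative treatment group
physio_success, physio_n = 48, 100         # physio only group

p1_hat = operative_success / operative_n   # sample proportion, operative
p0_hat = physio_success / physio_n         # sample proportion, physio only

theta_hat = p1_hat - p0_hat                # observed difference in proportions
print(p1_hat, p0_hat, round(theta_hat, 2))
```

That 0.14 difference is the quantity the whole test will revolve around.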
and B say the orthopedic Society requires an outcome improvement of at least 5 percent before updating treatment protocols should they recommend surgery for proximal humerus fractures so here the idea is that it's not just good enough that operative treatment is slightly better than physio they need it to be at least 5 percent better than physio to start recommending surgery maybe there are additional risks to the patient when they go under the knife so those additional risks need to be offset by this additional buffer here anyway feel free to
actually if you think you're confident you know how to do these sorts of questions by all means stop the video here and have a go and see if you can come up with an answer but I will hint at this being a two-tailed test you can see we've asked is there evidence of a difference in outcomes so it doesn't matter which way it goes our hypothesis test here will be two-tailed all right let's tuck in let's talk about the null hypothesis so finally we actually get to deal with some of the theory here let's call p1 the probability of a positive outcome for the operative group and P naught the probability of a positive outcome for the non-operative group the physio only group so our null hypothesis here is that theta which is the population parameter for the difference between these two probabilities is equal to zero and our alternate hypothesis is that theta is not equal to zero now in general the way of thinking about a hypothesis test or at least the attitude towards hypothesis tests from statisticians is
that we're always very pessimistic so for example say we're trying to show that there's a difference between these two groups the first thing we do is we assume that there's no difference and see if there's enough evidence to infer that there is a difference we don't simply assume the thing we're trying to show is true and then try to disprove it no we're pessimistic we always assume the reverse is true so if we're trying to seek evidence for a difference we'll first assume that there's no difference
and then in our alternate hypothesis we'll finally have that there is a difference between these two groups so another way of saying that and this is a really good line is that whatever you're seeking evidence for goes in your alternate hypothesis we are seeking evidence to show that there's a difference between the two groups so that goes in our alternate hypothesis and I can almost see the question now getting typed in the comments section which is can you seek evidence to show that the difference is zero can you seek evidence for
sameness my answer is well no not really certainly not in very simple forms of statistics so our null hypothesis will always be some kind of equality and our alternate hypothesis is that inequality seeking that difference now the question is if the null hypothesis is true how will the sampling statistic be distributed now for the moment let's just say the sampling statistic is sort of theta with a hat on it what we can get from our sample which will be our sample's p1 minus our sample's P naught now if the null hypothesis is true how will it
be distributed well much like our slide on the intuition behind hypothesis testing it'll be distributed like a nice bell-shaped curve so again the important part of this is that the null hypothesis here is assumed to be true so if indeed there is no difference in the population between the two groups we should expect no difference between the two sample values and when I say expect I mean that that's the middle of the distribution due to random variation of course there will be some kind of probability distribution around that some kind of variance around this
sample value of zero but what variance how can you describe this distribution well the expected value of this distribution as I just said is zero it's the middle of this plot which makes sense because if the null hypothesis is true if you sampled a hundred patients that have the operative treatment and 100 patients that have physio only you would expect them to have the same proportion of successful outcomes that should be the middle of the distribution but what is the variance of this distribution well you'll note that we actually have two random variables here P
1 and P naught and this just involves a little bit of your recollection of probability theory the variance of a difference of two uncorrelated random variables is just the variance of one plus the variance of the other so the variance of a proportion is in fact that proportion times 1 minus the proportion divided by the number of observations and the same will be true for the physio only group now if you recall when we were looking at the contents of the whole video I noted that this particular subtopic among three subtopics takes the perspective of the null hypothesis
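If that variance rule sounds abstract, a quick simulation can sanity-check it; this is a sketch of my own in Python, with 0.62 and 0.48 used purely as illustrative true probabilities:

```python
import math
import random

random.seed(0)
p1, p0, n1, n0 = 0.62, 0.48, 100, 100   # illustrative true probabilities

def diff_in_proportions():
    # one simulated experiment: a sample proportion per group, then subtract
    p1_hat = sum(random.random() < p1 for _ in range(n1)) / n1
    p0_hat = sum(random.random() < p0 for _ in range(n0)) / n0
    return p1_hat - p0_hat

diffs = [diff_in_proportions() for _ in range(20_000)]
mean = sum(diffs) / len(diffs)
var = sum((d - mean) ** 2 for d in diffs) / len(diffs)

# variance of a difference of uncorrelated proportions = sum of the variances,
# where each proportion has variance p(1 - p) / n
theory = p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0
print(round(var, 4), round(theory, 4))  # the two should be very close
```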
we're assuming the null hypothesis is true so that means that P 1 and P naught are actually going to be the same so what we can do here given this distribution assumes the null hypothesis is true is change this P 1 and P naught into just P which is this sort of grouped mean or grouped proportion they're actually the same here and so we can simplify this a little bit to be P times 1 minus P times 1 on N 1 plus 1 on N naught so theta hat here
we go is distributed normally with a mean of 0 and a variance given by this now again I see a question someone I'm sure is going to be asking me why on earth is this distributed normally we have a difference between two proportions these are essentially binomial distributions why is this normal and this is the lovely thing about statistics as soon as you sample enough as soon as your sample size is big enough everything just becomes normal I feel like every statistician should have a CLT tattoo on them somewhere without it we would be much
worse off CLT standing for the central limit theorem which allows us to say that in large samples sampling statistics tend to become normally distributed all right so let's go in a little bit further what's going to happen here is that we're going to construct two separate rejection regions which in combination will sum up to 0.05 now why does it sum up to 0.05 well that's just our choice we're gonna label this thing called alpha as
0.05 and all that means is that this is the probability that we're willing to accept we're willing to take on board in rejecting a null hypothesis that might be true right let's just rewind that a little bit if we get a sample in this yellow region here we're going to be rejecting the null hypothesis that's how we've constructed this hypothesis test but realize that this distribution goes on for infinity so this black line the distribution which is the probability density function when the null hypothesis is true it goes on for infinity so it's still
possible if our sample difference is 0.15 or 0.2 it's still possible for this null hypothesis to be true yet our sample just be very extreme so essentially we create artificially mind you this region beyond which we're sure enough and so this yellow area represents the chance we're willing to take on board that we're actually going to be incorrectly rejecting a true null hypothesis now why 0.05 well they won't tell you this in your lecture but there's no reason at all why statisticians love 0.05 of course we're attracted to it because it's nice and
round but it's completely arbitrary someone chose 0.05 many years ago someone probably by the name of Fisher or Box or something like that and all of a sudden it kind of stuck and most statistical tests these days are done to the level of significance of 5% and that's what alpha represents there so it's essentially the probability of us being wrong in rejecting the null hypothesis and that's termed a type 1 error we'll be looking at type 1 and type 2 errors a little bit later on in this video as well but that's what it
means so what's going to happen is we can calculate this critical value this point beyond which we'll be rejecting the null hypothesis then calculate our test statistic to see where or in other words on which side of this critical value our test statistic lies so let's do it let's go to test statistics so as I said our sample difference is distributed normally with a mean of zero and this is our variance P times one minus P times one on n one plus one on n naught now if that's the case let's consider this test statistic
that we're going to call T where it's going to be the sample difference divided by the standard error of that sample difference so it's essentially theta hat divided by the square root of the variance now if theta hat itself is distributed normally with a mean of 0 and a variance of all this junk then T our test statistic will be distributed normally with a mean of 0 and a variance of 1 it's quite simple to prove if you take the variance of this expression it's just going to be the variance of theta hat which
is this divided by that squared which is in fact that again so it'll cancel out and you'll get 0 and 1 and the good thing about doing that is that we've now created a very standardized test statistic which we can compare to things like normal distribution tables another way of thinking about it is that the test statistic is just a scaled version of the sample difference it's just scaled by a factor of the standard deviation it's divided by the standard deviation so to drive that point home this was the original
distribution that we saw previously in the presentation notice that the sample difference here is that difference in proportions in our sample and there's what we drew on the previous slide as well but this axis here can be scaled so that it's the test statistic as well so it too has particular critical values beyond which we're going to be rejecting the null hypothesis and because this is standardized we know what this critical value is when you have a distribution with mean zero and variance one the critical value
is 1.96 this critical value is the point above which lies 2.5% of the distribution why two point five percent well of course it's split in half so five percent which is our level of significance divided by two is two point five percent so if we find from our sample that we get a test statistic greater than 1.96 we know we're in this rejection region in other words we know we have a sample that's too extreme to hold on to our null hypothesis so let's calculate that test statistic the test statistic I've used lowercase T here
to represent the actual calculated value you could do t hat if you like but that's the sample difference divided by the square root of P hat times 1 minus P hat on all that stuff now if you recall from the actual sample and maybe you can cycle back 62 out of a hundred of the operative patients had improved outcomes at six months versus 48 out of a hundred for their physio only patients so this pooled proportion is going to be that 62 plus 48 over 200 so that's 110 on 200 which is 0.55 so the
pooled proportion here is 0.55 and n1 is going to be 100 and n0 is also going to be a hundred and theta hat well that's going to be 0.14 because that's the difference between the two proportions it was 0.62 minus 0.48 now I could have put that all on the next line but I'm running out of room here so I've just calculated that for you it comes to 1.99 so that means that our calculated value of T is actually in the rejection region it's 1.99 versus 1.96 so very
close but we did manage to get it in the rejection region so if we zoom in here that's going to be 1.96 and our little test statistic lies somewhere there that little orange triangle so what does that mean well it means we can reject our null hypothesis so there's enough evidence to infer a true difference between the treatment options now the important thing here is that there's a specific level of significance we've used which is the 5% level of significance the test would be completely different had we used say a 1% level of significance
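The whole calculation can be reproduced in a few lines of Python using only the standard library; this is a sketch of my own, not code from the video:

```python
import math
from statistics import NormalDist

x1, n1 = 62, 100   # operative: positive outcomes out of 100
x0, n0 = 48, 100   # physio only: positive outcomes out of 100

theta_hat = x1 / n1 - x0 / n0        # sample difference: 0.14
p_pool = (x1 + x0) / (n1 + n0)       # pooled proportion under the null: 0.55

# standard error assuming the null hypothesis (a common proportion p_pool)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n0))
t = theta_hat / se                   # the test statistic

z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)   # two-tailed 5% critical value
p_value = 2 * (1 - NormalDist().cdf(t))       # two-tailed p-value

print(round(t, 2), round(z_crit, 2))  # 1.99 vs 1.96 just inside the rejection region
print(round(p_value, 3))              # about 0.047
```

The p-value printed at the end anticipates the next part of the video, where that 0.047 figure gets interpreted.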
that just means we're going to need to have a test statistic that's more extreme to reject the null hypothesis and in that case it's probably not going to reject so this fairly arbitrary value we select the strictness of the hypothesis test is actually quite important here and our decision hinges on what level of significance we've chosen but yes welcome to hypothesis testing nothing is ever proven it's only ever inferred and I think maybe that's a good little takeaway from this point as well never use the word prove or disprove because you can't do that
in stats the only thing you can do is infer and in this case we can infer a difference between the two treatment options it means our sample difference happened to be extreme enough for us to suggest that in this case the operative patients did better than the physio only patients so let's have a little chew on the p-value now the p-value is the proportion of repeated samples under the null hypothesis that would be as extreme as the test statistic we generated okay that seems like a mouthful but let's read it again the p-value is the proportion
of repeated samples under the null hypothesis that would be as extreme as the test statistic we generated so again like the previous two slides we assume the null hypothesis is true so the true difference is zero meaning that our expected sample difference is also zero the center of the distribution in other words is zero and of course when we scale it to this test statistic T the center of that distribution will also be zero and our alternate hypothesis is that the difference is non zero so here's the probability distribution again with
alpha being 0.05 and it's centered on zero because we assume the null hypothesis is true we found a test statistic of 1.99 which was just a shade to the right of this yellow bar here so we're just in the rejection region what a p-value is is the remaining shaded region beyond our test statistic what this red section is is the proportion of repeated samples under the null hypothesis so assuming the null hypothesis is true it's the proportion of samples that would have a more extreme test statistic than ours or I should say a test statistic
which is as extreme or more extreme than ours so our sample was extreme enough for us to reject the null hypothesis but there would be some samples that are even more extreme this red section summarizes all of those possible samples and if you were to add up that whole red section as a proportion of this distribution it's going to be 0.047 now it's actually very difficult to calculate that by hand a computer program can do it so for the purposes of this let's not worry about how that's calculated but you were probably going to
guess that it was going to be something very close to 0.05 right we only just rejected the null hypothesis from this test statistic here so the p-value had to be pretty close to 0.05 and in fact it had to be just slightly less than 0.05 that shaded red region would be slightly less than the shaded yellow region which is exactly 0.05 hopefully you can see the relationship between P and alpha if your p-value is less than alpha then there's enough evidence to reject the null hypothesis that means our test statistic must be in
the rejection region if P is greater than alpha then there's not enough evidence to reject the null hypothesis so the p-value is a really quick way of looking at how extreme our sample is given the null hypothesis is true so obviously if your p-value is very very low very very close to zero you're saying that it's really unlikely to get a sample more extreme than ours right so in this case P is 0.047 there's a bit of a typo there on the slide it should say 0.05 so with your mental powers turn that into a
0.05 please as the p-value is less than 0.05 we can reject the null hypothesis at the 5% level of significance and say there's enough evidence to suggest a difference in proportions P 1 and P naught so the good thing about p-values is that they can save you the trouble of doing separate hypothesis tests for different levels of significance so we know because the p-value is 0.047 that we're going to reject the null hypothesis if the level of significance is 5% but we're not going to reject the null hypothesis if the level
of significance was 1% because if we're being stricter about whether we reject the null hypothesis this yellow region becomes a lot smaller and our test statistic will no longer lie in that rejection region so yes as I said hopefully you're realizing that all three of these concepts are related well four of the concepts if you include these as two separate concepts the null hypothesis your level of significance your test statistic and p-value they're like four moving parts if you move one the others must shift and they all have the same common
feature of assuming the null hypothesis is true or I should say the perspective being that null hypothesis all right so let's now take a look at confidence intervals so as I've written here confidence intervals are constructed around the sample statistic theta hat or alternatively the calculated test statistic T so what this means is that we now have a perspective which is not the null hypothesis our perspective is what we calculate from the sample itself now the distinction here between theta hat and the calculated test statistic T is only one of scale and we saw that a
few slides ago but they're basically measuring the same thing just on a different scale so you can construct a confidence interval around the calculated test statistic as well but perhaps more commonly we just take the sample statistic which is our sample proportion or sample mean and then we can construct a confidence interval around that so in our case we have a sample statistic theta hat which was 0.14 remember that the proportion of those having the surgery that had positive outcomes was 0.62 and the proportion of those with the physio only
group getting successful outcomes was 0.48 so the difference is 0.14 and we can construct an interval around that 0.14 which will exist independently of the null hypothesis now to construct an interval around this particular value we need to know what the standard error is of this measure theta hat so what's the standard error of theta hat well theta hat is just the difference between two proportions so the standard error is going to be the sum of all this junk here and we can sub in the values that we got from our sample so 0.62 is
p1 with a hat on it 0.48 is P naught with a hat on it and then N 1 and N naught are both 100 so in subbing those values in we'll get a sense of the variation of our theta hat measure and that is the standard error if you sub everything in you get 0.0697 so with that we can now construct a confidence interval and here's a 95% confidence interval we take our theta hat and we add and subtract that standard error times that factor from the normal distribution
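In code, that construction (plus the part B check the video returns to shortly) can be sketched like this; again this is my own Python, using the sample numbers from the example:

```python
import math
from statistics import NormalDist

p1_hat, n1 = 0.62, 100   # operative group
p0_hat, n0 = 0.48, 100   # physio only group
theta_hat = p1_hat - p0_hat

# standard error from the sample's perspective (no pooling this time)
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p0_hat * (1 - p0_hat) / n0)

z = NormalDist().inv_cdf(0.975)   # 2.5% in each tail for a 95% interval
lo, hi = theta_hat - z * se, theta_hat + z * se

print(round(se, 4))                 # about 0.0697
print(round(lo, 4), round(hi, 4))   # about 0.0035 and 0.2765

# part B's question: is the interval entirely above a 5% improvement?
print(lo > 0, lo > 0.05)            # a real difference yes, a 5%-plus one no
```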
because we know due to the central limit theorem that this theta hat is going to be distributed normally in large samples at least so we add and subtract the appropriate value from the z distribution times the standard error and I've written here 0.975 because don't forget for a 95% confidence interval there's gonna be 2.5 percent in each of the tails of the distribution and if we add and subtract that value we get an interval which is 0.0035 to 0.2765 and to interpret that we can say we are 95% confident
that the interval from 0.35 percent I've just converted that to a percentage to 27.65 percent contains the true population difference theta now I think this is probably the best way of describing a confidence interval you might be tempted to say something like we're 95% confident that the true population difference lies between these two values my subtle gripe with that phrasing is that you're insinuating that the population difference is a variable whereas we know the population difference theta is a fixed value so saying it the way I've said it
here is maybe a bit lighter on that so you're basically saying that we're 95% confident that the interval overlaps this fixed value anyway that's for the sticklers out there but as you can see we've basically constructed an interval around the sample estimate so what's significant treatment difference all about well remember Part B said the orthopedic Society requires an outcome improvement of at least five percent before updating treatment protocols now the hypothesis we tested here was that theta was different from zero that was our alternate hypothesis that's what we were trying to seek
evidence for but in reality what the orthopedic Society wants us to do is to run a hypothesis test that might look a little bit like this we're seeking evidence for the true population difference to be greater than 0.05 greater than a five percent difference so it's a slightly different hypothesis that we're running nonetheless we can use what we've just done we can use the original hypothesis test and look at the p-value and the confidence interval to assess whether we'd reject this secondary hypothesis test so what do I mean by that well let's
have a look at the p-value we got we were testing the null hypothesis that theta was equal to zero and this plot is going to show us the confidence interval associated with this treatment difference so recall that it went from about 0.3% to 27% something like that so very close to zero but this represents a 95% confidence interval for the true treatment difference now here I'm going to provide you with four other theoretical samples so say we did four more samples and these were our confidence intervals you can see that if the p-value is say 0.58
it's going to be crossing zero we know that for sure because that's greater than 0.05 anytime the p-value is more than 0.05 we know this confidence interval must cross zero and the same goes here with another sample that gave us a p-value of 0.064 it's quite close to 0.05 but it's still greater than 0.05 so we know this confidence interval must cross zero our sample didn't cross zero but only just say we have another one with a p-value of 0.02 and this could be a very large sample for
example but it would have a very small confidence interval here nonetheless it's it's confined to one side of 0 and finally we have this test which is again all on one side of 0 the p-value is very very very low indeed so in which of these samples do you think will the orthopedic society be comfortable that the improvement is at least 5% he is 5% if I was to draw a line at 5% you can see that it's only this final sample where we're confident with our 95% confidence interval that it's all on one side
of this 5% difference our sample certainly doesn't achieve that even though we could be confident that the true treatment difference was greater than 0 we couldn't be confident at the 95% level that the treatment difference is greater than 5% you can see it crosses over 5% so that's why looking at confidence intervals as a little bit more malleable than your hypothesis test because that only gives you the outcome of one hypothesis that you're testing but here we can just look at this we can look at our test statistic create our interval and say you know
what the orthopedic Society won't be satisfied that the true treatment difference is greater than 5%, even though we're confident that operative treatment does better than non-operative treatment. Interestingly, the way I've constructed this, you can see that the sample with the p-value of 0.002 had a treatment difference of, say, 2.5% on average, and it's confined within this narrow region, so here we can actually be confident that the true treatment difference is greater than zero but less than five percent. Now, the only way that happens is if your sample size is really, really large, and perhaps that's what happened with this particular sample: we become a lot more confident of where the true treatment difference is. So in this case the orthopedic Society would say, well, that's good information to know, and we will not be recommending surgery here, because we can infer that the true improvement for surgery over the non-surgical treatment is less than five percent. All right, let's have a look at power and sample size now, which is my sort of favorite section, really. Before we get into any calculations, we're going to look at a simple table of outcomes.
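Before moving on, the confidence-interval reasoning from the example can be put into code. Here's a minimal Python sketch of the normal-approximation interval for a difference in proportions, checked against both zero and the 5% threshold; the specific inputs (60% vs 50% good outcomes in groups of 400) are hypothetical numbers chosen to mirror the scale of the example, not figures from an actual study.

```python
import math

def diff_ci(p1, n1, p0, n0, z=1.96):
    """95% confidence interval for a difference in proportions
    (normal approximation): (p1 - p0) +/- z * standard error."""
    diff = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    return diff - z * se, diff + z * se

# Hypothetical sample: 60% vs 50% good outcomes, 400 per group
lo, hi = diff_ci(0.60, 400, 0.50, 400)

crosses_zero = lo <= 0 <= hi   # mirrors the p >= 0.05 test of theta = 0
clears_5pct = lo > 0.05        # the stricter bar the society cares about
print(round(lo, 4), round(hi, 4), crosses_zero, clears_5pct)
```

With these numbers the interval comes out at roughly (0.031, 0.169): confidently above zero, but not confidently above 5%, which is exactly the situation described above.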
The two-by-two table: now, you might have seen this before, but if the null hypothesis is true, in other words there's no effect or no difference between the two groups, one of two things can happen: our hypothesis test either rejects the null hypothesis or doesn't reject it. And if the null hypothesis is in fact true, we would hope that we don't reject it, right? It would be wrong of us to reject the null hypothesis, but nonetheless we could possibly do that. So if we run our hypothesis test
and end up not rejecting, that's called a true negative, and the probability of that is given by 1 minus alpha. We actually know this probability: we set the probability of this outcome, and we also set the probability of having a false positive, what's called a type 1 error, where we reject a null hypothesis that happens to be true. We usually set that to 5%, though of course you can set it to 1% or 10%. But what happens if the null hypothesis is false? In the event that it's false, if we don't reject it, that's called a false negative; in other words, that's a type 2 error, which we denote beta. Again, we hope we don't do this: if the null hypothesis is false, in other words there is an effect or a difference between the two groups, we hope that we reject it. In that circumstance we have a true positive, which has probability one minus beta, and that's called power. So the power of the model is the probability of rejecting a false null hypothesis; in other words, it's the
ability of a model to detect a specified difference. But let's see how that interacts; let's see what happens with specified differences that are larger and smaller. This is where it gets a little bit interesting. This plot here is for our sample difference, theta hat. Let's zoom in a little bit. If we assume that theta is equal to zero, so that there is no difference between the two groups, our test statistic would sit in this black distribution here. But of course, when we construct the hypothesis test, we have an inkling that there might be some difference between the two groups; we have an inkling that operative patients might do better than those that receive physio-only treatment. So there's this theoretical difference between the two groups, which we're going to label with the Greek letter Delta. Now, it was only when I started teaching this particular topic that I saw this plot, and it just rammed home what the power of a statistical test is all about. So bear with me. Remember from the first part: if
we assume that the null hypothesis is true, we construct this rejection region; let's call it alpha. Now, if it's a one-tailed test, like perhaps it is here, alpha is only on one side, and if that's 5%, then this yellow region will be 5% of the whole black curve. That's all well and good: we can run our hypothesis test, see where our test statistic lies, and if it's in that yellow region we'll reject the null hypothesis. But if this red distribution is where the data really is, we're more likely than not to reject the null hypothesis, right? Most of the distribution is in the rejection region. So we've set up the null hypothesis essentially to fail; that's the whole point. We set up this null hypothesis and kind of hope that it fails, if we're really seeking evidence to show there's a difference. And if the difference really is of size Delta, more likely than not we are going to reject the null hypothesis, unless we end up getting a sample in the part of the distribution to the left of this black line. If we're in that part of the distribution, it just so happens that we will not reject the null hypothesis, so that is going to be our area beta. So while alpha is something we choose for ourselves, somewhat arbitrarily (we select, say, 0.05 because it's nice and round), beta actually depends on quite a lot of factors: it depends on alpha, but it also depends on how far apart these two curves are, so on what the theorized difference is, and it also depends on how skinny each of the two distributions is; I'll talk a bit more about that in a moment. If that area is beta, then the power of our test is one minus beta, and that's given by this green section here. So if there is indeed a true difference Delta, this green section is the power of our model to reject a false null hypothesis. You can see it was a step-by-step process: we first needed to figure out the rejection region for the boring old hypothesis test, and then we asked, all right, if
there is this actual difference in reality, what chance do we have of rejecting that null hypothesis? And the chance of that is 1 minus beta. OK, now looking at this section we're calling power, think about what happens if the theorized difference is enlarged. Let's say we thought the operative group was in reality going to do much better than the non-operative group; what happens is that these two curves get separated. Imagine you just pushed the two curves apart: you'll see that beta, the area we're calling beta here, is going to become a lot smaller, and the area that's 1 minus beta is going to become a lot larger; more of this distribution is going to lie to the right of this black line. So if the difference we're theorizing to be true in the population gets larger, we become more likely to find it, more likely to reject that null hypothesis. But if the theorized difference is very, very small, then these two bell curves are going to be quite close together, and it'll become less likely that we reject the null hypothesis; you're less likely to find the difference if the difference in reality is smaller. The second point to note is, for a given difference, let's just presume this difference is Delta here, what happens if you make each of the two curves skinnier? Think about it: you're squeezing the black curve together and squeezing the red curve together, but keeping their centers the same. Imagine you take your hands to the distributions and squeeze them both together: again, the proportion of the distribution to the right of that black line is going to increase, right? You're squeezing out this beta area, such that most of the curve is now going to be 1 minus beta. Well, what am I implying when I say we're making the curves skinnier? I'm implying that the sample size is increasing. When your sample size increases, don't forget, the variance of each of these (remember how we calculated them) always had some kind of "divided by n" in it, so as the sample size increases these two curves are going to
get a lot skinnier for a given difference, and in doing so you're going to become more likely to reject the null hypothesis if indeed there's a difference. So how do we summarize that? Power will increase for an increasing difference, and power will also increase for an increasing sample size at a given difference. So now we're going to have a look at Example 2, which will give us a chance to put into practice some of the stuff we've just learned on power and sample size. Be warned: it's going to get a little ugly.
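Before diving into the algebra, here's a hedged sketch of that summary in Python. The `power` function is my own naming, built from the normal approximation we've been using; it simply shows both effects, power rising with the size of the difference and with the sample size.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power(p0, p1, n_per_group, alpha=0.05):
    """Approximate power of a one-tailed two-proportion test."""
    z_crit = Z.inv_cdf(1 - alpha)                        # e.g. 1.645 at 5%
    v0 = 2 * p0 * (1 - p0) / n_per_group                 # variance if H0 true
    v1 = (p1 * (1 - p1) + p0 * (1 - p0)) / n_per_group   # variance if H1 true
    shift = (p1 - p0) / v0 ** 0.5                        # center of the "red" curve
    return 1 - Z.cdf((z_crit - shift) / (v1 / v0) ** 0.5)

small_diff = power(0.50, 0.55, 400)   # modest difference
large_diff = power(0.50, 0.60, 400)   # bigger difference: more power
more_data = power(0.50, 0.60, 800)    # same difference, bigger n: more power still
print(round(small_diff, 3), round(large_diff, 3), round(more_data, 3))
```

The printed powers increase from left to right, matching the two summary points above.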
In terms of algebra, that is. But this is a really good chance to really nail down some of these concepts, so I'm not too concerned that it might be a little heavy on the algebra here. So: to operate or not to operate? It's another example in the same sort of vein as our first one; this one has to do with ankle fractures in children. Again, the theory in real life is that ankle fractures in young patients may heal of their own accord once they get some physio and things like
that, so there might not be a need for putting them under the knife. Here we have a sample of 800 children with ankle fractures (that's a lot of fractured ankles), where 400 are provided operative treatment and 400 are provided physio only. The outcome of interest is the ability of the child to walk normally at 3 months. OK, so the first question we're going to ask is: what power will a hypothesis test possess to detect a 10% improvement in the operative cohort, if we assume 50% of the physio-only cohort will walk normally at 3 months? So there's the detectable difference: 10%. What power will this hypothesis test have to find that difference if indeed, say, 60% of the operative cohort will walk normally at 3 months? Where I say a 10% improvement here, I mean from 50% to 60%: what power will our hypothesis test have to detect such a difference? And I'm saying to use a one-tailed hypothesis test here, so we're only really interested in the one direction, where the operative cohort does better than the physio-only cohort. So that's Part A. Part B
will actually give us a chance to calculate a sample size given power. You can see that in the question we were given a sample size: 800, with 400 in each cohort. In Part B we're asked to relax that; I'm giving you a power of 90% and saying, calculate the sample size. I've said here that it's under a balanced design, and by that I mean the number of observations in each group is the same, which will simplify some of the calculations for us. So what total sample size, under a balanced design, would we need to provide a power of 90%? OK, so if you're really thinking you've got a handle on all this stuff, feel free to have a go at answering this. But we haven't quite dealt with some of these formulas yet, so I'm not too concerned if you want to just sit back and watch me do this, because it is quite difficult; having seen it done, you'll really get a sense of how these calculations work. And again, a forewarning: I tend not to just throw a formula at you and ask you to substitute figures into it. I'm really coming at this from first principles. Your textbook might provide a formula, like "sample size n equals blah", where you just sub everything in; I'm not approaching it like that. I'm going to take the first-principles approach, and hopefully the fruit of that will be you having a really solid understanding of how everything comes together. All right, so the first thing is to assess what the null hypothesis is going to be,
and that's that theta is equal to 0. Our alternate hypothesis will be that theta is greater than 0, because it's a one-tailed test. Now, what I've done here is put these two curves side by side again to really give us an impression of what's going on. The curve in black is the distribution of the test statistic we'd get if the null hypothesis is true: with theta being 0 (that's the difference between the two cohorts), our sample statistic will have a mean of 0 and some kind of variance about it. Now, if we're trying to say that in reality the true difference will be 10%, well, this red curve will result, and the question is going to be: how much of this red curve is to the right of this yellow line? As we said before, this shaded region here is going to be 5%, because the hypothesis test we're going to use assumes that the null hypothesis is true, and we will reject that null hypothesis if we're in this upper 5%. So if this red distribution is in reality what results, the proportion of that red distribution to the right of this yellow line will be the ability of our test to pick up this difference; it's the ability of our test to reject the null hypothesis given that this really is the true difference. Hopefully that's really ramming it home. Basically, that's what we're about to do: we're about to find the proportion under the red curve that lies to the right of that yellow line. The first question might be, well, we need to find the variance of both of these two
curves, because don't forget this yellow line depends on the variance of the black curve, and then of course the area of the red curve to the right of that line depends on the variance of the red curve, so we're going to need to know both. The variance of the black one assumes the null hypothesis is true, such that both proportions are the same in each cohort; that's why we have p(1 - p)/n1 + p(1 - p)/n0, which we saw in a previous section. The variance under H1, so that's this red curve here, is going to be p1(1 - p1)/n1 + p0(1 - p0)/n0, where we allow the proportions to be different in each case. Now, because we've actually specified the proportions (and this is going to help us with the calculations), we know what the proportion of good outcomes is for the non-operative group: we've said that it starts at 50%, that's what we expect for the physio-only group, and we're going to test whether we can detect a difference where our operative cohort goes up to 60%. So the value of p I'm going to put in here is 50%, and it's slightly different to what we used on a previous slide, where we kind of averaged the two; but don't forget the null hypothesis here isn't simply that the two are equal, we're saying that they're equal at 50%, we've specified that. So it's 0.5 times 0.5 divided by 400, plus 0.5 times 0.5 divided by 400, 400 being the number of observations in each of the respective groups. We can actually calculate that, and it's 0.00125; that's the variance of the black curve. For H1 we do the same thing, except the first term is 0.6 times 0.4 divided by 400, so the variance of the red curve, 0.001225, is just slightly less. Feel free to check my calculations if you want; I'm pretty satisfied with them. So now we're going to let T1, our test statistic,
be the calculated difference between the two groups, theta hat, divided by the square root of V_H0. Why is that the case? Well, don't forget that for a test statistic we take the difference and divide by the standard error, which in this case is the square root of the variance. That puts the black curve on the scale of a standardized normal distribution; remember from the last slide, this means that under the null, T1 will be distributed with a mean of 0 and a variance of 1. OK, so what we're going to do now is find the expected value of T1 and also the variance of T1. Bear with me, we'll see where this goes, but essentially what's happening is that we need to put both of these curves on the same scale. It's no good comparing apples with oranges; we want them on exactly the same scale, and that scale is going to be the test-statistic scale: T1 with a mean of 0 for the black curve and a variance of 1. So we've got to put the red curve on that same scale. The expected value of T1 is the expected value of theta hat over the square root of V_H0, and in this case that's 0.1 divided by the square root of V_H0; chuck it in there and you get 2.828. So this is the center of our red curve on the new scale: on the scale of our test statistic, the red curve has an expected value of 2.828. And what's the variance of that red curve on our new scale? We take the variance of T1, put the equation for T1 in the brackets, and you'll note that it's the variance of theta hat divided by the variance of this character down here. Now, that's a constant term, and remember your little rules about variance: if you take the variance of something divided by a constant, the constant comes out the front and gets squared. So the variance of T1, our red curve, is going to be the variance of theta hat divided by V_H0. The variance of theta hat here is V_H1, the variance of our treatment difference assuming a difference of 0.1, while V_H0 assumes a difference of zero. So we get V_H1 divided by V_H0, and because we have those calculated up on the top left, we can just sub them in and we get 0.98. So what does that mean? It means that this red curve, on the scale of this test statistic,
has an expected value of 2.828 and a variance of 0.98. This might seem quite confusing, but all we've really done is note that our hypothesis test proceeds with this black curve here, that we'll reject when we're up in this region, and that we wanted to know where this red curve lies on the same scale. Now we have it: on the scale of this test statistic, the red curve has a mean of 2.828 and a variance of 0.98. How handy is that? Now that we have all that information, we know that this point here is 1.645, because that is the critical value when you have a one-tailed test at the 5% level of significance, and we know that the expected value of this curve is 2.828. So all we're doing now is looking at a curve that has an expected value of 2.828 and a variance of 0.98, and we want to know what proportion of the curve is to the right of 1.645. This is the easy bit: we just take one minus Phi, where Phi gives you the cumulative distribution function for a given z-score (a standardized normal value). So we go 1.645 minus 2.828, divided by the standard deviation, which is the square root of 0.98, and the power is one minus the CDF of about -1.20, which happens to be 0.8841. The CDF is essentially just the amount of the distribution to the left of a given point on a normal distribution, so you can read it off tables or use the NORM.S.DIST function in Excel. Our answer is 0.8841. So going back here: 88.4% of this red distribution lies to the right of this yellow line, and that indeed is the power of this test.
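If you'd like to check the arithmetic, here's the whole Part A calculation as a short Python sketch, using the standard library's `NormalDist` in place of tables or Excel's NORM.S.DIST:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

# Variances of the estimated difference under each hypothesis (400 per group)
v_h0 = 0.5 * 0.5 / 400 + 0.5 * 0.5 / 400   # both proportions 50% -> 0.00125
v_h1 = 0.6 * 0.4 / 400 + 0.5 * 0.5 / 400   # 60% vs 50%          -> 0.001225

# The "red" curve placed on the test-statistic scale
e_t1 = 0.1 / v_h0 ** 0.5    # expected value: about 2.828
var_t1 = v_h1 / v_h0        # variance: 0.98

z_crit = Z.inv_cdf(0.95)    # 1.645, one-tailed 5% critical value
power = 1 - Z.cdf((z_crit - e_t1) / var_t1 ** 0.5)
print(round(power, 4))      # about 0.884
```

Each intermediate value here matches the numbers derived above from first principles.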
Now, textbooks might give you ready-made formulas for this that you can just sub numbers into, and you're certainly welcome to use those, but I wanted to try my hand at really giving you a visualization of what's going on, because I really don't like just saying "hey, use a formula" and moving on; stats is way more interesting than just using formulas all the time. So let's have a quick look at the answer for Part B. This one is probably more algebra-heavy, but it might be a little simpler conceptually. All we do here is
we reset our values of the expected value of T1 and the variance of T1; we've got those formulas there. What we're going to do is redo what we just did, but set the power to be 0.9 instead of the 0.884 we got in our case. So here we say 0.9 equals one minus the CDF of this stuff here, just subbing in the values for the expected value of T1 and the variance of T1 respectively: the expected value of T1 is this thing, and the variance of T1 is that, with a big square root on it. I'm doing a few steps at once here, but we're just multiplying top and bottom by the square root of V_H0, so I'll go through this pretty quickly. To get rid of the CDF function, we take the inverse CDF of 0.1, which gives -1.2816, so all the stuff in the brackets must equal that. Now it's about time we revisit V_H0
and V_H1. On the right here we'll have a quick look at what those were. V_H0 was this character here, or rather the square root of V_H0 happens to be this; but notice how we have n1 and n2. I mentioned in the question that we're looking for a balanced design, so this is where things simplify a little: n1 can be written as just n/2, the total number of observations divided by 2, and so can n2 (I've just realized I used n0 on the previous slide, but that's OK, same thing). So each of these will be the total number of observations divided by 2, because we have a balanced design, and if you work all that out you get 1 over the square root of n. Similarly for V_H1 you'll get, in this case, the square root of 0.98 over the square root of n. So I'll put that in a little box, and now we substitute the square root of V_H0 and the square root of V_H1 into our equation here on the left, rearrange, do a little bit of algebra, and we find that n is equal to 848.9. So the number of observations that allows us to have a power of 0.9 is 848.9. Now, I've actually rounded this up to 850, because you're going to have to have a whole number of people in each group, so it has to be divisible by 2, and of course when we do sample-size calculations you always round upwards.
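That rearrangement can be checked numerically. Here's a minimal Python sketch of the Part B sample-size calculation, under the same assumptions as above (50% vs 60%, balanced design, one-tailed 5% test, 90% power):

```python
import math
from statistics import NormalDist

Z = NormalDist()
z_alpha = Z.inv_cdf(0.95)   # 1.645 (one-tailed 5% test)
z_beta = Z.inv_cdf(0.90)    # 1.2816 (for 90% power)

# With p0 = 0.5 and p1 = 0.6: sqrt(V_H0) = 1/sqrt(n) and
# sqrt(V_H1) = sqrt(0.98)/sqrt(n), so the condition
# (z_alpha - 0.1*sqrt(n)) / sqrt(0.98) = -z_beta rearranges to:
n_exact = ((z_alpha + z_beta * math.sqrt(0.98)) / 0.1) ** 2
n_total = 2 * math.ceil(n_exact / 2)    # round up to an even total
print(round(n_exact, 1), n_total, n_total // 2)   # 848.9, 850, 425
```

The exact figure of 848.9 and the rounded total of 850 (425 per group) agree with the algebra above.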
So 850 is our n, meaning that we'll have 425 observations in each group. With a sample size slightly larger than what we had in the first question, we get a power that's also slightly larger; it gives us more power in our test. All right, congratulations, you made it! Hopefully you've got through unscathed, but if you've got any questions, feel free to put them down in the comments of the video, or you can post them on the Uni discussion board. And as I said
at the beginning, all the videos in the series and other statistics videos can be found at zstatistics.com. Thanks for watching!