The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

OK, so the course you're currently sitting in is 18.650, and it's called Fundamentals of Statistics. Until last spring it was still called Statistics for Applications, and it turned out that, based on the content, Fundamentals of Statistics was really a more appropriate title. So I'll tell you a little bit about what we're going to be covering in class, what this class is about, and what it's not about. I realize there are several offerings in statistics on campus, so I want to make sure that you've chosen the right one; I also understand that for some of you it's a matter of scheduling. I need to throw out a disclaimer: I tend to speak too fast. I'm aware of that, so someone in the back, just wave when you have no idea what I'm saying. Hopefully I will repeat myself many times, so if you
average over time, statistics will tell you that you will get the right message, the one I was actually trying to send.

All right, so what are the goals of this class? The first one is basically to give you an introduction. No one here is expected to have seen statistics before, but as you will see, you are expected to have seen probability, and usually you do see some statistics in a probability course, so I'm sure some of you have some ideas; but I won't expect anything. And we'll be using mathematics: it's a math class, so there's going to be a bunch of equations and not so much real data. And statistical thinking: we're going to try to provide theoretical guarantees. If I have two estimators available to me, how does theory guide me to choose the best of them? How certain can I be of my guarantees or predictions? It's one thing to just spit out a number; it's another thing to put some error bars around it, and we'll see how to build error bars, for example. Now, you will have your own applications, and I'm happy to answer questions about specific applications, but rather than trying to tailor applications to an entire Institute, I think we're going to work with pretty standard applications, mostly not very serious ones, and hopefully you'll be able to take the main principles back with you and apply them to your particular problem.

So what I'm hoping you will get out of this class is that when you have a real-life situation (and by real-life I mean mostly at MIT, so some people probably would not call that real life), the goal is to formulate a real problem in
mathematical terms. If I want to ask, "Is a drug effective?", that's not in mathematical terms: I have to find out which measure I want to use to call it effective. Maybe it's over a certain period of time, maybe it's over a certain population; there's a lot of things that you actually need. I'm not really going to tell you how to go from the application to the point you need to be at, but I will certainly describe to you at what point you need to be if you want to start applying this methodology. Then, once you understand what kind of question you want to answer (do I want a yes/no answer, do I want a number, do I want error bars, do I want to make predictions five years into the future, do I have side information or do I not have side information?), based on that, hopefully you will have a catalog of statistical methods that you're going to be able to use and apply in the real world. Also, no statistical method is perfect. Some of them people
have agreed upon over the years, and people understand that this is the standard, but I want you to be able to understand what the limitations are, and that when you make conclusions based on data, those conclusions might be erroneous, for example.

All right, more practically, my goal here is to have you ready. So who has taken, for example, a machine learning class here? All right, so quite a few of you, maybe a third, have taken a machine learning class. Statistics has somewhat evolved into machine learning in recent years, and my goal is to take you there. Machine learning has a strong algorithmic component, so maybe some of you have taken a machine learning class that displays mostly the algorithmic component, but there's also a statistical component: the machine learns from data. There are some statistical learning classes, which are statistical machine learning classes, that you can take here; they're offered at the graduate level, I believe. I want you to be ready to take those classes, having the
statistical fundamentals to understand what you're doing, and then you're going to be able to expand to broader and more sophisticated methods.

So lectures are here from 11:00 to 12:30 on Tuesday and Thursday. Victor-Emmanuel, your TA, will also be holding mandatory recitations, so please go on Stellar and pick your recitation; it's either 3 to 4 or 4 to 5 on Wednesdays, and it's going to be mostly focused on problem solving. They're mandatory in the sense that, well, we're allowed to do this; they're not
going to cover entirely new material, but they might cover some techniques that might save you some time when it comes to the exam. Attendance is not going to be taken or anything like that, but I highly recommend that you go, because since they're mandatory, you cannot really complain that something was taught only in recitation. So please register on Stellar for whichever of the two recitations you would like to be in; they're capped at 40, so first come, first served.

Homework will be due weekly; there's a total of 11 problem sets. I realize this is a lot, but hopefully we'll keep them light; I just want you to not rush too much. The ten best will be kept, and this will count for a total of thirty percent of the final grade. They are due Mondays at 8:00 p.m. on Stellar. And this is a new thing: we're not going to use the boxes outside of the math department; we're going to use only PDF files. You're always welcome to type them and practice your LaTeX or Word typing, but I also understand that this can be a bit of a strain, so just write them down on a piece of paper, use your iPhone, and take a picture of it; Dropbox has a nice new feature for this. Try to find something that gives a lot of contrast, especially if you use pencil, because we're going to check if they're readable, and it is your responsibility to submit a readable file. In particular, I've had over the years (not at MIT, I must admit) students who write a .doc file and think that converting it into a PDF consists in erasing the extension .doc and replacing it by .pdf. This is not how it works, and I'm sure you will figure it out. Please try to keep them letter-size; this is not a strict requirement, but I don't want to see thumbnails either. You are allowed two late homeworks, and by late I mean 24 hours late, no questions asked: you submit them, they will be counted, and you don't have to send an email warning us or anything like that. Beyond that, given that you have
one slack for one zero grade (the dropped problem set) plus the two late homeworks, you're going to have to come up with a very good explanation for why you need more extensions than that, if you ever do, and in particular you're going to have to keep track of why you've used your three options before.

There are going to be two midterms: one is October 3rd and one is November 7th. They're both going to be in class, for the duration of the lecture. When I say they last an hour and 20 minutes, it does not mean that if you arrive 10 minutes before the end of lecture you still get an hour and 20 minutes: the exam will end at the end of lecture time, for you as well. No pressure: only the best of the two will be kept, and this will count for 30% of the grade. They will be closed books and closed notes. The purpose is for you to... yes? [Student question] I said the best of the two will be kept. Yes, the best of the two, not the best two; we're not going to add them, multiply the number by nine, and make that your
grade. No, I am trying to be nice; there's just a limit to what I can do. All right, so the goal is for you to learn things and to be familiar with them. In the final you will be allowed to have your notes with you, but the midterms are also a way for you to develop some mechanisms so that you don't waste too much time on things that you should be able to do without thinking too much. You will be allowed a cheat sheet, because you can always forget something: it will be a two-sided letter-size sheet, you can exhaust yourself writing as small as you want, and you can put whatever you want on this cheat sheet. All right, the final will be scheduled by the registrar; it's going to be three hours, and it's going to account for 40%. You cannot bring books, but you can bring your notes. Yes? [Student question about the dates] Oh yeah, there's one that's missing on both of them... let's figure that out; the syllabus is the true one.
The slides are there so that we can discuss, but the dates on the syllabus are the ones that count, and I think they're also posted on the calendar on Stellar as well. Any other questions?

OK, so the prerequisites. Who has looked at the first problem set already? OK, so those of you whose hands are raised realize that there's a true prerequisite of probability for this class. It can be at the level of 18.600 or 6.041 (now it's two classes, I should say). I will also require you to know some calculus and to have some notions of linear algebra, such as: what is a matrix, what is a vector, how do you multiply those things together, some notion of what orthonormal vectors are. I'll remind you about eigenvectors and eigenvalues; I'll remind you of all of that, so this is not a strict prerequisite, but if you've taken linear algebra, for example, it doesn't hurt to go back to your notes when we get closer to the chapter on principal component analysis. The chapters, as they're listed in the syllabus, are in order, so you will
see when it actually comes. There's no required textbook, and I know this is a bit of an issue; I know you tend to like having a textbook, to know where you're going and what we're doing. I'm sorry, it's just that for this class I would either have to go to a mathematical statistics textbook, which is just too much, or to a more engineering-type statistics textbook, which is just too little. Hopefully the problem sets will be enough for you to practice; the recitations will have some problems to solve as well, and the material will be posted on the slides, so you should have everything you need. There are plenty of resources online if you want to expand on a particular topic or read it as presented by somebody else. The book that I recommend in the syllabus is the book called All of Statistics by Wasserman, mainly because of the title: I'm guessing it has all of it in it. It's pretty broad; it's more of an intro graduate-level book, and it's not very deep, but you see a lot of the overview, and certainly what we're going to cover will be a subset of what's in there.

So the slides will be posted on Stellar before lectures, before we start a new chapter, and again after we're done with the chapter, with the annotations and also with typos corrected, like for the exam. There will be some video lectures; the first one will be the one posted on OCW from last year, but all of them will be available on Stellar, modulo technical problems of course, but this is an automated system and hopefully it will work out well
for us. So if you somehow have to miss a lecture, you can always catch up by watching it; you can also play it at speed 0.75 in case I end up speaking too fast, but I think I've managed myself. All right, so that was just the last warning.

So why should you study statistics? Well, if you read the news, you will see a lot of statistics. I mentioned machine learning; it's built on a lot of statistics. If I were to teach this class ten years ago, I would have to explain to you that data collection and making decisions based on data was something that made sense, but now it's almost part of our lives; we're used to this idea that data helps in making decisions. And so people use data to conduct studies. Here I found a bunch of press titles; the keyword I was looking for was "study finds." I did not bother redoing the search this year, so this is all 2016, but the keyword was "study finds." So: a new study finds traffic is bad for your health. You know, we had to wait until 2016 for data to tell us that. And there's a bunch of other, slightly more interesting ones. For example, one that you might find interesting is that a study finds that students benefit from waiting to declare a major. There's a bunch of press titles, one in the MIT News: a study finds brain connections key to reading. And so here we sort of have an idea what
happened, right? Some data was collected, some scientific hypothesis was formulated, and then the data was there to try to prove or disprove this scientific hypothesis. That's the usual scientific process, and we need to understand how this scientific process works, because some of those conclusions might actually be questionable. Who is a hundred percent sure about "study finds that students benefit from waiting to declare a major"? Do you think that you would benefit from waiting to declare a major? Maybe some of you; I mean, I would be skeptical about this. I would say: I don't want to wait to declare a major; maybe this study studied people who were different from me, or maybe this study found that this is beneficial for a majority of people, and I'm not a majority, I'm just one person. There's a bunch of things we need to understand about what those statements actually mean, and we'll see that those are actually not statements about individuals; they're not even statements about the cohort of people they've actually looked at. They're statements about a parameter of a distribution that was
used to model the benefit of waiting. So there are a lot of questions and a lot of layers that come into this, and we're going to want to understand what was going on in there, try to peel it off, and understand what assumptions have been put in there, even though it looks like a totally legit study. Out of those studies, statistically speaking, I think there's going to be one that's wrong; well, maybe not among these few, but if I put up a long list of those, there would be a few that would actually be wrong. If I put up 20, there would very likely be one that's wrong: every time you see 20 studies, one is probably wrong. For studies about drug effects, it would take a list of a hundred for one to be wrong. We'll see what that means and what I mean by that.
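Here is a minimal sketch of where "one in 20 is probably wrong" comes from (my illustration, using a coin-flipping "study" as a stand-in; the 5% significance level is the standard convention we'll formalize later):

```python
import random

random.seed(0)

def null_study(n=1000):
    """One 'study' of a question where nothing is really going on:
    flip n fair coins and test whether the coin looks biased at the 5% level."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    # Normal approximation: under fairness, heads has mean n/2
    # and standard deviation 0.5 * sqrt(n).
    z = (heads - n / 2) / (0.5 * n ** 0.5)
    return abs(z) > 1.96  # True means a (false) discovery gets reported

# 20 independent null studies, each run at the 5% level, produce on
# average 20 * 0.05 = 1 false discovery.
false_discoveries = sum(null_study() for _ in range(20))
print(false_discoveries)
```

Run it with different seeds and you'll typically see 0, 1, or 2 "discoveries" even though nothing is ever there.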
OK, so of course it's not only studies that make discoveries that get press titles; the press also covers things that make no sense. I love this first one, the salmon experiment. A graduate student actually came to a neuroscience poster session, pulled out this poster, and explained the scientific experiment he was conducting, which consisted in taking a previously frozen Atlantic salmon, putting it in an MRI, showing it violent images, and recording its brain activity. And he was able to discover a few voxels that were activated by those violent images. Can somebody tell me what happened here? Was the salmon responding to the violent imagery? No, this is just a statistical fluke; that's just randomness at play. There are so many voxels being recorded, and there are so many fluctuations (there's always a little bit of noise when you run those machines), that some of them just got lit up by chance. And so we need to understand how to correct for that. In this particular instance, we need tools that tell us that finding three activated voxels, out of the many voxels you can find in a salmon's brain, is just too small a number.
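To see the scale of the problem, here's a back-of-the-envelope calculation (the voxel count and per-voxel level below are made-up numbers for illustration, not the ones from the actual salmon study):

```python
# Hypothetical numbers, for illustration only.
n_voxels = 8000   # number of brain locations we test
alpha = 0.001     # per-voxel false-positive rate

# With no correction, the expected number of voxels that light up
# purely by chance is n_voxels * alpha.
expected_false_positives = n_voxels * alpha
print(expected_false_positives)  # 8.0: chance alone lights up ~8 voxels

# Bonferroni correction: to keep the chance of even ONE false voxel
# below alpha overall, test each voxel at the much stricter level alpha / n_voxels.
per_voxel_threshold = alpha / n_voxels
print(per_voxel_threshold)
```

So a handful of lit-up voxels is exactly what pure noise predicts, which is why the dead salmon "responded."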
Maybe we need to find a clump of 20 of them, for example. All right, so we're going to have mathematical tools that help us find those particular numbers.

I don't know if you saw the segment by John Oliver about p-hacking. He has a full-length segment on this, explaining the sociological question here: there's a huge incentive for scientists to publish results. You're not going to say, "you know what, this year I found nothing." And so people are trying to find things, and just by searching, it's as if they were searching through all the voxels in a brain until they find one that lit up by chance. They just run all these studies, and at some point one will come out positive just out of chance, and so we have to be very careful about this. There are much more complicated problems associated with what's called p-hacking, which consists in violating the basic assumptions, in particular looking at the data and then formulating your scientific hypothesis based on the data, and then going back: "it doesn't work, let's just formulate another one." If you start doing this, all bets are off. So actually, the statistical theory that we're going to develop is for a very clean use of data, which might be a little unpleasant: if you've had an army of graduate students collecting genomic data for a year, for example, maybe you don't want to say, "well, I had one hypothesis, it didn't work, let's throw the data in the trash."
And so we need to find ways to be able to deal with this, and there's actually a course being taught, still in its early stages, on something called adaptive data analysis, that will allow you to do these kinds of things. Questions?

OK, so of course statistics is not just for you to be able to read the press; statistics will probably be used in whatever career path you choose for yourself. It started centuries ago in the Netherlands, for hydrology. The Netherlands is basically at sea level, and so they wanted to build dikes. But once you're going to build a dike, you want to make sure it's going to withstand tides and floods, and so in particular they wanted to build dikes that were high enough, but not too high. You could always say, "well, I'm going to build a 500-meter dike and then I'm going to be safe," but you want something that's based on data. So what did they do? Well, they collected data on previous floods, and then they just found a height that was going to cover all those floods. Now if you look at the data they had, it was probably scarce; maybe they had 10 data points. And from those data points, maybe they wanted to interpolate between the points, maybe extrapolate beyond the largest one based on what they'd seen; maybe there's a chance of seeing something even larger than everything they've seen before. And that's exactly the goal of statistical modeling: being
able to extrapolate beyond the data that you have, guessing how what you have not seen yet might behave. The same happens when you buy insurance for your car or your apartment or your phone: there's a premium that you have to pay, and this premium has been determined based on how much, in expectation, you're going to cost the insurance company. The insurer says: OK, this person has a ten percent chance of breaking their iPhone, the phone costs that much to repair, so I'm going to charge them that much, and then I'm going to add an extra dollar for my time.
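That expected-cost calculation is simple enough to write down (the 10% chance and the repair cost below are the made-up numbers from the example):

```python
# Illustrative numbers from the iPhone example (made up, of course).
p_break = 0.10       # estimated probability the customer breaks the phone
repair_cost = 300.0  # hypothetical repair cost, in dollars
margin = 1.0         # the insurer's "extra dollar for my time"

# Fair premium = expected payout + margin.
premium = p_break * repair_cost + margin
print(premium)  # 31.0
```

The whole statistical problem, of course, is estimating `p_break` from data in the first place.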
OK, so that's basically how those things are determined, and this is using statistics; this is basically where statistics has historically been used the most. I was personally trained as an actuary, and that means being a statistician in an insurance company.

Clinical trials: this is also one of the earliest success stories of statistics, and the practice has now spread. Every time a new drug is approved for market by the FDA, it requires a very strict regimen of testing with data: a control group and a treatment group, how many people you need in the trial, and what kind of significance you need for those things. In particular, those questions look like this (nowadays it's more like five thousand patients, depending on what kind of drug it is, but let's say): out of 100 patients, 56 were cured and 44 showed no improvement. Does the FDA consider that a good number? Do they have a table for how many patients must be cured? Is there a placebo effect, so that I need a control group of people who are actually getting a placebo? It's not clear.
So there are a lot of things to put into place, and a lot of floating parameters, and hopefully we're going to be able to use statistical modeling to shrink things down to a small number of parameters, to be able to ask very simple questions. "Is the drug effective?" is not a mathematical question, but "is p larger than 0.5?" is a mathematical question, and that's essentially what we're going to be doing: we're going to reduce "is the drug effective?" to "is a parameter larger than 0.5?"
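As a preview of how "is p larger than 0.5?" gets answered, here is a minimal sketch using the normal approximation we'll develop later in the course (this is my illustration of the idea, not the FDA's actual procedure):

```python
import math

# The toy numbers from the example: 56 cured out of 100 patients.
n, cured = 100, 56
p_hat = cured / n

# Under the null hypothesis p = 0.5, the sample proportion has
# standard deviation sqrt(0.5 * 0.5 / n).
se = math.sqrt(0.5 * 0.5 / n)
z = (p_hat - 0.5) / se
print(round(z, 2))  # 1.2

# A common convention: declare significance if z > 1.96 (5% level, two-sided).
# Here z = 1.2 < 1.96, so 56 out of 100 alone is NOT convincing
# evidence that p > 0.5.
print(z > 1.96)  # False
```

So the same 56/100 that "clearly" favors the drug turns out to be perfectly compatible with a coin flip; that gap between intuition and significance is the whole point.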
Now, of course, genetics is using this too; that's typically the same size of data that you would see for fMRI data. So, given genotypes (this is actually a study that I found): you have about 4,000 cases of Alzheimer's and 8,000 controls, that is, people without Alzheimer's. That's what a control group is: something to make sure you can see the difference with people who are not affected by either the drug or the disease. Is the gene APOE associated with Alzheimer's disease? Everybody can see why this would be an important question: we now have CRISPR, which can target very specific genes, and if we could edit a gene, knock it down, or boost it, maybe we could actually have an impact on the disease. So those are very important questions, because we have the technology to target those things, but we need the answers about what those things are. And there's a bunch of other questions: the minute you talk to biologists, they'll ask, are there regions within the gene, particular SNPs, that I can look at? They're looking at different questions. Now, when you start asking all these questions, you have to be careful, because you're reusing your data again and again, and it might lead you to wrong conclusions; those are all over the place these days, and that's why it goes all the way to John Oliver talking about them. Any questions about those examples? So this is really motivation; we're not going to just take these data sets and look at those cases in detail.
So what is common to all these examples? Why do we have to use statistics for all these things? Well, there's the randomness of the data; there's some effect that we just don't understand. For example, the randomness associated with the lighting up of some voxels, or the fact that, as far as the insurance company is concerned, whether you're going to break your iPhone or not is essentially a coin toss. Possibly it's biased, but it's a coin toss. So all these things, from the perspective of the statistician, are actually random events, and we need to understand this randomness. Is it going to be a lot of randomness, or a little randomness? For example, for the floods: were the floods that I saw consistently almost the same size, up to almost a rounding error, or were they really widespread? All these things we need to understand, so that we know how to build those dikes, or how to make
decisions based on those data; and for that we need to understand this randomness. The questions associated with randomness were actually hidden in the text. We talked about the notion of average: as far as the insurance company is concerned, they want to know, on average, what your chance is of breaking your iPhone, and that came up in this notion of a fair premium. There's also the notion of quantifying chance: maybe we don't want to talk only about the average; maybe we want to cover, say, 99% of the floods, so we need to know the height of a flood that's higher than 99% of the floods. Maybe there's the 1% of them, when doomsday comes, that we're not going to pay for, but that covers most of the floods. And then there are questions of significance. I gave the example a second ago about clinical trials: I gave you some numbers, and clearly the drug cured more people than it did not, but does that mean it's significantly good?
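Back to the 99%-of-floods idea for a second: it is just an empirical quantile of the data. A minimal sketch (the flood heights below are invented numbers, not Dutch records):

```python
import math

# Hypothetical flood heights in meters; the real data would be
# historical flood records.
floods = [1.2, 0.8, 2.1, 1.5, 0.9, 3.0, 1.1, 2.4, 1.7, 1.3]

def empirical_quantile(data, q):
    """Smallest observed value that is >= a fraction q of the sample."""
    s = sorted(data)
    k = math.ceil(q * len(s)) - 1  # index of the q-quantile in the sorted sample
    return s[k]

print(empirical_quantile(floods, 0.50))  # 1.3: half the observed floods are below this
print(empirical_quantile(floods, 0.99))  # 3.0: with only 10 points, the "99% flood"
                                         # is just the largest observation seen so far
```

Note that with 10 data points the 99% quantile is simply the sample maximum, which is exactly the extrapolation problem the Dutch faced: the data alone cannot tell you about floods larger than any you've recorded.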
Or was this just by chance? Maybe these people just recover on their own; it's like curing the common cold, where you feel like, "oh, I got cured," but really you waited five days and then you got better. So there are these notions of significance and of variability. All of these are notions that describe randomness and quantify randomness into simple things. Randomness is a very complicated beast, but we can summarize it into a few things that we understand, just like I am a complicated object: I'm made of molecules, I'm made of genes, I'm made of very complicated things, but I can be summarized by my name, my email address, my height, and my weight, and for most of you that's basically enough. You will recognize me without having to do a biopsy on me every time you see me.

So to understand randomness, you have to go through probability. Probability is the study of randomness; that's what it is, that's the first sentence a lecturer in probability will say. And that's why I need the
prerequisite: this is what we're going to use to describe the randomness, and we'll see in a second how it interacts with statistics. Most of the time throughout your semester on probability, the randomness was very well understood. When you saw a probability problem, here was the chance of this happening, here was the chance of that happening; maybe you had more complicated questions, but you had the basic elements to answer them. For example: the probability that I have HBO is this much, the probability that I watch Game of Thrones is that much, and given that I play basketball, what is the probability that... You had these crazy questions, but you were able to build them up, because all the basic numbers were given to you. Statistics will be about finding those basic numbers. So some examples that you've probably seen: dice, cards, roulette, flipping coins. All of these are things you've seen in a probability class, and the reason is that it's very easy to describe the probability of each outcome: for a die, we know that each face is going to come up with probability one-sixth.
Now, I'm not going to go into the debate of whether this is pure randomness or determinism; I think that, as a model for actual randomness, a die is pretty good, and flipping a coin is a pretty good model too. So the questions that you would see in probability are, for example, the following. I roll one die. Alice gets $1 if the number of dots is at most three; Bob gets $2 if the number of dots is at most two. Do you want to be Alice or Bob, given that your goal is to make money? Yeah, you want to be Bob, right? So let's see why. If you look at the expectation of what Alice makes, call it A: this is $1 with probability one-half (three out of six), so that's one-half. And the expectation of what Bob makes: this is $2 with probability two-sixths, that is, one-third, so that's two-thirds, which is definitely larger than one-half. So Bob's expectation is actually a bit higher.
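The two expectations can be checked directly by enumerating the six equally likely faces (exact fractions, so no rounding):

```python
from fractions import Fraction

faces = range(1, 7)  # the six equally likely outcomes of one die

# Alice gets $1 if the roll is at most 3; Bob gets $2 if the roll is at most 2.
exp_alice = sum(Fraction(1, 6) * (1 if d <= 3 else 0) for d in faces)
exp_bob   = sum(Fraction(1, 6) * (2 if d <= 2 else 0) for d in faces)

print(exp_alice)  # 1/2
print(exp_bob)    # 2/3
```

So Bob's expected winnings, 2/3 of a dollar, beat Alice's 1/2, matching the in-class computation.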
Those are the kinds of questions that you may ask with probability: I described the process to you exactly, you used the fact that the die would show at most three dots with probability one-half; we knew that, and you didn't have to collect data about the die. Same thing here: you roll two dice, you choose a number between 2 and 12, and you win $100 if you chose the sum of the two dice. Which number do you pick? Why seven? Because it's the most likely one, right? Your expected gain here is $100 times the probability that the sum of the two dice, say X plus Y, is equal to your little z, where little z is the number you picked. Seven is the most likely sum, and that's the z that maximizes this function of z. For this you need to study a slightly more complicated function, one that involves two dice, but you can compute the probability that X plus Y equals z for every z between 2 and 12.
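The whole distribution of the sum fits in a few lines, and the argmax at seven drops right out:

```python
from collections import Counter
from fractions import Fraction

# Distribution of the sum of two fair dice: 36 equally likely pairs.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
prob = {z: Fraction(c, 36) for z, c in counts.items()}

best = max(prob, key=lambda z: prob[z])
print(best, prob[best])  # 7 1/6

# Expected winnings if you pick z: 100 * P(X + Y = z); picking 7 maximizes it.
print(100 * prob[best])  # 50/3 dollars, about 16.67
```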
So you know exactly what the probabilities are, and that's how probability starts. That's exactly what I said: you have a very simple process that describes basic events, probability one-sixth for each of them, and then you can build up on that and come to understand the probability of more complicated events. You can throw in some money, you can build functions, you can do very complicated things, building on that.

Now, a statistician would be the person who just arrived on Earth, has never seen a die, and has to figure out that a die comes up with probability 1/6 on each side. The way we do it is just to roll the die many times, get some counts, and try to estimate those probabilities. And maybe this person would come back and say: well, actually, the probability that I get a one is 1/6 plus 0.001, and the probability that I get a two is 1/6 minus 0.005, and there would be some fluctuations around this. It's going to be his role as a statistician to say: listen, this is too complicated a model for this thing; these should all be the same number. Just looking at the data, they should all be the same numbers, and that's part of the modeling: you make some simplifying assumptions that essentially make your answers to your questions more accurate. Now of course, if your model is wrong, if it's not true that all the faces arrive with the same probability, then you have a model error. So we will be making model errors, but that's going to be the price to pay.
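The just-arrived-on-Earth statistician's procedure is easy to simulate (a sketch; the roll count is arbitrary):

```python
import random
from collections import Counter

random.seed(42)

# The statistician who has never seen a die: roll it n times and
# estimate the probability of each face from the counts.
n = 10_000
counts = Counter(random.randint(1, 6) for _ in range(n))
estimates = {face: counts[face] / n for face in range(1, 7)}

for face in range(1, 7):
    # Each estimate fluctuates around 1/6 ~ 0.1667, by roughly 1/sqrt(n).
    print(face, round(estimates[face], 4))
```

The six estimates all hover near 1/6 but none equals it exactly, which is precisely the "1/6 plus 0.001" situation above, and why the modeling step "declare them all equal" is a statistical decision.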
to be able to extract anything from our data okay so for more complicated processes right so of course nobody's gonna waste their time rolling dice I mean I'm sure you might have done this in AP Stat or something but the need is to estimate parameters from data alright so for more complicated things you might want to estimate some you know density parameter on a particular thing set of material and for this maybe you need to be something to it and measure how fast it's coming back and you're gonna have some measurement errors that maybe you
need to do that several times, and you need to have a model for the physical process that's actually going on. Physics is usually a very good way to get models from an engineering perspective, but there are models for sociology, where we have no physical system. God knows how people interact. Maybe I'm going to say that the way I make friends is by first flipping a coin in my pocket, and with probability two-thirds I'm going to make my friend at work, and with probability one-third I'm going to make my friend at soccer.
And once I've decided to make my friend at soccer, then I will face someone who's flipping the same coin with maybe slightly different parameters. And those things actually exist: there are models about how friendships are formed, and the one I described is called a mixed membership model. Those are models that are sort of hypothesized, and they're more reasonable than trying to take into account all the things that made you meet that person at that particular time. OK, so the goal
here: based on data. Now, once we have the model, it's going to be reduced to maybe two, three, four parameters, depending on how complex the model is, and then your goal will be to estimate those parameters. Sometimes the randomness that we have here is real; there's some true randomness. In some surveys, if I pick a random student, as long as I believe that the random number generator picking your random ID is actually random, there is something random about you: the student that I pick at random will be a random person; the person that I call on the phone is the random part. So there's some randomness that I can build into my system by drawing something from a random number generator. A biased coin is a random thing. It's not a very interesting random thing, but it truly is a random thing, if I wash out the fact that it's actually a deterministic mechanism: at a certain accuracy, a certain granularity, it can be thought of as a truly random experiment. Measurement error, for example: if you buy some measurement device or
some optics device, for example, you will have a standard deviation and things like that that come on the side of the box, telling you this device will make some measurement error; usually there are moments, maybe, or things like this, and those are very accurately described by some random phenomenon. But sometimes, I'd say most times, there's no randomness; it's not like breaking your iPhone is a random event. Randomness is a big rug under which we sweep everything we don't
understand, and we just hope that on average we've captured the average effect of what's going on, and that the rest, which might fluctuate to the right or fluctuate to the left, is just randomness that can be averaged out. Of course, this is where the leap of faith is: we do not know whether we're correct in doing this. Maybe we make some huge systematic biases by doing this; maybe we forget a very important component. For example, if I have, I
don't know, let's think of something: a drug for breast cancer. If I throw out the fact that my patient is either a man or a woman, I'm going to have some serious model biases. So if I say, I'm going to collect some random patients and start doing this, there's some information that I clearly need to build into my model. The model should be complicated enough, but not too complicated: it should take into account the things that will systematically be important. OK, so in particular, the
simple rule of thumb is: when you have a complicated process, you can think of it as a simple process plus some random noise. Again, the random noise is everything you don't understand about the complicated process, and the simple process is everything you actually do understand. Good modeling, and this is not what we'll be seeing in this class, consists in choosing plausible simple models, and this requires a tremendous amount of domain knowledge. That's why we're not doing it in this class; this is not something where I can make a blanket statement about making
a good model. If I were a statistician working on a study, I would have to grill the person in front of me, the expert, for two hours: how about this, how about that, how does this work? So it requires understanding a lot of things. There's a famous statistician to whom this sentence is attributed, and it's probably not his, but Tukey said that he loved being a statistician because you get to play in everybody's backyard. You get to go and see
people and understand, at least to a certain extent, what their problems are, enough that you can actually build a reasonable model for what they're doing. So you get to do some sociology, some biology, some engineering, a lot of different things. He was actually at some point predicting the presidential election. So you get to do a lot of different things, but it requires a lot of time to understand what problem you're working on. And if
you have a particular application in mind, you're the best person to actually understand it, so I'm just going to give you the basic tools. OK, so this is, you know, the circle of trust. No, this is really just a simple graphic that tells you what's going on. When you do probability, you're given the truth: somebody tells you which die God is rolling. You know exactly what the parameters of the problem are, and what you're trying to do is describe what the outcomes are going to be. You can say, well,
if you're rolling a fair die, 1/6 of the time in your data you're going to have a one, 1/6 of the time you're going to have a two, and so on. If I told you what the truth is, you could go to a computer and either generate some data, or describe to me some more macro properties of what the data would look like: oh, I would see a bunch of numbers centered around 35 if I drew from a Gaussian distribution centered at 35.
You would know this kind of thing. I would know that if my Gaussian is centered at zero, say with standard deviation 3, it's very unlikely that I will see numbers below minus 10 or above 10; you basically will not see them. So from the truth, from the distribution of a random variable whose mu and sigma are actual numbers, you know what data you're going to be getting. Statistics is about going backwards: it's
saying: if I have some data, what was the truth that generated it? And since there are so many possible truths, modeling says you have to pick one of the simpler possible truths, so that you can average out. Statistics basically means averaging; you're averaging when you do statistics. And averaging means that if I collect all your GPAs, for example, and my model is that the possible GPAs are any possible numbers and anybody can have any possible GPA, this is going to be a serious problem. But if I
can summarize those GPAs into two numbers, say mean and standard deviation, then I have a pretty good description of what is going on, rather than having to predict a full list. If I learn a full list of GPAs and say, well, this was the distribution, it's not going to be of any use for me to predict what the next GPA would be, for some random student walking in or something like this. OK, so just to finish my rant about probability versus statistics, here is a question you would see in a
probability course, and here is a statistical question. The probabilistic question is: a previous study showed that the drug was 80% effective. So you know the effectiveness of the drug; it's given to you, this is how your problem starts. Then we can anticipate that for a study on 100 patients, on average 80 will be cured, and at least 65 will be cured with 99.99% chance. So again, I'm not predicting, on 100 patients, exactly the number of them that are going to be cured and the number of them that
are not, but I'm sort of predicting what things will look like on average, some macro properties of what my data sets will look like. "With 99.99% chance" means that for 99.99% of the data sets you could draw, for 99.99% of the cohorts of 100 patients to whom you administer this drug, you would be able to conclude that at least 65 of them are cured. That's a pretty accurate prediction of what's going to happen.
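This prediction can be checked directly, since under the stated model the number cured is Binomial(100, 0.8); a quick sketch:

```python
from math import comb

# Under the given truth, the number cured out of n = 100 patients
# is Binomial(n = 100, p = 0.8); sum the pmf from 65 to 100.
n, p = 100, 0.8
prob_at_least_65 = sum(
    comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(65, n + 1)
)
print(prob_at_least_65)  # about 0.9999
```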
Statistics is the opposite. It says: well, I just know that 78 out of 100 were cured. I have only one data set; I cannot make predictions for all data sets, but I can go back to the probability, make some inference about what my probabilities would look like, and then make those predictions later on. So when I start with 78 out of 100, my best guess is 78%, and I have to add the
extra error that I may be making by predicting that the drug is not 80% effective but 78% effective, and I need some error bars around this that will hopefully contain 80%. Then, based on those error bars, I'm going to make slightly less precise predictions for the future. OK, so to conclude: what is this course about? It's about understanding the mathematics behind statistical methods. The math is more of a tool; we're not going to go talk about algebraic geometry just for fun in
the middle of it. The point is to justify quantitative statements given some modeling assumptions. In this class we will mostly admit that the modeling assumptions are correct. In this introduction we'll go through them, because it's very easy to forget what assumptions you're actually making, but after that it will be a pretty standard thing. The words you will hear a lot are "iid": independent and identically distributed. That means your data points are all of the same nature, and one data point is not impacting another data point. Hopefully we can describe some interesting mathematics arising
in statistics. If you've taken linear algebra, maybe we can explain why it's useful; if you've done some calculus, maybe we can do some interesting calculus. We'll see how, in this area of applied math, those things answer interesting questions, and basically we'll try to carve out a math toolbox that's useful for statistics. And maybe you can extend it to more sophisticated methods that we do not cover in this class; in particular, in your machine learning classes, hopefully you'll have some statistical intuition about what is going
on. So what is this course not about? It's not about spending a lot of time looking at data sets and trying to understand them with statistical-thinking kinds of questions; that's more of an applied statistics perspective, more modeling. I'm typically going to give you the model: this is the model, and this is how we're going to build an estimator in the framework of this model. So for example, 18.075 is, to a certain extent, a school of statistical thinking and data analysis. I'm
hoping there is some statistical thinking in there too. We will not talk about software implementation; unfortunately, there's just too little time in a semester, and there are other courses that give you an overview. So the main software packages these days: R is the leading software, I'd say, in statistics, both in academia and industry, with lots of packages, probably one new one coming out every day. But there are other things: Python is probably catching up, with all these scikit-learn packages that are coming up; Julia has some statistics in there;
but really, if you were to use just one piece of software, and let's say you love doing this, R would be the one that would prove most useful for you in the future. It does not scale super well to big, high-dimensional data. There's a class in IDSS that actually uses R, Statistics, Computation and Applications; so we will not do statistical computing, graphical representation, or anything like this. I'm also preparing, with Peter Kempthorne, a course called Computational Statistics; it's going to be offered this spring as a special topics course, and
Peter Kempthorne will be teaching it. That class will actually focus on using R, and even beyond that, it's not just going to be about using it, it's going to be about understanding. The same way we're going to see how math helps you do statistics, it will show how math helps you do algorithms for statistics. As I said, we'll talk about the maximum likelihood estimator; we'll need to maximize some function; there are optimization toolboxes to do that, and we'll see how they can be specialized for statistics, and what the principles behind them are. And of course, if you've
taken AP Stats, you probably think that stats is boring to death, because it was just a long laundry list that spent a lot of time on t-tests. I'm pretty sure we're not going to talk about the t-test a lot, maybe once. This is not a matter of saying, you're going to do this, and this is a slight variant of it; we're really going to try to understand what's going on. So admittedly, you have not chosen the simplest way to get into statistics on campus. This is not the easiest class; it might be challenging
at times, but I can promise you that you will maybe suffer, but you will learn something by the time you're out of this class. This will not be a waste of your time, and you will be able to understand, rather than having to remember by heart, how those things actually work. Are there any questions? Anybody wants to go to another stats class on campus? Maybe it's not too late. OK, so let's do some statistics. I see the time, it's 11:56, so we have another 30 minutes. I will typically give
you a three- or four-minute break if you want to stretch, run to the bathroom, or check your texts or Instagram. There was very little content so far; hopefully it was entertaining enough that you don't need the break, but just so you know, in the future you will have a break. OK, so statistics: this is how it starts. I'm French, what can I say, I need to put in some French words. And this is not how office hours are going to go down. So
anybody knows this sculpture? It's "The Kiss," by Rodin. The Thinker is maybe more famous, but this is actually a pretty famous one. But is it really this one, or is it this one? Anybody knows which one it is, this one or this one? What's that? This one? Yeah. Anybody who votes for this one? OK, who votes for that one? Thank you. I love that you do not want to commit yourselves; with no data, making any decision is a total coin toss. It turns out that there is data, and
in the very serious journal Nature, someone published a very serious paper, which actually looks pretty serious if you look at it: "Human behaviour: adult persistence of head-turning asymmetry." There are a lot of fancy words in there. And, I'm not kidding you, this study is about collecting data on people kissing and recording whether they bend their head to the right or to the left; that's all it is. "A neonatal right-side preference makes a surprising romantic reappearance in later life": there's an explanation for it. All right,
so if we follow this Nature paper, which one is the right one, this one or this one? This one, right: head to the right. And to be fair, for this class I thought I would go show you what Google Images does when you google "kissing couple." The first picture is inappropriate, so I cannot show you this, but you can check for yourself. So this person here actually went out in airports and took pictures of strangers kissing, collecting
the data. Can somebody guess why he did not just stay home and collect data from Google Images, by just googling "kissing couples"? What's wrong with that data? I didn't know, actually, before I went on Google Images. What's that? It can be altered? But who would want to do this? There's no particular reason why you would want to flip an image before putting it out there. You might, but maybe then you want to hide the brand of your Gap shirt or something. Yeah? Yeah, that's
very true, and actually it's even worse than that: the people who post pictures are not posting pictures of themselves, they're putting up pictures of people that someone took a picture of, and there's usually a stock watermark on them. These are basically stock images; those are actors, and they've been directed to kiss, and this is not a natural thing to do. And actually, if you go on Google Images, and I encourage you to do this unless you don't want to see inappropriate pictures, I mean, they're mightily inappropriate, basically
you will see that this study is not working at all there. I looked briefly; I didn't actually collect numbers, but I didn't find a particular tendency to bend right. If anything, it was probably the opposite, and it's because those people were directed to do it; they don't actually think about doing it. And also, I guess, because you need to justify writing a Nature paper with more than sitting in front of your computer. So again, this first sentence here, "a neonatal right-side preference": is there a right-side preference? This is
not a mathematical question, but we can start saying "let p be...," put in some variables, and ask questions about those variables. Now, x is actually not a letter we use very much in statistics for parameters, but p is one, p for parameter. So you're going to take your parameter of interest p, and here it's going to be the proportion of couples, among all couples, turning their head to the right. If you bring in statistical thinking, there would be a question about what population this would actually be representative of. So
usually this is called a... sorry, I should not forget this word, it's important for you... OK, so I forget this word. If you look at this proportion, the couples that are in this study might be representative only of couples in airports. Maybe they actually put on this show for the other passengers, who knows, just like the people in Google Images are doing. So maybe you want to restrict the population, but of course,
clearly, if it's appearing in Nature, it should not be only about couples in airports; it's supposedly representative of all couples in the world. So here let's just keep it vague, but you need to keep in mind what population this is actually making a statement about. So you have this full population of people in the world, all the couples, and this person went ahead and collected data about a bunch of them. And we know that in this population there's basically a proportion of them, p, the
proportion of them bending their head to the right. So everybody on this side is bending their head to the right, and hopefully we can actually sample this thing uniformly; that's basically the process that's going on. So this is the statistical experiment: we're going to observe n kissing couples. Here we're going to use as many variables as we can, so we don't have to stick with numbers; we'll plug in the numbers at the end. And by the way, in statistics, n is the size of your
sample 99.9% of the time. And we collect the value of each outcome. We want numbers, not "right" or "left," so we're going to code them by 0 and 1, pretty naturally. And then we're going to estimate p, which is unknown, p is this area, and we're going to estimate it simply by the proportion of "right," the proportion of crosses that fell on the right side. So in this study, the numbers that were collected were 124 couples, and out of those 124, 80 of
them turned their head to the right. So p hat, as a proportion, how do we compute it? You don't need statistics for that: you take 80 divided by 124, and you find that in this particular study, 64.5% of the couples were bending their head to the right. That's a pretty large number. The question is: if I picked another 124 couples, at different airports, at different times, would I see the same number? Would this number be all over the place, sometimes very close to 120 and sometimes close to 10?
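In code, the estimate, together with a first stab at quantifying its fluctuation (the standard error formula sqrt(p_hat(1-p_hat)/n) is the standard one, shown here only as a preview, not something derived yet at this point):

```python
import math

# Kiss study data: n = 124 couples, 80 turned their head to the right.
n, rights = 124, 80

p_hat = rights / n  # the sample proportion: 80/124, about 0.645
# A standard first quantification of its fluctuation:
se = math.sqrt(p_hat * (1 - p_hat) / n)

print(round(p_hat, 3), round(se, 3))
```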
Is this number actually fluctuating a lot? Hopefully not too much. Now, 64.5% is definitely larger than 50%, so there seems to be this preference, and we're going to have to quantify it: is this number significantly larger than 50%? So if our data, for example, was just three couples: I go out there, I log "right, left, right," and then I
go to... what's the name of the place there? Yeah, I go to Wahlburgers at Logan, and I'm like, OK, I'm done for the day. I collect this data, I go home, and I'm like, well, 66.7% to the right, that's a pretty big number, it's even farther from 50% than this other guy's, so I'm doing even better. But of course you know that this is not right: three couples is definitely not representative. If I had stopped at the first one, I would have had 100%. So the question that statistics is going to help answer is: how large should the sample be? For some reason, I don't know if you guys receive these too, I'm affiliated with the Broad Institute, and since then I receive one email per day that says "sample size determination: how large should your sample be?" I know how large my sample should be; I've taught 18.650 multiple times, so I know. But the question is: is 124 a large enough number or not? Well, the answer, as usual, is: it depends. It
will depend on the true unknown value of p. But from the particular values that we got, 124 couples with 80 turning right, we can actually say something. So we said that 80 out of 124, which is 64.5%, was allowing us to conclude it's larger than 50%. Now, 50% of 124 is 62, so the question is: would I be willing to make this conclusion at 63? Is that a number that would convince you? Who
would be convinced by 63? Who would be convinced by 72? Who would be convinced by 75? Hopefully the number of hands raised grows. Who would be convinced by 80? All right, so those numbers don't come from nowhere. 72 is the number that you would need for most statistical studies: out of 124, you would need to see 72 turning their head right to actually make this conclusion. And then 75... well, we'll see that there are many ways to
come to this conclusion, and as you can see, this was published in Nature with 80, so that was OK; 80 is actually a very large number. 72 corresponds to 95% confidence, 75 to 99% confidence, and 80 to 99.9% confidence. So if you waited for 80, you are a very conservative person; starting at 72, you can start making this conclusion. OK, so to understand this, we need to go into our little mathematical kitchen and do some modeling.
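One way to get a feel for where numbers like 72 and 80 come from: under "no preference" (p = 1/2), compute the exact probability of seeing at least k right-turning couples out of 124. (This is only a sketch, not the method the course formalizes; exact binomial thresholds and the approximate ones quoted here can differ by a count or so.)

```python
from math import comb

# If couples had no side preference (p = 1/2), the number out of
# n = 124 turning right would be Binomial(124, 1/2). The tail
# P(X >= k) measures how surprising a count of at least k would be.
n = 124

def tail(k):
    """Exact P(X >= k) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n

for k in (63, 72, 80):
    print(k, tail(k))
```

A count of 63 is completely unsurprising under p = 1/2, while 72 and 80 push the tail probability down to roughly the 5% and 0.1% levels.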
By modeling, we mean understanding what random process we think this data is generated from. It's going to have some unknown parameters, unlike in probability, but we need to have basically everything written down except for the values of the parameters. When I said a die comes up uniformly with probability 1/6, I should instead say: here are six numbers, and I need to fill in those numbers. OK, so for i equal 1 to n, I'm going to define
R_i to be the indicator, and an indicator is just something that takes value 1 if something is true and 0 if not, that the i-th couple turns their head to the right. So R_i is indexed by i, and it's 1 if the i-th couple turns their head to the right, and 0 if... well, I guess they can probably kiss straight, that would be weird, but they might be able to do it, so let's say 0 if not to the right. OK, then the estimator of p, we said, was p hat:
it was just the ratio of two numbers, but really what it is, is this: I sum the R_i's, and since I only add those that take value 1, this sum is just counting the number of ones, which is another way of saying it's counting the number of couples that are kissing to the right. And here I don't even have to tell you anything about the couples; I only keep track of: first couple is a 0, second couple is a 1, third couple is a
0. The data set, which you can actually find online, is just a sequence of zeros and ones. And clearly, for the question we're asking about this proportion, I don't need to keep track of all this information; all I need to keep track of is the number of zeros and the number of ones. Those are completely interchangeable; there's no time effect in this, the first couple is no different from the fifteenth couple. OK, so we call this R bar n, and that's going to be a very standard notation that
we use; R might be replaced by other letters, like X, so X bar n, Y bar n. This thing means that I sum the R_i's, i from 1 to n, and the bar means the average, so I divide by n. So here this sum was equal to 80 in our example, and n was equal to 124. Now, this is an estimator, and an estimator is different from an estimate. An estimate is a number: my estimate was 64.5%. My estimator is this thing, where
I keep all the variables free, and in particular I keep those variables random, because I'm going to think of a random couple kissing left or right as the outcome of a random process, just like flipping a coin and getting heads or tails. So this thing here is a random variable, and this average of random variables is itself a random variable. So an estimator is a random variable; an estimate is the realization of a random variable, or in other words, the value that you get for this random
variable once you plug in the numbers that you've collected. So I can talk about the accuracy of an estimator. Accuracy means what? Well, what would we want from an estimator? Maybe you would want it not to fluctuate too much: it's a random variable, so I'm talking about the accuracy of a random variable, and maybe I don't want it to be too volatile. I could have one estimator which would be: throw out 122 couples, keep only two, and average those two numbers. That's definitely a worse estimator than keeping all 124.
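The estimator/estimate distinction can be made concrete in code (a sketch; the 0.645 used to generate fresh data below is just an illustrative value):

```python
import random

def r_bar(data):
    """The estimator: the sample mean of 0/1 outcomes."""
    return sum(data) / len(data)

# One observed data set: 80 ones (right) among 124 couples.
observed = [1] * 80 + [0] * 44
print(r_bar(observed))  # the estimate, a fixed number: 80/124

# Applied to fresh random data, the same rule returns a different,
# random value: the estimator viewed as a random variable.
random.seed(0)
fresh = [1 if random.random() < 0.645 else 0 for _ in range(124)]
print(r_bar(fresh))
```

The function `r_bar` is the estimator; the number it returns on the one data set you actually collected is the estimate.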
So I need to find a way to say that, and what I'm going to be able to say is that the number fluctuates: if I take another two couples, I'm probably going to get a completely different number, but if I take another 124 couples two days later, maybe I'm going to get a number that's very close to 64.5%. So that's one thing. The other thing we would like about this estimator: maybe it's not too volatile, but we also want it to be close to the
number that we're looking for. Here's an estimator, a beautiful random variable: 72%. That's an estimator. Go out there, do your favorite study about drug performance, and when they call you up, the MIT student who took statistics, and ask, what is your estimator for the data we've collected, you say: I'm just going to spit out 72%, whatever the data says. That's an estimator. It's the stupidest estimator, but it is an estimator. And this estimator is not volatile at all: every time you run a new study, even if you change fields, it's still going to be 72%. This is beautiful. The problem is that it's probably not very close to the value you're actually trying to estimate. So we need two things. Our estimator is a random variable, so think in terms of densities: we want the density to be pretty narrow, we want this thing to have very little spread, so this is definitely better than this. But we also want the
number that we're interested in, p, to be close to the values that this thing is likely to take; if p is here, this is not very good for us. So those are basically the two things we're going to be looking at: the first one is referred to as variance, the second one is referred to as bias. Those things come up all over statistics. OK, so we need a model. Here's the model that we have for this particular problem: we need to make assumptions on the observations that we see.
So we said we're going to assume that each R_i is a random variable. That's not too much of a leap of faith; we're just sweeping under the rug everything we don't understand about those couples. This first assumption you will forget very soon. The second one is that each of the R_i's is a random variable that takes values 0 and 1. Anybody can suggest a distribution for this random variable? What? Bernoulli, right. And it's actually beautiful: this is where you have
to do the least statistical modeling. A random variable that takes values 0 and 1 is always a Bernoulli; that's the simplest random variable you can ever think of. Any variable that takes only two possible values can be reduced to a Bernoulli. OK, so they're Bernoulli, and here we make the assumption that each takes parameter p. There's an assumption here; anybody can tell me what the assumption is? Yeah: it's the same p. I could have said p_i, but it's p, and that's how I'm going to be able to start doing
some statistics: I can start to pull information across all my R_i's. If I assume each has its own p_i, completely uncoupled from the others, then I'm in trouble; there's nothing I can actually get. And then I'm going to assume that those guys are mutually independent, and most of the time we will just say independent, meaning that it's not like all these couples called each other, and it's actually a flash mob, and they were like, let's all turn our heads to the left; that is
definitely not going to give you a valid conclusion. So again, randomness is a way of modeling lack of information. Here, there may be a way to figure it out: maybe I could have followed all those people and known exactly who they were; maybe I could have looked at pictures of them in the womb and guessed how they turn. Oh, by the way, that's one of the paper's conclusions: the guess is that we turn our head to the right because our head is mostly turned to the right in the womb. So we
don't know what goose goes on in the kissters minds and there's you know physics us ology there's a lot of things that could help us but it's just too complicated to keep track of or too expensive for many instances now again the nicest part of this modeling was the fact that the our eyes take only two values which mean that this conclusion that they were Bernoulli was totally free for us once we know it's random verbal it's a burning now they could have been as we said they could have been a Burnley with parameter P
I for each I I could have put a different parameter but I just don't have enough information right what would I said I would say well the first couple turn to the right P I p1 has to be one that's my best guess right the second couple kids to the left well p2 should be 0 that's my best guess and so the the basically I need to have to be able to average my information and the way to it is by coupling all these guys P is to be the same P for all I okay
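As a tiny illustration of what pooling everything into a single p buys you, here is a minimal Python sketch of the resulting estimator, the sample average of the 0/1 observations. The data below are made up for this example; they are not the actual sample from the study discussed in lecture.

```python
def bernoulli_mle(observations):
    """Plug-in estimator p-hat for Ber(p): the sample average of 0/1 data."""
    if not observations:
        raise ValueError("need at least one observation")
    if any(x not in (0, 1) for x in observations):
        raise ValueError("Bernoulli data must be 0 or 1")
    return sum(observations) / len(observations)

# Hypothetical sample: 1 = couple turned heads to the right, 0 = to the left.
sample = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
p_hat = bernoulli_mle(sample)  # 7/10 = 0.7
```

Note that with a separate p_i per couple the analogous "estimate" would be each single observation itself (the p_1 = 1, p_2 = 0 guesses from above), which is exactly why nothing can be learned without the common-p assumption.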
Does that make sense? What I'm assuming here is that my population is homogeneous. Maybe it's not; maybe I could look at it at a finer grain. But I'm basically making a statement about a population. Maybe you kiss to the left and you don't fit the pattern; I'm not making a statement about any person individually, I'm making a statement about the overall population. Now, independence is probably reasonable, right? You can seriously hope that these couples did not communicate with each other, that nobody texted "we should all turn our heads to the left now," and that there's no external stimulus forcing people to do something different. OK. So, sorry about that; since we have less than ten minutes left, let's do a little bit of exercises, if that's OK with you. I have some exercises so we can see what an exercise is going to look like. This is similar to the exercises you will see, so maybe we should do one together. OK, so now
I have a test, an exam in probability, and 15 students take this test. Hopefully these 15 grades are representative of the grades of a large class. If you took, say, 18.600, that's a large class, with definitely more than 15 students, and maybe just by sampling 15 students at random I want to get an idea of what my grade distribution looks like. OK, I'm grading them. So I'm going to make some modeling assumptions. I have 15 students, and the grades are X_1 through X_15, just like we had R_1, R_2, all the way to R_124; those were my R_i's, and now I have my X_i's. And I'm going to assume that each X_i follows a Gaussian, or normal, distribution with mean mu and variance sigma squared. Now, this is modeling: nobody told me this, there's no physical process that makes it happen. We know there's something called the central limit theorem in the background that says things tend to be Gaussian, but this is really a matter of convenience. Actually, if you think about it, it's terrible, because it puts nonzero probability on negative scores. I'm definitely not going to see a negative score, but it's good enough: the probability is nonzero, but it's probably something like 10 to the minus 12, so I would have to be very unlucky to see a negative score. So here's the list of grades I got: 65, 41, 79, ..., 58, 82, 76, 78, ..., 84, 89, 134, 51, and 72. Those are the scores; there were clearly some bonus points in there. And the question is: what is my estimator for mu? An estimator, again, is something that depends on the random variables. Mu is the expectation, so a good estimator is definitely the average score, just like we had the average of the R_i's. Now the X_i's no longer need to be zeros and ones, so it's not going to boil down to the number of ones divided by the total number. If I'm looking for an estimate, I need to sum those numbers and divide by 15: my estimate is 1 over 15 times the sum, 65 plus 41 and so on, plus 72, and I can do it, and it comes out to 67.5. OK, so this is my estimate. Now suppose I want an estimate of the standard deviation, say an estimate for sigma. You've seen this before, right? What is an estimate for sigma? We'll see methods to derive this, but sigma squared is the variance, the expectation of (X minus the expectation of X) squared, and the problem is that I don't know what those expectations are. So I'm going to do what 99.9% of statistics is about. That's my motto: statistics is about replacing expectations with averages. That's what all of statistics is about; there are 300 pages in a purple book called All of Statistics that tell you this. And then you do something fancy: maybe you minimize something after you replace the expectation, maybe you need to plug in other stuff. But really, every time you see an expectation, you replace it by an average. OK, let's do this. Sigma squared hat will be 1 over n times the sum from i equals 1 to n of (X_i minus mu hat) squared, where mu hat is exactly the average we just computed. There you go, I've replaced my expectation with an average. So that's the golden rule: take your expectation and replace it with an average. Frame it, get a tattoo, I don't care, but that's what it is; if you remember one thing from this class, that's it. Now, you can be fancy: if you look at your calculator, it's going to put an n minus 1 here instead of n, because it wants to be unbiased. Those things are going to come later; for now we stick with this. Then, when I plug in my numbers, I get an estimate for sigma, which is the square root of this variance estimate once I plug in the numbers, and you can check that the number you get is about 18. OK, so those are basic things, and if you've taken any AP stats, this should be completely standard to you. Now, I have another list, but I don't have time to go through it; that doesn't really matter, we'll do it next time. We'll see another list of numbers, and we're going to think about the modeling assumptions. The goal of that exercise is not to compute these things; it's really to think about modeling assumptions: is it reasonable to think that the observations
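The two plug-in computations above can be sketched in a few lines of Python. The grades here are a hypothetical list for illustration only (the exact grades are garbled in the transcript); note the 1/n in the variance, matching the plug-in rule rather than the unbiased n minus 1 version your calculator uses.

```python
from math import sqrt

def plug_in_mean_and_std(xs):
    """Plug-in estimates: mu-hat = sample average, and
    sigma-hat^2 = (1/n) * sum of (x_i - mu-hat)^2   (1/n, not 1/(n-1))."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sqrt(var_hat)

# Hypothetical grades, not the lecture's actual list:
grades = [65, 41, 79, 58, 82, 76, 78, 84, 89, 51, 72]
mu_hat, sigma_hat = plug_in_mean_and_std(grades)
```

If you want the n minus 1 convention instead, the standard library already distinguishes the two: `statistics.pstdev` divides by n, while `statistics.stdev` divides by n minus 1.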
are i.i.d.? Is it reasonable to think they all have the same parameter, that they're independent, and so on? OK. One thing that I wanted to add: probably by tonight, in the spirit of using my iPad and fancy things, I will try to post some videos, in particular for those of you who have never used a statistical table to read off, say, the quantiles of a Gaussian distribution. OK, so there are several of you. This is a simple but boring exercise, so I will just post a video on how to do it, and you will be able to find it on Stellar. It takes five minutes, and then you will know everything there is to know about those tables. That's something you need for the first problem set, by the way. The problem set has 30 exercises in probability; you need to do 15, and you only need to turn in 15. You can turn in all 30 if you want, but you need to know that by the time we hit those
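If you'd rather check your table-reading against software, Python's standard library can produce Gaussian quantiles directly. This is my own aside, not something from the lecture, but `statistics.NormalDist` (Python 3.8 and later) is a handy cross-check:

```python
from statistics import NormalDist

# Standard Gaussian N(0, 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quantile (inverse CDF): the q with P(Z <= q) = 0.975,
# the classic 1.96 you would read off a printed table.
q_975 = std_normal.inv_cdf(0.975)   # approximately 1.96

# Going the other way, the CDF recovers the probability.
p = std_normal.cdf(q_975)           # approximately 0.975
```

Reading it off the printed table is still worth practicing, since that is what the problem sets and exams assume.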
things, you need to know the material. Well, actually, by next week you need to know what's in there. So if you don't have time to do all the homework and then go back to your probability class to figure out how to do the rest, just do 15 easy ones that you can do and turn those in, but do go back to your probability class and make sure you know how to do all of them. Those are pretty basic questions, and those are things I'm not going to slow down on. So you need to remember that the expectation of a product of independent random variables is the product of the expectations, that the expectation of a sum is the sum of the expectations, this kind of thing, which is a little silly but just requires practice. So, you know, have fun; those are simple exercises, and you will have fun remembering your probability class. All right, I'll see you on Tuesday, or Monday.
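The two identities just mentioned can be verified exactly on small discrete distributions. Here is a sketch of my own (the distributions are arbitrary examples, not from the lecture) that enumerates the joint pmf of two independent random variables with exact fractions:

```python
from itertools import product
from fractions import Fraction

# Two independent discrete random variables, given as {value: probability}.
X = {0: Fraction(1, 2), 1: Fraction(1, 2)}   # fair coin
Y = {1: Fraction(1, 3), 2: Fraction(2, 3)}   # takes values 1 or 2

def expectation(pmf):
    """E[Z] = sum over values z of z * P(Z = z)."""
    return sum(v * p for v, p in pmf.items())

# Under independence the joint pmf factorizes, P(X=x, Y=y) = P(X=x) P(Y=y),
# so E[XY] can be computed by enumerating the joint distribution.
e_xy = sum(x * y * px * py
           for (x, px), (y, py) in product(X.items(), Y.items()))
assert e_xy == expectation(X) * expectation(Y)   # E[XY] = E[X] E[Y]

# Linearity needs no independence at all: E[X + Y] = E[X] + E[Y].
e_sum = sum((x + y) * px * py
            for (x, px), (y, py) in product(X.items(), Y.items()))
assert e_sum == expectation(X) + expectation(Y)
```

Here E[X] = 1/2 and E[Y] = 5/3, so both checks come out exact: E[XY] = 5/6 and E[X + Y] = 13/6.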