Good morning. Thank you. It's really great to be here. I've really enjoyed the conference, and I'm happy to talk to you today. I wanna say a little bit about how I got here, how we as data scientists got here, and then where do we go from here? What is the future of data science? So this is an opinionated talk, and I wanna give you a chance to talk back to me, not, like, during the talk, but we'll do Q&A and I'll be around afterward. So tell me what you think. I'm gonna start with my
most successful data science project. In 2003, when my wife and I were expecting our first baby, I Googled this question. Are first babies more likely to be late or early or both? And what I got at the time was anecdotal evidence. My cousin had two babies, and they were both early, and therefore, I don't know. What we need here is data, and the National Survey of Family Growth has data. This is a survey conducted by the CDC. It started in 1973. They've done eleven repetitions. They have more than a hundred thousand respondents now. And for
female respondents, they have details on their pregnancies and pregnancy length, including duration in weeks. And so I thought if we're asking this question around week 35, which is when this question becomes a little bit more pointed, we can ask what is the remaining duration of a pregnancy at that time. And you can see here the blue curve is for first babies, and the orange curve is all other babies. And this is the distribution of total duration in weeks. And you can see that, in fact, first babies are a little bit less
likely to be right at the nominal 39 weeks, and a little bit more likely to be a couple of weeks late. So now if you Google this question, the first two hits on the page are me. And if you go farther down the page, you'll get this BBC article on the topic, which cites me. So I am now the world renowned expert on this topic. How did this happen? So why am I saying that this is my most successful project? Because I took a question from not answered, and I put it over into the answered
column of questions. And this was not hard to do. I needed data and it was freely available, open data. Thank you to the NSFG. Simple methods, basic visualization, no fancy statistics, all free open software, and a venue to make it visible, which was my blog. And that's all we need. This is an example of what I hope data science is and can be, which is a set of tools and processes for answering questions, resolving debates, and making better decisions. So that's how I got here. How did we get here and where are we going? And
to think about this, I think the Gartner hype cycle will help. If you are not familiar with it, this is the idea that new things, new technologies go through a sequence of phases with these catchy names. There's a technological trigger that gets things started. There is a peak of inflated expectations, a trough of disillusionment, the slope of enlightenment, and then a plateau of productivity. So what is the technology trigger that created data science? I'm gonna suggest it's ENIAC. This is the first programmable electronic general purpose digital computer in 1945. So how did this create data
science? Here's my thesis and here's the first opinionated part of this talk. Data science exists because statistics as a field missed the boat on computers. And let me explain what I mean by that, especially if we look at statistics education. For most people, if you take the canonical introductory statistics class, you learn the central limit theorem and you learn a few special cases where sampling distributions have a nice mathematical form. And then people graduate from that class, and they encounter data for the first time, and they ask for help. And they go to the Reddit statistics
forum and they ask, which test should I use? This is the question over and over on Reddit statistics, which test should I use? Because their education has given them the idea that all of the problems have been solved. You just have to know which test to use, and if you don't, that's your fault. That's not true. I think the data science approach, the computational approach, says approximately none of the problems have been solved. What we need is a versatile set of tools to compose the solutions that we need. And what is the most versatile tool
that we have? A programmable electronic general purpose digital computer. So I want to demonstrate this point, the difference between mathematical statistics and computational statistics by teaching you all of statistical inference in ten minutes. And we're gonna do this by testing the variability hypothesis. If you're not familiar with this, it is the idea that in many species, males are more variable than females on many dimensions. It is a controversial idea. It has a long and interesting history. So if you're interested, do read about that. As my example, I'm gonna try to use something minimally problematic, and
we'll just talk about the difference in height between men and women. We need data. Again, the CDC is here to help us. The Behavioral Risk Factor Surveillance System, the BRFSS, is another repeated cross sectional sample of adults in the US. They get more than 400,000 respondents during each cycle, and it includes self reported heights and weights. So as a warm up, let's just look at the difference in height between men and women. I'm gonna do this by resampling. This is one of the core tools of computational statistics. It's the idea that you take the sample
that you actually collected, use it to build a model of the population that you're interested in, and then use that model to generate lots and lots of synthesized samples. That's the idea of resampling. More specifically, with bootstrap resampling, we're gonna do that by drawing from the original sample a new sample that is the same size, drawn with replacement. And so now let's see what that looks like. It's a lot shorter to say in code than it was for me to say in words. Here's what it looks like. It's a function that takes a
pandas data frame and uses the sample method to generate a new data frame that is the same length n, sampled with replacement. So that's it. That's the bootstrap resampling process.
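(For reference, a minimal sketch of the function described here, assuming the data is in a pandas DataFrame called df; the code on the slide may differ slightly.)

```python
def resample(df):
    """Bootstrap resampling: draw a new sample of the same size, with replacement."""
    n = len(df)
    return df.sample(n, replace=True)
```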
Now we need a test statistic, and we're gonna start with just height, average height. So here's a function that computes my test stat, which is the difference in height between the two groups. It uses the groupby method to divide the sample by sex and then extract the height column. The second line computes the mean in each group, and then diff computes the difference between those two means. So it's the difference in means between the heights of the two groups. The last thing I'm gonna do is repeat that a thousand times. This list comprehension runs resample a thousand times, generating a thousand new data frames, and for each one it computes that test statistic. Here's what the results look like. This is the distribution of those results: the sampling distribution of the difference in height between men and women. And it looks like men are taller than women on average, so you've all learned something this morning.
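(A rough sketch of the test statistic and the repetition, reusing the resample function above; the column names sex and height are placeholders for the actual BRFSS variables.)

```python
def diff_means(df):
    """Test statistic: difference in mean height between the two groups."""
    heights = df.groupby('sex')['height']
    return heights.mean().diff().iloc[-1]

# run the whole process a thousand times; the result is the sampling
# distribution of the difference in means
sampling_dist = [diff_means(resample(df)) for _ in range(1000)]
```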
We can now use that sampling distribution to answer a couple of questions. So one is, how precise is that estimate? In other words, if we had run this experiment over and over, how much would we expect that estimate to vary from one sample to another? And we can do that by computing a confidence interval that contains ninety five percent of those iterations that we just computed. And from that, we can say men are taller by about fourteen point four three centimeters. I can put a confidence interval on that, and because my sample
size is super big, the confidence interval is super small. That's the first question. The second question is: if we had collected samples like this over and over, is it possible that it could have gone the other way? That in some sample, women would have been taller than men? Not surprisingly, we'll find that that is unlikely. We can compute a p-value. Here's the function that does it. It's based on an assumption that the tail of that sampling distribution behaves like a normal distribution, and in that case, I can use the CDF of the normal distribution to compute the probability that the difference we saw could have fallen on the other side of zero. In this case, the p-value is super small, so we conclude that it is unlikely that we could see a difference that big by chance.
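(A sketch of the confidence interval and p-value steps using numpy and scipy, not necessarily the exact code from the slide.)

```python
import numpy as np
from scipy.stats import norm

# a 95% confidence interval contains the middle 95% of the sampling distribution
ci = np.percentile(sampling_dist, [2.5, 97.5])

# approximate the tail of the sampling distribution with a normal distribution
# and ask how likely it is that the difference falls on the other side of zero
# (this assumes the observed difference is positive)
mu, sigma = np.mean(sampling_dist), np.std(sampling_dist)
p_value = norm.cdf(0, mu, sigma)
```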
Now if you know some mathematical statistics, and in this group I suspect many of you do, you probably looked at this example and said, wait a minute. You just did a difference in means. That's a t-test. We could have done that in one line of code. I could have even looked it up in a table.
Why are we doing all this? Well, we didn't really care about the difference in means. We really care about the difference in standard deviation or some other measure of variability. So, okay, how are we gonna do that? With mathematical statistics, we now have to go back to the drawing board and start over. Okay? It's not a t-test. What's the test for comparing the difference in standard deviations? I don't know. We gotta go back to Reddit and find out. Whereas with computational statistics, I can take that example that I just showed you, seven lines of code,
and I'm gonna make it do the difference in standard deviations like that. It was kind of subtle. Let me make that a little clearer. Here's the difference in means. Here's the difference in standard deviations. This is the nice thing about computational statistics: I can do any test statistic just as easily. Here's what that distribution looks like. This is the sampling distribution of the difference in standard deviations. And once again, it's precise because my sample size is so big, and the p-value is quite small. So this is saying that, at least as measured by standard deviation, men are more variable than women.
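(A sketch of that change: the same function with std in place of mean.)

```python
def diff_stds(df):
    """Test statistic: difference in standard deviation of height between the two groups."""
    heights = df.groupby('sex')['height']
    return heights.std().diff().iloc[-1]  # std() instead of mean()

sampling_dist = [diff_stds(resample(df)) for _ in range(1000)]
```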
But maybe standard deviation isn't the right thing to measure, because almost any growth process that makes bigger things probably also has more variability. So we might be more interested in standard deviation relative to the mean, which means we should use the coefficient of variation instead, the ratio of standard deviation to mean. So we could go back to the Reddit statistics forum and ask how to compare coefficients of variation. Or, again, we can go back to this example and change it so that it looks like that. So not too bad. It got a little bit more complicated only because coefficient of variation isn't a built-in function, so I had to implement it. But we're still at a grand total of nine lines of code.
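(A sketch of that version, with a small helper for the coefficient of variation; names are illustrative.)

```python
def coef_var(series):
    """Coefficient of variation: standard deviation relative to the mean."""
    return series.std() / series.mean()

def diff_cvs(df):
    """Test statistic: difference in coefficient of variation between the two groups."""
    heights = df.groupby('sex')['height']
    return heights.agg(coef_var).diff().iloc[-1]

sampling_dist = [diff_cvs(resample(df)) for _ in range(1000)]
```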
And now we can see that the confidence interval is small and the p-value is small. So this result is statistically significant, but the difference is really small and probably has no consequences in real life. And it might actually just be the result of some data errors: there are suspiciously short and tall people in this dataset.
So I guess our answer is that this dataset doesn't provide much support for the variability hypothesis, but that wasn't really the point of this whole thing. The point of this is that mathematical statistics only gets you so far. Really, there is only one way to do hypothesis testing, and that is this framework. This is the "there is only one test" framework. It always starts with a dataset, and you have to choose a test statistic that quantifies the size of the effect that you are interested in. You use the data to create a model of the population. We did that by bootstrap resampling, but there are other ways to do it. You use that model to generate lots of simulated datasets. For each dataset, you compute the same test statistic and collect all of the results; that's your sampling distribution, and from that, you can compute a confidence interval or a p-value. That is it. That is all of statistical inference in ten minutes and nine lines of code.
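(One way to sketch the whole recipe as a single reusable function; this is an illustration, not code from the talk.)

```python
def sampling_distribution(df, test_stat, iters=1000):
    """The 'there is only one test' recipe, using bootstrap resampling as the
    model of the population: simulate many datasets and compute the same
    test statistic on each one."""
    return [test_stat(resample(df)) for _ in range(iters)]

# any test statistic plugs in the same way, for example:
# sampling_dist = sampling_distribution(df, diff_means)
# sampling_dist = sampling_distribution(df, diff_stds)
```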
So here's the analogy I like, if you have not watched this channel where they build Lego mechanisms. Mathematical statistics is like the level one car
that hits the first barrier and it gets stuck and it can't get over it. And computational statistics is like the last car that they build because it's got six wheels and a propeller and it has stilts and it can climb over anything. So there's my claim. The technological trigger for data science was computation because statistics as a field missed the boat on computation. And with apologies to people in the room if you identify as a statistician, I'm really talking about the field. The field missed the boat and it missed a lot of other boats. So
I think that's why data science came to be. And now the question is, have we hit the peak of expectations? Are we in the trough of disillusionment? Where are we here? So for the peak of expectations, I wanna nominate 2009 to 2012. In 2009, we got the Netflix prize. And then in 2010, the first iteration of Kaggle, which turned machine learning into a spectator sport. In 2011, we got Moneyball, which turned spectator sports into machine learning. Coursera took machine learning and made it available to everybody. And then in 2013, everybody who graduated got the sexiest
job of the 21st century, which I'm gonna nominate as the peak of inflated expectations. And when we look at it now, it's kind of like the peak of cringe. So if that was the peak, what about the trough of disillusionment? I'm gonna nominate 2016 to 2018 as the trough. In 2016, Cathy O'Neil introduced us all to the dark side of big data. ProPublica got us thinking about algorithmic fairness and criminal justice. And famously, data scientists failed to predict the outcome of the 2016 election. Now that's not exactly correct, but that is how people remember it. In 2017,
from former executives at Facebook, we got some of the first evidence of the harms of social media. We got alarms about facial recognition, fairness, and race. We got warnings about algorithms and fairness and gender. We learned that Cambridge Analytica had been misusing Facebook data. We learned that Google had been misusing medical data. And just to cap it all off, we learned that machine learning is ruining spectator sports. So that was not great. And I think that's how we ended up where we are right now in the trough of disillusionment. So where do we go from
here? How do we climb the slope of enlightenment, and what does the plateau of productivity look like? So I wanna suggest two things that make me optimistic about where we go from here, one that makes me nervous, and then I have a plea for what you can all do to help us get up that slope. So here's one thing that makes me feel good. In 1985, we had some of the first data journalism. The USA Today started publishing infographics, including this hard hitting piece about which parts of laundry people really dislike doing. In 2015, the
Upshot published this, which is a three-dimensional interactive representation of the yield curve, one of the most notoriously difficult ideas in economics. Now I'm not sure I still understand the yield curve, but I really appreciate their optimism about their audience, which I think is a real thing. I think data journalism is sneakily improving the level of data literacy in general. So I really think that's good. They are also complementing the ability of governments to generate data and share that data. Just as two small examples, the Washington Post now has a better database of gun deaths than
the FBI does. And the New York Times has a better data set of traffic safety than the Department of Transportation does. So these are good things. I think we're improving data literacy and the availability of data is continuing to grow and grow. Those are all good. Here's the thing I'm worried about, and this comes from the General Social Survey. If you are not familiar with it, it is a really great source of data. They have been doing a repeated cross section of adults in the US since the 1970s. They have a total now of more
than 70,000 respondents. And one of the questions that they have asked during every cycle since 1972 is this one. Taken altogether, how would you say things are these days? Would you say that you are very happy, pretty happy, or not too happy? So I grabbed that data, and I have plotted it over time. This is the fraction of people who said very happy. It was in the mid thirties when the survey started. It was declining slowly and then around 2010 started to decline a bit more quickly. Now if you look at that over time, it
does not look great, but it's a relatively slow decline. Let me show you now what that looks like by year of birth. This is the fraction of people who say that they are very happy, grouped by what year they were born in. So for people born in the 1880s, because this dataset goes back a ways, they were obviously very old when they were interviewed, but they were pretty happy. It was declining for a while and it has been declining very steeply now for people born in the 80s, 90s, and 2000s. This is not
just because we're interviewing them when they are young. And to see that, let me show you this, and this takes some unpacking. This is now grouped by decade of birth. So each line represents one decade of birth, from the 1890s down to the bottom right hand corner. Those are the people who were born in the 2000s, following them over time over the years of the survey. So there's a lot going on. Things kinda go up and down. For many groups, things have been declining recently, but the noteworthy thing in the lower right hand
corner there is that people born in the 90s and 2000s are more unhappy now than any previous generation at any age. So why? Why is that happening? It's complicated. It's always complicated. There are going to be many factors. I wanna talk about just one of them, which is that it seems likely that at least part of this is excessive consumption of relentlessly negative media. Now negativity bias is not new. There has always been negativity bias in the media and fundamentally in our heads. In our psychology, we have a built-in negativity bias, but the pattern of consumption
is a new thing. We have a new word for it, which is doomscrolling. This is spending too much time reading large quantities of media, especially negative media. So from a data science point of view, I think this is a data bias problem. It is a bias in our media diet, which suggests that data could be an antidote or at least a partial solution to the problem. Now the world champion of that idea that data can be an antidote to negativity was Hans Rosling. This is his famous video showing a bubble graph of world development in
the last two hundred years as a function of income and life expectancy. Out of curiosity, how many people have seen this video? Okay. A lot. I thought so. If you have not, I have four minutes and forty seven seconds that you are really going to enjoy. If you haven't read Hans Rosling's book, I recommend this very, very strongly. This is the antidote to a lot of incorrectly negative beliefs that many people have about the world. That's from 2018, so it's pretty current. But if you want the most up to date data, Our World in Data
is the current champion of this idea. They do research and data to make progress against the world's largest problems. And if you explore their site, you will see lots of graphs where good things, like life expectancy, go up and to the right, and bad things, like poverty, go down and to the right. So check that out. I think you'll enjoy it. And what you'll find is that on long term trends, almost everything is getting better. Now people don't know this, and I wanna do an experiment. This is from Gapminder. This is the same group that Hans Rosling
started. They have a lot of tests that you can take to see if your perception of the world is accurate, and I'm gonna give you one of them. So how did the number of deaths per year from natural disasters change over the last hundred years? Did it more than double, remain about the same, or decrease to less than half? Take a second. More than doubled, raise your hand. Stayed about the same. Decreased to less than half. Okay. If you said that it decreased to less than half, you are correct. And here's the data from Our World in Data.
This is the raw number of deaths from natural disasters going back to nineteen hundred. And depending on when you start and how you compute the difference, it has declined by a factor of five or ten. Over the same period, world population has gone up by about a factor of five. So as a rate, this has gone down substantially. Okay? This is good news. People do not know this good news. When people take these quizzes, they do worse than chance. Eighty four percent of people got this wrong. There were only three choices.
Thirty three percent should have got it right. Sixty six percent should have got it wrong. Eighty four is too many. So people don't know this. I gave one of these quizzes to my data science class. I had the students take the quiz and then write some reflections on it. This is the Google survey. I don't expect you to read all of this, but I wanna draw your attention to one word that appeared in almost every response: pessimistic. I was too pessimistic. And one person just wrote, I'm a pessimist. And I wanna tell you what I told them.
You are not a pessimist. You have been misled by a relentlessly negative media diet. On long term trends, almost everything is getting better. I have found that when I say that, people get angry. And my conclusion is we need to say three things at the same time. On long term trends, almost everything is getting better, and we still have serious problems to solve, and our history of solving problems suggests that we can solve these new ones too. And when I say that, yes, I'm including climate change. So where I think we are on climate change,
we have not responded as quickly as we should have. We're still not doing everything we should be doing, but many environmental trends are already going in the right direction, and there are paths between now and the end of the century toward a stable, healthy environment and a good quality of life for everybody on the planet. So we are not doomed. But a lot of people think we are. If you think we're doomed or someone in your life thinks we're doomed, please give them this book. I don't have a chance here to make a complete case
about where I think we are here. But Hannah Ritchie, who is, not coincidentally, a researcher at Our World in Data, this book I think makes the case really well. Get a copy, read it out loud to somebody under thirty. So negativity bias, I think, is a serious threat to our well-being because it undermines our ability to address the important problems that we need to address. So finally, here is my plea. Here is what I would like you all to do to help us get out of this trough of disillusionment and onto
the plateau of productivity. First, stop telling kids they'll die from climate change. Second, stop reading the news. This is the other book I wanna recommend. The title says it all, but if you have trouble mustering the strength to stop reading the news, this book might help. Last thing, use data to understand the world better so that we know how to make the world better. There's a lot of data out there, and I'll tell you a secret. A lot of the organizations that generate open datasets, and especially government agencies, have enough resources to make the dataset
and publish it, and often not a lot of resources to do much with it. So when you're the first person to go into one of these datasets and really look around and explore, you will inevitably find interesting things, like first babies are more likely to be late or things that are more important than that. So take advantage of the data that's out there. And this group in particular has all the tools and processes that you need to answer questions, resolve debates, and make better decisions. It's the tools of open science. And let me turn that
sideways so we can read it. Open data, open source software, open methodology, open peer review, open access, and especially from my point of view, open educational resources. So I wanna end on that because it's one of the things that I work on. I'll give you a few links and credit for some of the resources that I've used. If you are interested in that first baby example, that is from Think Stats. And the third edition is what I'm working on right now. It's available free at that link. I'll also give you another chance to get these
slides so that you can get those links. The data there, as I said, is from the National Survey of Family Growth run by the CDC. The resampling example, if you're interested in that, that is from Elements of Data Science, also available free. And all of that data was from the BRFSS, also from the CDC. Finally, that happiness example is based on chapter ten of Probably Overthinking It, and that data is from the General Social Survey. Last thing, all of the notebooks are available. Those links will take you to the notebook running on Colab. So if
you wanna replicate anything I did or run your own experiments, you can do that. And as of Monday, I learned enough R to translate my Python examples into R. So if you wanna see the variability example in R, it's there. This is, I will admit, the first R program that I wrote, beyond hello world, and I am sharing it with all of you to do a code review. So I would like to hear what you think about my first attempt. So I feel like I've had my chance to talk. Here are five ways that you
can get in touch with me if you wanna talk back, but we also have a chance to take some questions. And again, you can grab the slides using either that link or the QR code. So, questions. Thank you. Thanks for a really inspiring talk, Allen. So again, if you've got questions you can ask them on Sli.do, linked from the app, and we've got a few coming in already. Okay. A lot coming in already. Okay. We're kind of starting at the beginning of the talk. So, two questions, kind of related,
about bootstrapping. First of all, with survey data, it's really important to account for the sampling design and the weights. And what about the case where you're looking at smaller datasets with heavier tails? Like, where do you kinda draw the line between computational statistics and mathematical theory? Yes. All good questions. So the first one, about the sampling design: I actually cut that from the slide, but it is in the notebook. You'll see the BRFSS uses stratified sampling, so I had to correct for that by using the sampling weights as part of the bootstrap process. And all I had to do was change the sample method to take an additional parameter, which is the sampling weights. That's one of the nice things about the bootstrap: that kind of reweighting is super easy to do.
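(A sketch of that weighted version; the weight column name here is a placeholder, not the actual BRFSS variable name.)

```python
def resample_weighted(df, weight_col='weight'):
    """Bootstrap resample that respects the survey design: rows are drawn
    with probability proportional to their sampling weight."""
    n = len(df)
    return df.sample(n, replace=True, weights=df[weight_col])
```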
With small samples, you do run into the Achilles heel of bootstrapping, which is data diversity. If you have a small sample and you draw new samples from it, the new samples will just all look the same, and you won't see enough diversity in the results. And in that case, one option is to switch to a parametric bootstrap, where instead of just doing sampling with replacement, you take your data and you build a model of the population, but the model now is a parametric model, so it has some smoothness to it. And now when you draw samples, those samples will be continuous and diverse in all the ways that we want.
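(A minimal sketch of a parametric bootstrap for a single column, assuming a normal model; other distribution families could be used depending on the data.)

```python
import numpy as np

def resample_parametric(series):
    """Parametric bootstrap: fit a normal model to the data, then draw a
    new sample of the same size from the fitted model."""
    mu, sigma = series.mean(), series.std()
    return np.random.normal(mu, sigma, size=len(series))
```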
So where do I draw the line? I really don't. I use computational stats for everything. One concept I learned about recently, and maybe you've heard of it too, is this idea of the Bitter Lesson. Have you heard of it? It's an idea
from computer science and kinda AI, and I'll probably paraphrase it incorrectly, but it says that simple methods that use computation end up winning over time. And it felt to me like that's what you're really saying about the bootstrap. Like, it's a simple method. It requires computation, but over time, computation continues to get better and better, and we don't have to worry about deriving all of these special cases. Yeah. Really interesting. Would you say the name again? The Bitter Lesson. Ah, okay. Awesome.
I'll send you a link. Please. It's a cool idea. Okay. Next question. I think this question possibly comes from a pessimist. I'm gonna say that. I can hear you. How do you suggest we calibrate our worldview when you account for Simpson's Paradox, where all of these large trends are positive, but if you break them down into smaller categories, they all go the opposite way? I think that's possible in theory, but I don't think I've seen examples of it. You know, in one of my examples, looking at poverty globally, the global trend was, you know,
declining poverty, and then when you divide it up by region, all of the regions are also declining. There will always be exceptions. There will be short term reversals, and there will be specific places and times where things are not necessarily positive. But the long term trends are positive. I have not seen examples of Simpson's Paradox there. Thank you. What was the hardest part of writing your first R script? Oh, you know what the answer is gonna be. It was fixing my environment. I had an old broken installation of R, and it took me an hour
and a half to get it fixed. It's not Python. Most people don't have it. Oh, I hear you. I hear the same complaint about Python environments all the time. Sure. What do you think that statistics educators need to do to get on the boat, and what do data science educators need to do to make sure their students stay statistically responsible and literate? Yeah. What do educators need to do to get on the boat? And sorry, the second part? So statistics educators, how do we get them moving more toward the computational side? Yeah. And I think while
still helping data science educators to kinda cultivate that skepticism that statistics is so good at. Right. Well, so one piece of this: a lot of universities are creating data science programs, and they're going about it in a way that is how universities work, which is that they get a bunch of computer scientists and a bunch of statisticians and try to make them play together. And the assumption is that that is somehow going to be data science. And I think that doesn't work. I think data science is a different thing from both of those, and
starting with people who are professionally attached to those identities is probably not the right starting place. So I think if you're gonna do data science education, you need data scientists. Interesting. Thank you. Okay. This is an interesting one. What do you sacrifice in your life to be so prolific in your output? I was amazed by all the books you've done. Oh, thank you. I don't feel like it's a sacrifice at all. I really enjoy the work. I look forward to it. I make consistent progress, so I don't feel like I have
to work in, you know, sudden bursts of staying up all night. I don't do all nighters, that kind of thing. But I don't think I'm neglecting other parts of my life. Is your wife in the audience? My wife is in fact here, so we can ask. I will also say, as someone who's written quite a few books too, I find that, for me personally, it's about writing regularly, just making that steady progress. Yeah. And that's advice I've given to, like, everyone who wants to write a book. And so far, like, no
one has successfully followed that advice. It's a very simple idea, but it's very, very hard to do. Yeah. But I think at some point it just becomes your default. And I personally feel, when I'm not writing a book, like something is missing from my life. I don't know if you feel the same. I think for me, it's the most relaxing kind of work that I do. So if I'm stressed about something, you know, taking some time to focus and do some writing is a relief. Hey.
Humans and governments often only seem to act on big challenges like climate change when the challenge seems like a crisis. How do we inspire change without a sense of doom? Yeah. And I do think that's part of how we got to climate doomerism: climate denialism was such a problem for so long that people felt like they had to turn up the heat, metaphorically, and, you know, insist more and more strongly, this is a crisis, we need to react now, and we've almost gone too far. So, yeah, how do you get people engaged
without invoking that overshoot? That is a really hard question. I don't know. To me, I think about the year 2000 crisis, where everyone was like, oh my god, it's gonna be terrible, and then nothing happened. And so people are like, oh, it was overblown. But you can't tell whether or not it was overblown, because people did all this work to make sure nothing happened. It's just this problem that if you fix a problem before people experience it, they don't appreciate you as much. This
is a problem in software development as well. Right? If you remove people from having to feel pain, they don't appreciate you. You have to, like, make them feel pain and then take it away, and then they appreciate you. This is not a technique I use. I'm just saying. That's a good analogy. But it also reminds me, we do have a long history of actually solving problems. And so, you know, a relatively current one is the hole in the ozone layer. In the 1980s, that was the big crisis. And when
was the last time you even heard anybody say ozone layer? And the reason is that we have largely solved that problem. We banned most of the chemicals that were causing the damage. The ozone hole has been closing for decades and will probably completely close by the end of the century. So we actually got together. The Montreal Protocol, I think, was the international treaty. I think about a hundred and ninety countries signed on to it and got it done. Yeah. Again, it's so easy to miss, like, the pain that's gone away.
Yeah. How do you reconcile a desire to be an informed and engaged citizen with advice that says to read less news, to be happy? So I've started doing this within the last year. I've consumed almost no news media. I do read The Economist, so that's an exception. And what I found is that if something is important enough that I really need to know it, I will hear about it. Yeah. Yes. Here's a good one. Should we distinguish data science from science in general? Is there a meaningful distinction? Yeah. So, you know, is data science just
doing science? Because when you're doing science, you're almost always working with data, and you are using that data to answer questions, resolve debates, and make better decisions. So, yeah, I think there's an argument there. That last part, though, might be a difference, which is that science is usually about creating knowledge and not necessarily designing things or making decisions. So maybe what that means is that data science is actually broader than science because it also includes those elements. Yep. Yep. Makes sense. How do you reconcile optimism against an overwhelming number of climate scientists who feel we're
approaching an irreversible ecological disaster, and whose historical recommendations have had a bias towards being overly optimistic already? Right. So, yes, here is the one way that I could be wrong. When we look at long term trends and we say, hey, things are going well if we keep doing what we're doing, that's because those long term trends didn't happen automatically. We did things that made them happen. So if we keep doing what we're doing and keep solving problems as we confront them, we should continue to make progress. The exception to that would be something like a tipping point
where we really do cross a line that we can't go back over. And there are some scenarios in climate predictions where that is a possibility. My understanding of current climate science is that we don't have strong evidence that any of those are imminent or high probability events. So I am still going on the assumption that if we keep doing what we're doing, things will continue to get better, but something like that could make me wrong. So you provided us a hypothesis that doomscrolling is the leading cause of lower reported happiness in
younger people. Do you have any data to support that hypothesis? So, yes, I claimed that consumption of negative media is one of the causes of unhappiness. I have read articles to support that. There are books like The Anxious Generation and a few others. I'll get you some more references. So I haven't worked with that data myself, but I have read enough books to make me think that it is a plausible hypothesis, as at least part of the problem. I'm not trying to assert that I know for sure that it is all of the problem, not that. This
next question is kinda related, but, like, if you're gonna ignore the news, how do you avoid getting stuck in a trough of toxic positivity where you're just gonna believe everything is fine? Is there a trough of toxic positivity? I think the real qualifier is just the assumption that just because things are getting better, that doesn't mean it will continue to happen automatically. We still need to do the things that make that happen. What we're depending on is the ongoing effort of everybody working on these problems and the innovation that will create
the new things that will continue to make progress. So I think that would be not the unwarranted optimism that just assumes that everything magically is gonna get better. So not that. And I think, you know, there are also other ways of learning about the world than following the news too closely. You can read books. You can read articles. Like, the news cycle is very rapid and designed to keep you addicted. Yeah. And there are ways to kind of try and escape that. Yeah. Longer term media. And, you know, as
a book author, I recommend that you read books. Yeah. How do you account for the shifting baseline syndrome regarding the definition of quality of life? I think this is a bit related to the, like, hedonic treadmill as well. Like, also, as things get better over time, you expect them to be better. And then when they get worse, even if they were better than they were before, it still feels bad. Yes. Well, so the shifting definition of poverty is a good one, and it's, in some sense, a sign of progress. You know, at some point
in time, you pick a threshold, and then you target that. And if things go well, eventually that target becomes obsolete and you aim for a higher threshold. So that's good. That's a sign of progress. It just means you have to be consistent in your definitions when you're doing the data analysis. But there was a second part there that was a little more challenging. Remind me. I think that question got wiped off already. So I've already forgotten it too. Yep. And that is why you should not ask two part questions. Yeah. What about negativity
bias? Do you think it's truly the fault of the media, or of readers' behavior in interacting with negative news more? It's definitely both. So negativity bias is one of our cognitive errors. It seems to be just built into human beings. And in some sense, it makes sense that we should focus a little bit more on things that are broken and need to be fixed. So it's potentially adaptive in that sense. But now if you put that brain into an environment where it gets that constant feedback, I think that's
where the hazard comes from. Because the, you know, the producers of news have figured out that people will pay more attention to negative news than positive news, so that's what they feed us. And I think that's the feedback cycle that is the source of the problem. Well, I will say that I think, you know, data science is partly to blame there, because now news organizations are very carefully tracking, like, what are people reading and responding to. Yeah. And they wanna serve their customers. They wanna, you know, they wanna
make money. They wanna give the people what they want. And that can easily lend itself to this vicious cycle where they're like, well, what are people reading the most? And if you're not careful, that's, like, listicles and other things like, yeah, the secrets that so-and-so doesn't want you to know. Right? Like, there is this balance between, like, how do you survive as an organization in the short term? Like, how do you make money while still achieving your mission in the long term? And, like, not being too distracted
by, like, what the data is telling you. I think there's a sort of really interesting tension there. Yeah. That you wanna, like, pay attention to the data and listen to the data, but you can't just follow it. You can't just A/B test everything and then do whatever maximizes your revenue or whatever you wanna maximize. Yeah. Yeah. It's a challenge. How effective is data in combating negativity bias in a media environment where people can just curate their feeds to tell them what they wanna hear? Do you find people receptive to this data that
goes against everything social media is selling them? No. Yeah. No. I mentioned this: very often when I say, you know, on long term trends, almost everything is getting better, people really do react angrily. And part of it is, you know, cognitive dissonance, which is, you know, they've heard the opposite over and over, and nobody changes their mind instantly just because I tell them something. And it can be mistaken for unwarranted optimism. It sounds like I'm saying, oh, everything's fine. Don't worry. We don't have to do anything. So that's what I said in
the talk. I find I need to say three things at the same time. Yeah. And one of them is things are getting better, but also things are still very bad. But also we can continue to make progress on those very bad things. Which is a three part statement. And as we've already seen, people probably forget the second two parts almost immediately. Exactly. And yeah. It's hard to get all of the pieces into your head at the same time to understand these more complex ideas. How do you respond to arguments that,
on average, the world is improving, but at the same time, our existential risk is increasing because the tails of the distribution are increasing? I think those are actually two things: the existential risk and the tails of the distribution. I think the second part is not true. The tails are generally going in good directions. There are examples of greater inequality, but those are more often coming from the top moving very far to the right. Usually the bottom is also improving, but more slowly. So I don't think that's the case, other than, as I said, you know, in short-term,
localized places where things get worse. By and large, on large scales, we are all making progress. And especially if we look at things like extreme poverty and meeting basic human needs, we are doing that much better now than at any previous time in the history of humanity. Yeah. I think you are one hundred percent correct that we under celebrate the advances we have made, but it's a fallacy that just because we have solved problems in the past that we will in the future. And I think you are gaslighting younger generations who face real challenges other
generations have not had to face. Okay. So it's a fallacy in the logical sense, which is just because something happened in the past, I can't prove that it will happen in the future. That's the problem of induction. I can't solve the problem of induction, but I can reason from evidence. And I can say probabilistically that processes that have been successful in the past are likely, not provable, you know, not provably, but likely to continue to be successful. And if we look at the last hundred years and look at what has worked, we should do more
of that. And so, no, it's not a mathematical proof. It is an assertion from evidence. Kind of a follow-up to that. How much do you think your outlook is shaped by being a white male and the privilege that we get from that? Yeah. My life is going pretty well. So that is, I think, one of the reasons that people get angry when I say this, because I'm in a very privileged position. And, you know, that's true as a white male. That is true as an American. I am educated and affluent. I have every advantage
in the world. So I do understand why it could be off putting for me to stand up here and say how wonderful everything is. But the reason I go to data is to show that it's really not just for me. It is long term trends. It is every region of the world. It is almost every country in the world. It is getting better for almost everybody almost all the time. You know, not just me. But, yeah. I think, well, many of these questions kind of come down to this: like, how much
do you trust what you believe and go and find data to support that? Mhmm. And how much do you trust the data and then update your beliefs in the face of that? So, how do you think about updating your beliefs? Changing your beliefs is hard. Mhmm. How do you do it? What's some belief that data has changed your mind on? I think, you know, fundamentally, I am curious. When I get a new dataset, that is a good day for me. All of
these surveys that I've been following, I've been following them for years. And when they release new data, that's a week of my life that I'm gonna spend exploring that data. And I'm gonna be surprised a lot, and I enjoy that more when I'm surprised than when I'm not. I think, yeah, there's always confirmation bias, but, you know, I don't feel good when I see data that just tells me what I already knew. I feel excited when I see something new that surprised me, and that's the first thing that I wanna put in my blog
and share with everybody. So I do think fundamental scientific curiosity and wanting to know the truth is at least a partial antidote to confirmation bias. So we have, like, a minute left, and I'm gonna go on the record and say I don't wanna end the last question on a bummer. Mhmm. So I am gonna skip a few of these questions. Oh, no. Sorry. Sorry, people. I wanna end this on a positive note. Let me ask people something. So, I mean, you all just asked me to reflect on
whether my outlook reflects where I'm coming from. So if what I said is off putting and you're sending these questions, I appreciate that. Come talk to me more. Let's talk about this. But also, ask yourself the same question, and I'm genuinely curious to know why, why do you feel that way? And maybe we can talk about that. I don't know. Should we do a show of hands of, like, who is feeling genuinely, generally optimistic about the future versus pessimistic about the future? Or is that
I don't know. Is that like true? I don't want to put people on the spot. High risk. We can have everyone close their eyes so they can't see the results. You're done. This is really high quality data collection. Yeah. Okay. I'm still gonna find you one. Okay, here's a gimme. Let's finish on this. What advice do you have for new data scientists? Well, learn Python. No. Now I do wanna go back to the list that I had there of open science resources. That is your toolkit for doing data science, and
it starts with open data and open software, and that certainly includes both R and Python. Right. Thanks so much, Allen. Thank you very much. Thank you.