The Data Detective: Ten Easy Rules to Make Sense of Statistics - Book Summary and Review
The introduction presents many cases in which statistics have been used to tell falsehoods or otherwise obscure the truth, but author Tim Harford offers a beautiful counterpoint:
Yes, it’s easy to lie with statistics, but it’s impossible to tell the truth without them.
As the saying goes, we don’t want to throw the baby out with the bathwater. Just because our data is erroneous, incomplete, or otherwise flawed doesn’t mean it’s useless. It just means we need better data, or better tools to gather it.
Chapter 1 deals with motivated reasoning, and how easy it is for us to throw out any statistic that conflicts with our feelings or worldview. We are all susceptible to this; being educated doesn’t make us immune. Harford goes into great detail about successful people who fell into this trap, particularly a well-known art appraiser who refused to believe a particular painting was a forgery despite ample evidence.
Chapter 2 explains the difference between a bird’s-eye view of the world and a worm’s-eye view. In my opinion, a better analogy is the Google Maps view versus the pedestrian view. Statistics is the Google Maps view; it gives you a nice overview of the landscape, and allows you to summarize large quantities of information at a single glance. However, the pedestrian view is required as well. Anybody who is familiar with a given neighbourhood knows that Google Maps does not always give us the best route.
In this analogy, the pedestrian view is made up of stories; we can use numerical and objective data to understand the landscape, but stories and more detailed pieces of evidence complement the data, giving it proper context and meaning.
Chapter 3 deals with categorization and the definitions behind data. Harford uses an interesting anecdote that summarizes the point: for years, researchers noticed that infant mortality rates were chronically lower in larger cities than in more rural areas. At first glance, this might be attributed to better technology or better doctors, but in reality it is simply a matter of definitions. Doctors in urban hospitals were more likely to categorize a child born at 22 weeks as a miscarriage, whereas rural doctors categorized the same child as an infant.
The lesson is the following: before we dive into rigorous numerical analysis and start proposing solutions, it’s important to understand what exactly has been measured, and how the data was captured.
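To make the definitional point concrete, here is a tiny sketch of my own (not from the book, with made-up numbers) showing how the same set of births produces two very different infant mortality rates depending purely on where the line between miscarriage and live birth is drawn:

```python
# Illustrative only: how a definition changes the measured rate.
births = [
    # (gestational_age_weeks, survived_first_year)
    (40, True), (39, True), (38, True), (36, True), (30, False),
    (22, False), (22, False),   # extremely premature, did not survive
]

def infant_mortality_rate(records, count_22_weeks_as_live_birth):
    live_births = 0
    infant_deaths = 0
    for weeks, survived in records:
        if weeks < 23 and not count_22_weeks_as_live_birth:
            continue  # recorded as a miscarriage: excluded from both counts
        live_births += 1
        if not survived:
            infant_deaths += 1
    return 1000 * infant_deaths / live_births  # deaths per 1,000 live births

print(infant_mortality_rate(births, count_22_weeks_as_live_birth=True))   # ≈ 428.6
print(infant_mortality_rate(births, count_22_weeks_as_live_birth=False))  # 200.0
```

Nothing about the underlying outcomes changes between the two calls; only the recording convention does.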
Chapter 4 builds off of chapter 3, further delving into the importance of context when analyzing data. For example, if I told you that there have been over 3,000 heart attacks in Toronto, would you know how to make sense of that? Not really. First, we would need a reference. Is this an annual figure? Weekly? Moreover, is 3,000 a lot or a little? How does it compare to New York? Singapore?
Again, this might seem obvious when stated like this, but it’s very easy to lose sight of. All data requires context, whether that’s a timescale or some other reference point.
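As a toy illustration (the counts and populations below are invented placeholders, not real figures), even the minimal context of a time period and a population turns a raw count into something comparable:

```python
# Illustrative only: invented counts and populations.
cities = {
    # city: (heart_attacks_per_year, population)
    "Toronto":   (3000, 2_800_000),
    "New York":  (9000, 8_500_000),
    "Singapore": (2500, 5_600_000),
}

for city, (count, population) in cities.items():
    per_100k = 100_000 * count / population   # annual rate per 100,000 people
    print(f"{city}: {count} heart attacks/year, roughly {per_100k:.0f} per 100,000")
```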
Chapter 5 deals with survivorship bias, and how the data we acquire might not be a true representation of the broader landscape. For example, most social media algorithms are designed not to present an objective reality, but to show us the things we are most likely to click on, no matter how skewed or provocative they may be. Even books are susceptible to this; when was the last time we read a book that was not on the New York Times best-seller list? The reality is that our world is highly filtered, and we only get to see a small sliver at any given moment.
Another part of this chapter deals with a favourite topic of mine: the replication crisis in the social sciences. This crisis is a manifestation of survivorship bias playing out in the academic world. A lot of provocative studies published over the last few decades report results that are objectively incredible (and when I say incredible, I mean literally not credible).
The reason they got published is that, through sheer random chance, they produced an outlandish result. For example, if I flip a coin enough times, it will eventually be the case that I get heads 20 times in a row. The replication crisis consists of all those studies that are the equivalent of getting heads 20 times in a row.
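Here is a small simulation of my own (not from the book) that captures this: every study below tests a perfectly fair coin, yet if only the surprising results get written up, the published record still ends up full of “effects”:

```python
import random

random.seed(42)

# Every "study" below tests a perfectly fair coin.
n_studies = 10_000
flips_per_study = 20

surprising = 0
for _ in range(n_studies):
    heads = sum(random.random() < 0.5 for _ in range(flips_per_study))
    if heads >= 15:   # an eye-catching result worth writing up
        surprising += 1

print(f"{surprising} of {n_studies:,} fair-coin studies looked surprising")
# About 2% of studies clear this bar by luck alone; if only those get
# published, the literature fills up with results that won't replicate.
# Twenty heads in a row is far rarer (0.5 ** 20, about 1 in a million),
# but run enough studies worldwide and even that will happen eventually.
```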
Chapter 6 further elaborates on sampling biases. For example, in much of the medical field, a lot of studies do not actually include women, on the grounds that men are biologically simpler to study (they do not have fluctuating hormonal patterns, ovulatory cycles, etc.). As a result, we get a flawed understanding of certain drugs and procedures.
I believe the reverse is true in fields relating to psychology; the majority of study participants are women, and usually in their 20s. As a result, we get an understanding of the human mind based primarily on a small subset of the broader population, and come to erroneous conclusions about the universal nature of thoughts and mindsets.
Chapter 7 is a further discussion of algorithms, and how they increasingly shape our world. The big takeaway is the following: we cannot treat an algorithm like a black box, as if it were some mysterious entity outside our control. We need to systematically understand its various pieces, and regulate it where required.
Chapter 8 deals with the intentional use of bad statistics, employed either for personal gain or by governments to control their populations. Harford has a number of examples of corrupt governments effectively forging numbers in order to paint whatever narrative they desire. Of course, the problem with fake statistics is that eventually they become so detached from reality that nobody trusts the governments that publish them.
Chapter 9 is effectively a summary of the subreddit r/dataisugly. Data visualizations can be powerful, and quite effective at summarizing large amounts of information in a simple, clean output. But it’s important that we don’t take them at face value. Visualizations can also deceive us or manipulate our emotions, especially if they are based on faulty underlying sources.
Chapter 10 tells the story of two men: Irving Fisher and John Maynard Keynes. The latter led a particularly interesting life; during World War One, he apparently initiated something of an art heist in war-torn Paris.
Ultimately, Harford contrasts the lives of these two men for the following reason: while they were both brilliant, Fisher lost his entire fortune in an erroneous attempt to predict the future, while Keynes had a much better go of things. They both made bets on the stock market, and when the crash of 1929 happened, they had very different outcomes.
In order to predict the future, we need to find some base rate to orient us in our initial prediction, and then make adjustments based on additional information. Most importantly, we cannot get too attached to our previous conceptions of the world. Fisher got too attached; Keynes did not — as a result they had wildly different outcomes in their lives.
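As a rough sketch of what “start from a base rate, then adjust” can look like in practice (my own illustration, not Harford’s, with invented numbers), here is a simple Bayesian update expressed in code:

```python
# Every number here is an invented placeholder, purely for illustration.
base_rate = 0.15         # prior probability of the event, e.g. 15% in a typical year
likelihood_ratio = 4.0   # how much more likely the new evidence is when the
                         # event is coming than when it is not (assumed)

prior_odds = base_rate / (1 - base_rate)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)

print(f"prior: {base_rate:.0%}, after the evidence: {posterior:.0%}")
# prior: 15%, after the evidence: 41% -- a meaningful shift, but far from
# certainty, and the base rate is never thrown away.
```

The discipline Harford describes is in letting new information move the number without either ignoring it or abandoning the starting point entirely.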
Chapter 11 presents the final rule, the golden rule: be curious. Instead of viewing statistics as a realm of dry facts and boring details, we need to be genuinely intrigued by all the surprising information around us. As a personal anecdote, I find that Christopher Nolan has always stimulated my love for science more than any science teacher did. I think the same is true of statistics; we need to make the truth fun. Being wrong about the world should fill us with joy, because it means we have so much left to learn.
Overall, good book.