Back when we were reading Karl Popper’s Logic of Scientific Discovery and Thomas Kuhn’s Structure of Scientific Revolutions, who would’ve thought that we’d be living through a scientific revolution ourselves?
Scientific revolutions occur on all scales, but here let’s talk about some of the biggies:
1850-1950: Darwinian revolution in biology, changed how we think about human life and its place in the world.
1890-1930: Relativity and quantum revolutions in physics, changed how we think about the universe.
2000-2020: Replication revolution in experimental science, changed our understanding of how we learn about the world.
When it comes to technical difficulty and the sheer importance of the scientific theories being disputed, this recent replication revolution is far more modest than the earlier revolutions in biology and physics. Still, within its narrow parameters, a revolution it is. And, to the extent that the replication revolution affects research in biology, medicine, and nutrition, its real-world implications do go a bit beyond the worlds of science and the news media. The replication revolution has also helped us understand statistics better, and so I think it potentially can have large indirect effects, not just for ESP, beauty and sex ratio, and the like, but for all sorts of problems in science and engineering where statistical data collection and analysis are used, from polling to genomics to risk analysis to education policy.
Revolutions can be wonderful and they can be necessary—just you try to build a transistor using 1880s-style physics, or to make progress in agriculture using the theories of Trofim Lysenko—but the memory of triumphant past revolutions can perhaps create problems in current research. Everybody wants to make a discovery, everybody wants to be a hero. The undeniable revolutionary successes of evolutionary biology have led to a series of hopeless attempted revolutions of the beauty-and-sex-ratio variety.
The problem is what Richard Feynman called cargo-cult science: researchers try to create new discoveries following the template of successes of the past, without recognizing the key roles of strong theory and careful measurement.
We shouldn’t take Kuhn’s writings as gospel, but one thing he wrote about that made sense to me is the idea of a paradigm or way of thinking.
Here I want to talk about something related: the storylines or narratives that run in parallel with the practice of science. These stories are told by journalists, or by scientists themselves; they appear in newspapers and movies and textbooks, and I think it is from these stories that many of our expectations arise about what science is supposed to be.
In this discussion, I’ll set aside, then, stories of science as savior, ensuring clean baths and healthy food for all; or science as Frankenstein, creating atomic bombs, deadly plagues, etc.; or other stories in between. Instead, I’ll focus on the process of science and not its effects on the larger world.
What, then, are the stories of the scientific process?
Narrative #1: Scientist as hero, discovering secrets of nature. The hero might be acting alone, or with a sidekick, or as part of a Mission-Impossible-style team; in any case, it’s all about the discovery. This was the narrative of Freakonomics, it’s the narrative of countless Gladwell articles, and it’s the narrative we were trying to promote in Red State Blue State. The goal of making discoveries is one of the big motivations of doing science in the first place, and the reporting of discovery is a big part of science writing.
But then some scientists push it too far. It’s no surprise that, if scientists are given nearly uniformly positive media coverage, they will start making bigger and bigger claims. It’s gonna keep happening until something stops it. There have been the occasional high-profile cases of scientific fraud, and these can shake public trust in science, but, paradoxically, examples of fraud can give “normal scientists” (to use the Kuhnian term) a false sense of security: Sure, Diederik Stapel was disgraced, but he faked his data. As long as you don’t fake your data (or aren’t in the room where it happens), you’re safe. And I don’t think many scientists are actively faking it.
And then the revolution, which comes in three steps:
1. Failed replications. Researchers who are trying to replicate respected studies—sometimes even trying to replicate their own work—are stunned to find null results.
2. Questionable research practices. Once a finding comes into question, either from a failed replication or on theoretical grounds that the claimed effect seems implausible, you can go back to the original published paper, and often then a lot of problems appear in the measurement, data processing, and data analysis. These problems, if found, were always there, but the original authors and reviewers just didn’t think to look, or didn’t notice the problems because they didn’t know what to look for.
3. Theoretical and statistical analysis. Some unreplicated studies were interesting ideas that happened not to work out. For example, could intervention X really have had large and consistent effects on outcome Y? Maybe so. Before actually gathering the data, who knows? Hence it would be worth studying. Other times, an idea never really had a chance: it’s the kangaroo problem, where the measurements were too noisy to possibly detect the effect being studied. In that beauty-and-sex-ratio study, for example, we calculate that the sample size was about 1/100 of what would be needed to detect anything. This sort of design analysis is mathematically subtle—considering the distribution of the possible results of an experiment is tougher than simply analyzing a dataset once. (A rough sketch of such a calculation follows below.)
Points 1, 2, and 3 reinforce each other. A failed replication is not always so convincing on its own—after all, in the human sciences, no replication is exact, and the question always arises: What about that original, successful study? Once we know about questionable research practices, we can understand how those original researchers could’ve reported a string of statistically significant p-values, even from chance alone. And then the theoretical analysis can give us a clue of what might be learned from future studies. Conversely, even if you have a theoretical analysis that a study is hopeless, along with clear evidence of forking paths and even more serious data problems, it’s still valuable to see the results of an external replication.
And that leads us to . . .
Narrative #2: Science is broken. The story here is that scientists are incentivized to publish, indeed to pile up publications in prestige journals, which in turn are incentivized to get citations and media exposure. Put this together and you get a steady flow of hype, with little motivation to do the careful work of serious research. This narrative is supported by high-profile cases of scientific fraud, but what really made it take off was the realization that top scientific journals were regularly publishing papers that did not replicate, and in many cases these papers had made claims that were pretty ridiculous—not necessarily a priori false, and big news if they were true, but silly on their face, and even harder to take seriously after the failed replications and revelations of questionable research practices.
The “science is broken” story has often been framed as scientists being unethical, but this can be misleading, and I’ve worked hard to separate the issue of poor scientific practice from ethical violations. A study could be dead on arrival, but if the researcher in question doesn’t understand the statistics, then I wouldn’t call the behavior unethical. One reason I prefer the term “forking paths” to “p-hacking” is that, to my ear, “hacking” implies intentionality.
At some point, ethical questions do arise, not so much with the original study as with later efforts to dodge criticism. At some point, ignorance is no excuse. But statistics is hard, and I think we should be able to severely criticize a study without that implying a criticism of the ethics of its authors.
Unfortunately, not everyone takes criticism well, and this has led some of the old guard to argue . . .
Narrative #3: Science is just fine. Hence we get claims such as “The replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%” and “Psychology is not in crisis, contrary to popular rumor. . . . All this is normal science . . . National panels will convene and caution scientists, reviewers, and editors to uphold standards. Graduate training will improve, and researchers will remember their training and learn new standards.”
But this story didn’t fly. There were just too many examples of low-quality work getting the royal treatment from the scientific establishment. The sense that something was rotten had spread beyond academia into the culture at large. Even John Oliver got in a few licks.
Hence the attempt to promote . . .
Narrative #4: Attacking the revolutionaries. This tactic is not new—a few years ago, a Harvard psychology professor made some noise attacking the “replication police” as “shameless little bullies” and “second stringers”, and a Princeton psychology professor wrote about “methodological terrorism”—but from my perspective it ramped up more recently when a leading psychologist lied about me in print, and then when various quotes from this blog were taken out of context to misleadingly imply that critics of unreplicated work in the psychology literature were “rife with vitriol . . . vicious . . . attacks . . . threatening.”
I don’t see Narrative 4 having much success. After all, the controversial scientific claims still aren’t replicating, and more and more people—scientists, journalists, and even (I hope) policymakers—are starting to realize that “p less than 0.05” ain’t all that. You can shoot the messengers all you like; the message still isn’t going anywhere. Also, from a sociology of science perspective, shooting the messenger misses the point: I’m pretty sure that even if Paul Meehl, Deborah Mayo, John Ioannidis, Andrew Gelman, Uri Simonsohn, Anna Dreber, and various other well-known skeptics had never been born, a crisis would still have arisen regarding unreplicated and unreplicable research results that had been published and publicized in prestigious venues. I’d like to believe that our work, and that of others, has helped us better understand the replication crisis, and can help lead us out of it, but the revolution would be just as serious without us. Calling us “terrorists” isn’t helping any.
OK, so where do these four narratives stand now?
Narrative #1: Scientist as hero. This one’s still going strong. Malcolm Gladwell, Freakonomics, that Tesla guy who’s building a rocket to Mars—they’re all going strong. And, don’t get me wrong, I like the scientist-as-hero story. I’m no hero, but I do consider myself a seeker after truth, and I don’t think it’s all hype to say so. Just consider some analogies: Your typical firefighter is no hero but is still an everyday lifesaver. Your typical social worker is no hero but is still helping people improve their lives. Your typical farmer is no hero but is still helping to feed the world. Etc. I’m all for a positive take on science, and on scientists. And, for that matter, Gladwell and the Freakonomics team have done lots of things that I like.
Narrative #2: Science is broken. This one’s not going anywhere either. Recently we’ve had that Pizzagate professor from Cornell in the news, and he’s got so many papers full of errors that the drip-drip-drip on his work could go on forever. Meanwhile, some of the rhetoric has improved but the incentives for scientists and scholarly journals haven’t changed much, so we can expect a steady stream of weak, mockable papers in top journals, enough to continue feeding the junk-science storyline.
As long as there is science, there will be bad science. The problem, at least until recently, is that some of the bad science was getting a lot of promotion from respected scientific societies and from respected news outlets. The success of Narrative 2 may be changing that, which in turn will, I hope, lead to a decline in Narrative 2 itself. To put it in more specific terms, when a paper on “the Bible Code” appears in Statistical Science, an official journal of the Institute of Mathematical Statistics, then, yes, science—or, at least, one small corner of it—is broken. If such papers only appear in junk journals and don’t get serious media coverage, then that’s another story. After all, we wouldn’t say that science is broken just cos astrology exists.
Narratives #3 and 4: Science is just fine, and Attacking the revolutionaries. As noted above, I don’t see narrative 3 holding up. As various areas of science right themselves, they’ll be seen as fine, but I don’t think the earlier excesses will be forgotten. That’s part of the nature of a scientific revolution, that it’s not seen as a mere continuation of what came before. I’m guessing that scientists in the future will look back in wonderment, imagining how researchers could ever have thought it made sense to treat science as a production line in which important discoveries were made by pulling statistically significant p-values out of the froth of noise.
As for the success of Narrative 4, who knows? The purveyors of Narrative 4 may well succeed in their short-term goal of portraying particular scientific disagreements in personal terms, but I can’t see this effort having the effect of restoring confidence in unreplicated experimental claims, or restoring the deference that used to be given to papers published in prestigious journals. To put it another way, consider that one of the slogans of the defenders of the status quo is “Call off the revolutionaries.” In the United States, being a rebel or a revolutionary is typically considered a good thing. If you’re calling the other side “revolutionaries,” you’ve probably already lost.
An alternative history
It all could’ve gone differently. Just as we can imagine alternative streams of history where the South did not fire on Fort Sumter, or where the British decided to let go of India in 1900, we can imagine a world in which the replication revolution in science was no revolution at all, but just a gradual reform: a world in which the beauty-and-sex-ratio researcher, after being informed of his statistical errors in drawing conclusions from what were essentially random numbers, had stepped back and recognized that this particular line of research was a dead end, that he had been, in essence, trying to send himself into orbit using a firecracker obtained from the local Wal-Mart; a world in which the ovulation-and-clothing researchers, after recognizing that their data were so noisy that their results could not be believed, and that they had so many forking paths that their p-values were meaningless, had decided to revamp their research program, improve the quality of their measurements, and move to within-person comparisons; a world in which the celebrated primatologist, after hearing from his research associates that his data codings were questionable, had openly shared his videotapes and fully collaborated with his students and postdocs to consider more general theories of animal behavior; a world in which the ESP researcher, after seeing others point out that forking paths made his p-values uninterpretable, and after seeing yet others fail to replicate his study, had recognized that his research had reached a dead end—no shame in that, we all reach dead ends, and the very best researchers can sometimes spend decades on a dead end; it happens; for that matter, if Andrew Wiles had never reached the end of his particular tunnel and Fermat’s last theorem had remained standing, would we then say that Wiles had wasted his career? No, far from it; there’s honor in pursuing a research path to its end; a world in which the now-notorious business school professor who studied eating behavior had admitted from the very beginning—it is now at least six years since he first heard from outsiders about the crippling problems with his published empirical work—that he had no control over the data reported in his papers, had stopped trying to maintain that all his claims were valid, and had instead worked with colleagues to design careful experiments with clean data pipelines and transparent analyses; a world in which that controversial environmental economist had taken the criticism of his work to heart, had started over instead of staying in debate mode, and, instead of continuing to exercise his talent for getting problematic papers published in good journals, had decided to spend a couple of years disentangling the climate-and-economics models he’d been treating as data points and really working out their implications; a world in which the dozens of researchers who had prominent replication failures or serious flaws in their published work had followed the leaders mentioned above and used this adversity as an opportunity for reflection and improvement, thanking their replicators and critics, as an aside, for going to the trouble of taking their work seriously enough to find its problems; a world in which thousands of researchers whose work hadn’t been checked by others had gone and checked it themselves, not wanting to publish claims that would not replicate.
In this alternative world, there’s no replication crisis at all, just a gradual reassessment of past work, leading gently into a new paradigm of careful measurement and within-person comparison.
Why the revolution?
We tend to think of revolutions as inevitable. The old regime won’t budge, the newcomers want to install a new system, hence a revolution. Or, in scientific terms, we assume there’s no way to reconcile an old paradigm with a new one.
In the case of the replication crisis, the old paradigm is to gather any old data, find statistical significance in a series of experiments, and then publish and publicize the results. The experiments are important, the conclusions are important, but the actual gathering of data is pretty arbitrary. In the new paradigm, the connection of measurement to theory is much more important. On the other hand, the new paradigm is not entirely new, if we consider fields such as psychometrics.
As remarked above, I don’t think the revolution had to happen; I feel that we could’ve gone from point A to point B in a more peaceful way.
So, why the revolution? Why not just incremental corrections and adjustments? Research team X does a study that gets some attention, others follow up and apparently confirm it. But then, a few years later, team Y comes along with an attempted replication that fails. Later unsuccessful replications follow, along with retrospective close readings of the original papers that reveal forking paths and open-ended theories. So far, no problem. This is just “normal science,” right?
So here’s my guess as to what happened. The reform became a revolution as a result of the actions of the reactionaries.
Part of the difficulty was technical: statistics is hard, and when the first ideas of reform came out, it was easy for researchers to naively think that statistical significance trumped all objections. After all, if you write a paper with 9 different experiments, and each has a statistically significant p-value, then the probability of all that success, if really there were no effect, is (1/20)^9. That’s a tiny number which at first glance would seem impervious to technicalities of multiple comparisons. Actually, though, no: forking paths multiply as fast as p-values. But it took years of exposure to the ideas of Ed Vul, Hal Pashler, Greg Francis, Uri Simonsohn, and others to get this point across.
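Here’s a toy simulation to illustrate that arithmetic. It models forking paths crudely, as an explicit search over 20 possible comparisons per experiment (in practice the forks are often implicit choices rather than literal multiple tests, but the numbers come out similarly); all of the specifics (20 paths, 9 experiments, 50 observations per group) are made up for illustration.

```python
# Toy forking-paths simulation: all outcomes are pure noise, but each
# experiment measures 20 outcomes and the analyst reports whichever
# comparison happens to reach p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_experiments, n_paths, n_per_group = 10_000, 9, 20, 50

def experiment_succeeds():
    """One experiment: two groups, 20 noise outcomes; did any test give p < 0.05?"""
    a = rng.normal(size=(n_per_group, n_paths))
    b = rng.normal(size=(n_per_group, n_paths))   # no true differences anywhere
    pvalues = stats.ttest_ind(a, b, axis=0).pvalue
    return (pvalues < 0.05).any()

nine_for_nine = sum(all(experiment_succeeds() for _ in range(n_experiments))
                    for _ in range(n_sims))

print(f"Chance a single experiment 'works' under the null: ~{1 - 0.95**n_paths:.2f}")
print(f"Papers reporting 9 for 9 significant results from noise alone: {nine_for_nine} of {n_sims}")
# The naive calculation (1/20)^9 would predict about 2 in a trillion.
```

With this much flexibility, something on the order of a couple percent of pure-noise research programs end up with a perfect 9-for-9 record, which is many orders of magnitude more than the naive calculation suggests.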
Another difficulty is attachment to particular scientific theories or hypotheses. One way I’ve been trying to help with this one is to separate the scientific models from the statistical models. Sometimes you gotta get quantitative. For example, centuries of analysis of sex ratios tell us that variations in the percentage of girl births are small. So theories along these lines will have to predict small effects. This doesn’t make the theories wrong, it just implies that we can’t discover them from a small survey, and it should also open us up to the possibility of correlations that are positive for some populations in some settings, and negative in others. Similarly in fMRI studies, or social psychology, or whatever: The theories can have validity even if they can’t be tested in sloppy experiments. This could be taken as a negative message—some studies are just dead on arrival—but it can also be taken positively: just cos a particular experiment or set of experiments is too noisy to be useful, it doesn’t mean your theory is wrong.
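To put a number on “small survey,” here’s a back-of-the-envelope calculation with illustrative values (a baseline of roughly 48.8% girl births and a hypothesized difference of 0.3 percentage points; the standard two-sample-proportions formula and the 80%-power convention are conventional defaults, not anything specific to that study).

```python
# Rough sample-size calculation: how many births per group to detect a
# 0.3-percentage-point difference in the proportion of girl births?
from scipy.stats import norm

p = 0.488            # baseline proportion of girl births (approximate)
d = 0.003            # hypothesized difference: 0.3 percentage points
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # about 1.96
z_beta = norm.ppf(power)            # about 0.84

n_per_group = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / d ** 2
print(f"Roughly {n_per_group:,.0f} births per group")   # hundreds of thousands
```

A survey of a few thousand people is simply not in the right ballpark, no matter how the data are analyzed.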
To stand by bad research just because you love your scientific theory: that’s a mistake. Almost always, the bad research is noisy, inconclusive research: careful reanalysis or failed replication does not prove the theory wrong, it just demonstrates that the original experiment did not prove the theory to be correct. So if you love your theory (for reasons other than its apparent success in a noisy experiment), then fine, go for it. Use the tools of science to study it for real.
The final reason for the revolution is cost: the cost of giving up the old approach to science. That’s something that was puzzling me for a while.
My thinking went like this:
– For the old guard, sure, it’s awkward for them to write off some of the work they’ve been doing for decades—but they still have their jobs and their general reputations as leaders in their fields.
– For younger researchers, yes, it hurts to give up successes that are already in the bank, as it were, but they have future careers to consider, and so why not just take the hit, accept the sunk cost, and move on?
But then I realized that it’s not just a sunk cost; it’s also future costs. Think of it this way: If you’re a successful scientific researcher, you have a kind of formula or recipe, your own personal path to success. The path differs from scientist to scientist, but if you’re in academia, it involves publishing, ideally in top journals. In fields such as experimental biology and psychology, it typically involves designing and conducting experiments, obtaining statistically significant results, and tying them to theory. If you take this pathway away from a group of researchers—for example, by telling them that the studies that they’ve been doing, and that they’re experts in, are too noisy to be useful—then you’re not just wiping out their (reputational) savings, you’re also removing their path to future earnings. You’re not just taking away their golden eggs, you’re repossessing the goose they were counting on to lay more of them.
It’s still a bad idea for researchers to dodge criticism and to attack the critics who are trying so hard to help. But on some level, I understand it, given the cost both in having to write off past work and in losing the easy path to continuing future success.
Just remember that, for each of these people, there may well be three other young researchers who were doing careful, serious work but then didn’t get picked for a plum job or promotion because it was too hard to compete with other candidates who did sloppy but flashy work that got published in top journals. It goes both ways.
Summary (for now)
We are in the middle of a scientific revolution involving statistics and replication in many areas of science, moving from an old paradigm in which important discoveries are a regular, expected product of statistically significant p-values obtained from routine data collection and analysis, to a new paradigm of . . . weeelll, I’m not quite sure what the new paradigm is. I have some ideas related to quality control, and when it comes to the specifics of design, data collection, and analysis, I recommend careful measurement, within-person comparisons, and multilevel models. Compared to ten years ago, we have a much better sense of what can go wrong in a study, and a lot of good ideas of how to do better. What we’re still struggling with is the big picture, as we move away from the paradigm of routine discovery to a more continuous sense of scientific progress.