People pointed me to various recent news articles on the retirement from the Cornell University business school of eating-behavior researcher and retraction king Brian Wansink.
I particularly liked this article by David Randall—not because he quoted me, but because he crisply laid out the key issues:
The irreproducibility crisis cost Brian Wansink his job. Over a 25-year career, Mr. Wansink developed an international reputation as an expert on eating behavior. He was the main popularizer of the notion that large portions lead inevitably to overeating. But Mr. Wansink resigned last week . . . after an investigative faculty committee found he had committed a litany of academic breaches: “misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results” and more. . . . Mr. Wansink’s fall from grace began with a 2016 blog post . . . [which] prompted a small group of skeptics to take a hard look at Mr. Wansink’s past scholarship. Their analysis, published in January 2017, turned up an astonishing variety and quantity of errors in his statistical procedures and data. . . . A generation of Mr. Wansink’s journal editors and fellow scientists failed to notice anything wrong with his research—a powerful indictment of the current system of academic peer review, in which only subject-matter experts are invited to comment on a paper before publication. . . . P-hacking, cherry-picking data and other arbitrary techniques have sadly become standard practices for scientists seeking publishable results. Many scientists do these things inadvertently [emphasis added], not realizing that the way they work is likely to lead to irreplicable results. Let something good come from Mr. Wansink’s downfall.
But some other reports missed the point, in a way that I’ve discussed before: they’re focusing on “p-hacking” and bad behavior rather than the larger problem of researchers expecting routine discovery.
Consider, for example, this news article by Brett Dahlberg:
The fall of a prominent food and marketing researcher may be a cautionary tale for scientists who are tempted to manipulate data and chase headlines.
I mean, sure, manipulating data is bad. But (a) there’s nothing wrong with chasing headlines, if you think you have an important message to share, and (b) I think the big problem is not overt “manipulation” but, rather, researchers fooling themselves.
Indeed, I fear that a focus on misdeeds will have two negative consequences: first, when people point out flawed statistical practices being done by honest and well-meaning scientists, critics might associate this with “p-hacking” and “cheating,” unfairly implying that these errors are the result of bad intent; and, conversely, various researchers who are using poor scientific practices but are not cheating will think that, just because they’re not falsifying data or “p-hacking,” their research is just fine.
So, for both these reasons, I like the framing of the garden of forking paths: Why multiple comparisons, or multiple potential comparisons, can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.
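To make the forking-paths idea concrete, here’s a minimal simulation sketch (the five candidate outcomes, thirty subjects per group, and all variable names are illustrative assumptions, not a description of any actual study). The simulated analyst runs exactly one significance test per dataset and fakes nothing, but the choice of which outcome to test depends on the data, and that alone pushes the error rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group, n_outcomes = 5000, 30, 5
false_positives = 0

for _ in range(n_sims):
    # Pure noise: the "treatment" does nothing, and both groups are measured
    # on several plausible outcomes (calories, satiety ratings, etc.).
    treat = rng.normal(size=(n_per_group, n_outcomes))
    control = rng.normal(size=(n_per_group, n_outcomes))

    # The forking path: look at the data first, then run a single t-test on
    # whichever outcome happens to show the largest raw difference.
    gaps = treat.mean(axis=0) - control.mean(axis=0)
    k = np.argmax(np.abs(gaps))
    _, p = stats.ttest_ind(treat[:, k], control[:, k])
    false_positives += (p < 0.05)

print(f"nominal error rate: 0.05, actual: {false_positives / n_sims:.3f}")
# With no fabrication and only one test per dataset, this typically comes
# out around 0.2: the data-dependent choice of comparison does the damage.
```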
Let me now continue with the news article under discussion, where Dahlberg writes:
The gold standard of scientific studies is to make a single hypothesis, gather data to test it, and analyze the results to see if it holds up. By Wansink’s own admission in the blog post, that’s not what happened in his lab.
Nononononono. Sometimes it’s fine to make a single hypothesis and test it. But work in a field such as eating behavior is inherently speculative. We’re not talking about general relativity or other theories of physics that make precise predictions that can be tested; this is inherently a field where the theories are fuzzy and we can learn from data. This is not a criticism of behavioral research; it’s just the way it is.
Indeed, I’d say that one problem with various flawed statistical methods associated with hypothesis testing is the mistaken identification of vague hypotheses such as “people eat more when they’re served in large bowls” with specific statistical models of particular experiments.
Please please please please please let’s move beyond the middle-school science-lab story of the precise hypothesis and accept that, except in rare instances, a single study in social science will not be definitive. The gold standard is not a test of a single hypothesis; the gold standard is a clearly defined and executed study with high-quality data that leads to an increase in our understanding.
Dahlberg continues:
To understand p-hacking, you need to understand p-values. P-values are how researchers measure the likelihood that a result in an experiment did not happen due to random chance. They’re the odds, for example, that your new diet is what caused you to lose weight, as opposed to natural background fluctuations in myriad bodily functions.
Look. If you don’t understand p-values, that’s fine. Lots of people don’t understand p-values. I think p-values are pretty much a waste of time. But for chrissake, if you don’t know what a p-value is, don’t try to explain it!
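For the record, the textbook definition is this: a p-value is the probability, computed under the assumption that the null hypothesis is true, of seeing data at least as extreme as what was actually observed. It is not the odds that your diet, rather than chance, caused your weight loss. Here’s a minimal sketch that acts the definition out by simulation (the group sizes and effect are made-up numbers for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
diet = rng.normal(0.3, 1, size=25)        # hypothetical "new diet" group
comparison = rng.normal(0.0, 1, size=25)  # hypothetical comparison group
t_obs, p_analytic = stats.ttest_ind(diet, comparison)

# The definition, acted out: simulate a world where the null is true (the
# diet does nothing) and ask how often the t-statistic comes out at least
# as extreme as the one observed.
null_ts = np.array([
    stats.ttest_ind(rng.normal(0, 1, size=25), rng.normal(0, 1, size=25))[0]
    for _ in range(10000)
])
p_simulated = np.mean(np.abs(null_ts) >= abs(t_obs))

print(f"analytic p = {p_analytic:.3f}, simulated p = {p_simulated:.3f}")
```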
And then:
P-hacking is when researchers play with the data, often using complex statistical models, to arrive at results that look like they’re not random.
Again, my problem here is with the implication of intentionality. Also, what’s with the “often using complex statistical models”? Most of the examples of unreplicable research I’ve seen have used simple statistical comparisons and tests, nothing complex at all. Maybe the occasional regression model. Wansink was mostly t-tests and chi-squared tests, right?
The article continues, quoting a university administrator as saying:
We believe that the overwhelming majority of scientists are committed to rigorous and transparent work of the highest caliber.
Hmmmm. I think two things are being conflated here: procedural characteristics (rigor and transparency) and quality of research (work of the highest caliber). Unfortunately, honesty and transparency are not enough. And, again, I think there’s a problem when scientific errors are framed as moral errors. Sure, Wansink’s work had tons of problems, including a nearly complete lack of rigor and transparency. But lots of people run rigorous (in the sense of controlled experiments) and transparent studies and still do low-quality research, because they have noisy data and bad theories.
It’s fine to encourage good behavior and slam bad behavior—but let’s remember that lots of bad work is being done by good people.
Anyway, my point here is not to bang on the author of this news article, who I’m sure is doing his best. I’m writing this post because I keep seeing this moralistic framing of the replication crisis which I think is unhelpful.
Whassup?
People just love a story with good guys and bad guys, I guess.
Thinking of Wansink in particular: At some point he must either have realized he was doing something wrong, or else worked really hard to avoid confronting his errors. He’s made lots and lots of claims where he has no data, and he continues to drastically minimize the problems that have been found with his work.
But Wansink’s an unusual case. Lots of people out there are trying their best but still are doing junk science. And even Wansink might feel that his sloppiness and minimization of errors are in the cause of a greater good of improving public health.
Fundamentally I do think the problem is lack of understanding. Yes, cheating occurs, but the cheating is justified by the lack of understanding. People are taught that they are doing studies with 80% power (see here and here), so they think they should be routinely achieving success, and they do what it takes–and what they’ve seen other people do–to get that success.
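To see how far the 80%-power story can be from reality, here’s a back-of-the-envelope sketch using a plain normal approximation (the effect sizes and sample size are assumptions chosen for illustration, not estimates from any particular study): a design that really would have 80% power for an optimistic effect size has less than 10% power if the true effect is small, which is exactly the situation in noisy behavioral research.

```python
from scipy.stats import norm

def approx_power(true_effect, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test of a
    standardized mean difference (Cohen's d)."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = true_effect * (n_per_group / 2) ** 0.5  # noncentrality parameter
    return norm.cdf(-z_crit + ncp) + norm.cdf(-z_crit - ncp)

n = 64  # roughly the per-group sample size that gives 80% power IF d = 0.5
print(f"power if d = 0.5 (the planning assumption): {approx_power(0.5, n):.2f}")
print(f"power if d = 0.1 (a small effect, noisily measured): {approx_power(0.1, n):.2f}")
# Roughly 0.80 versus 0.09: the same study, 'powered at 80%' on paper,
# detects a realistically small effect less than one time in ten.
```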
Now, don’t get me wrong, I’m frustrated as hell when researchers hype their claims, dodge criticism, and even attack their critics—but I think this is all coming from these researchers living in a statistical fantasy world.
I was corresponding with Matthew Poes about this, and he responded:
I agree with you that the action is not typically intentional, or not done with ill intent. I started to take issue with this around 2010 when I took over the analyst team at a lab at the University of Illinois. It had become trendy to practically torture this dataset known as the Illinois Youth Survey (IYS) for some findings. Everything was correlated with everything and then grand theory-less models were run. While I had plenty to complain about here, my biggest fight was over the building of theory from what I saw as spurious correlations. Bizarre findings, likely driven by a handful of cases or even possibly random chance, were causing the chaining of characteristics into policy to reduce the drug problem among youth. I’ve discovered you as a result of that.

I made it a hallmark of my current research model that we expect model developers (of interventions) to start with measurable, fundamental active ingredients, and that the outcomes they choose to impact can clearly and logically be linked to the active ingredients through a sensible theory of change.

My most recent ranting issue has been with effectiveness trials, like QEDs making use of propensity score matching, and worse yet using them to further derive “unbiased” subgroup estimates. I don’t actually like that method much to begin with, but it’s being terribly abused right now. I don’t know your feelings on the matter. I’m currently trying to work with ACF to develop some standards around the use of statistical control-group balancing methods, as well as consider pulling together a committee to even decide whether they have merit. There are other matching methods I think hold promise, so I’m not totally against them, but at a minimum people need to use them appropriately or risk junk science.
As David Randall wrote, let something good come from Mr. Wansink’s downfall. Let’s hope that Wansink too can use his energy and creativity in a way that can benefit society. And, sure, it’s good for researchers to know that if you publish papers where the numbers don’t add up and you can’t produce your data, eventually your career may suffer for it. But what I’d really like is for researchers, and decision makers, to recognize some fundamental difficulties of science, to realize that statistics is not just a bunch of paperwork, and that following the conventional steps of research without sensible theory or good measurement is a recipe for disaster.
With clear enough thinking, the replication crisis never needed to have happened, because people would’ve realized that so many of these studies would be hopeless. But in the world as it is, we’ve learned a lot from failed replication attempts and careful examinations of particular research projects. So let that be the lesson, that even if you are an honest and well-meaning researcher who’s never “p-hacked,” you can fool yourself, along with thousands of colleagues, news reporters, and funders, into believing things that aren’t so.