Almost 40% of peer-reviewed dietary research turns out to be wrong. Here’s why
iStock / CSA Images
There’s a reason everyone’s confused about whether coffee causes cancer, or whether butter’s good for you or bad. Food research has some big problems, as we’ve discussed here and here: questionable data, untrustworthy results, and pervasive bias (and not just on the part of Big Food). There’s reason to hope that scientists and academic journals will clean up their acts, and that journalists will refine their bullshit detectors and stop writing breathlessly about new nutrition “discoveries” that are anything but. Until that happens, though, we all need to get better at filtering for ourselves.
A pair of recent articles coming out of the statistical community offers a terrific tool for doing just that—not a long-term fix, but a little bit of much-needed protection while we wait for something better. To understand it, though, we’re going to have to dip our toes into some chilly mathematical waters. Stick with me. It won’t be too bad.
Let’s look at three recent reports of scientific findings about diet:
- Fifty grams of prunes a day prevents the loss of bone mineral density in elderly women with osteopenia.
- Forty-eight grams of dark chocolate modulates your brainwaves for the better.
- Feeding infants puréed pork causes them to put on more body length than feeding them dairy.
They’ve all been peer-reviewed. All the findings have been declared to be statistically significant. And they all imply a clear cause-and-effect between a common food and a health outcome. And yet we know that there’s a good chance that at least one of them—and maybe even all three—will subsequently be proven to be false. So which ones does it make the most sense to ignore?
When two wrongs make a right
Here’s the problem with many of the nutrition studies you’re likely to read about in the press: Like most research, they’re carried out using an incredibly counterintuitive method called “null hypothesis testing.”
It goes like this. First, you start with whatever it is you’d like to prove—say, that drug X cures cancer. But then, instead of trying to prove your hypothesis directly, which is virtually impossible in the real world, you posit its opposite. For example: “I’m trying to prove that any connection between using drug X and curing cancer is just a matter of random chance.” That somewhat confounding non-statement is your null hypothesis.
Then you run your experiment and analyze your numbers. If you’re lucky, you’ll find that your data would be very unlikely if the null hypothesis were true, so you reject it. (Confusing, right?) Put another way, you’ve concluded that the connection between drug X and cancer cures is probably not just a matter of chance. Therefore, the thinking goes, drug X must cure cancer.
Null hypothesis testing can be hard for a lay audience to comprehend or, ultimately, to swallow. And it’s hardly the only—or the best—way to construct an experiment, as any statistician will tell you. But for almost a hundred years, since the publication of Sir Ronald Aylmer Fisher’s incredibly influential Statistical Methods for Research Workers, it’s been the one that every budding scientist learns.
Which is partly how we’ve gotten into this mess.
What are the odds?
The key to making this strange system work is knowing how much evidence it takes to prove or disprove a null hypothesis. How do you know your results are statistically significant? There are actually lots of techniques for figuring that out, many of them recondite and complex. But the workhorse in much practical research is something called the P-value. (The P stands for “probability.”)
P-values are calculated using a combination of your experimental data and the assumptions you make in constructing the experiment. Values fall between zero and one. A low P-value is good: it means that your results are less likely to occur by chance. A high value is bad: it means that they were more likely to occur by chance.
For decades, there’s been a convention that test results with a P-value of 0.05 or lower are statistically significant, that is, worth believing. The way it’s typically explained is that a P of 0.05 means there is only a 5 percent chance—1 in 20—that the results you’re looking at would have been produced by chance. In other words, an acceptable chance.
The trouble is, that’s not what P = 0.05 actually means. It turns out there’s rather a large gap between the way statisticians define P and the way the rest of us use it. When you read about P-values in a nontechnical setting, you’ll encounter a lot of explanations that are fairly clear and reasonable, but apparently they’re all wrong. (If you want all the grisly details, they’re here.)
A lot of folks think that a P-value of 0.05 means that there’s a 95 percent chance that your hypothesis (the one you’re really testing, not the null hypothesis) is true. But that’s not actually the case. It’s a little technical, but a P-value only refers to the probability that you’d get results at least as extreme as yours if the null hypothesis and all the other assumptions you made going into the project were true.
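To make that definition concrete, here is a minimal sketch built on a toy example that does not come from any of the studies above: a coin flipped 20 times that lands heads 16 times. The coin, the numbers, and the function name `two_sided_p_value` are all illustrative assumptions. The null hypothesis is that the coin is fair, and the P-value is the probability of seeing a result at least that lopsided if the null hypothesis is true.

```python
from math import comb


def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact P-value for a fair-coin null hypothesis (toy illustration).

    Adds up the probability of every outcome that is at least as far
    from flips/2 as the observed number of heads, assuming a fair coin.
    """
    observed_distance = abs(heads - flips / 2)
    extreme_outcomes = 0
    for k in range(flips + 1):
        # Count outcomes as lopsided as, or more lopsided than, what we saw.
        if abs(k - flips / 2) >= observed_distance:
            extreme_outcomes += comb(flips, k)
    # Under a fair coin, each of the 2**flips sequences is equally likely.
    return extreme_outcomes / 2 ** flips


p = two_sided_p_value(16, 20)
print(f"P = {p:.4f}")
```

For this toy case the result is roughly 0.012. That number says how surprising 16 heads would be if the coin were fair. It does not say there is a 98.8 percent chance that the coin is rigged.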
“Not only does a P-value not tell us whether the hypothesis targeted for testing is true or not; it says nothing specifically related to that hypothesis unless we can be completely assured that every other assumption used for its computation is correct—an assurance that is lacking in far too many studies,” write a group of scientists concerned about the widespread misuse of P-value, in The American Statistician.
It may sound crazy, but don’t forget, when the statistician John Ioannidis studied the actual accuracy of highly cited scientific papers in major journals, he got even worse results: Almost 40 percent of the studies he looked at were subsequently proven wrong. Should we be guiding our lives with this stuff? Hardly.
A Better P, or Something Better than P?
To be fair, the statistics community has known about the P problem for decades. The American Statistician statement I referred to above strongly denounced the incorrect (and incredibly common) use of P-values to judge the validity of hypotheses. The article offered some alternatives, which are too technical to get into here. But any solution is going to take a concerted effort, one that will require hundreds of thousands of researchers and the journals they publish in to start taking the statistical part of science a lot more seriously, and it will likely take decades to implement.
So what do we do until then?
This January, 72 luminaries in statistics made a proposal in Nature Human Behaviour. Since we can’t quickly (or maybe ever) eliminate null hypothesis testing based on P-values, let’s at least shift to a more appropriate threshold: Instead of granting statistical significance to a P of 0.05, let’s use 0.005, a half percent instead of 5 percent. Research with a P between 0.005 and 0.05 would be regarded as “suggestive” rather than significant. (The suggestion applies specifically to new discoveries. Follow-on research, where knowledge is deeper, would be treated differently.) The step, argued the authors, should cut the false positive rate down to around 5 percent, which is the rate we thought we were getting with P = 0.05. Last month, John Ioannidis, an influential guy, took the argument to the pages of the Journal of the American Medical Association. It’s a temporizing solution, he argued, but a necessary one.
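The arithmetic behind a claim like that can be sketched in a few lines. The inputs below are illustrative assumptions, not figures taken from this article: suppose only 1 in 11 of the hypotheses scientists test is actually true (prior odds of 1 to 10), and that a study has an 80 percent chance of detecting an effect that is real. The function name `false_positive_rate` is likewise made up for this sketch.

```python
def false_positive_rate(alpha: float, prior_true: float, power: float) -> float:
    """Share of 'significant' findings that are actually false.

    alpha      -- significance threshold (for example 0.05 or 0.005)
    prior_true -- assumed fraction of tested hypotheses that are really true
    power      -- assumed chance that a study detects an effect that is real
    """
    false_hits = alpha * (1 - prior_true)  # null is true, but P < alpha anyway
    true_hits = power * prior_true         # effect is real and gets detected
    return false_hits / (false_hits + true_hits)


# Assumed inputs, for illustration only.
PRIOR_TRUE = 1 / 11
POWER = 0.8

for alpha in (0.05, 0.005):
    rate = false_positive_rate(alpha, PRIOR_TRUE, POWER)
    print(f"threshold {alpha}: {rate:.0%} of significant results are false")
```

Under those assumptions, the share of false positives falls from roughly 38 percent at the 0.05 threshold to roughly 6 percent at 0.005. Different guesses about prior odds and power will give different numbers, so treat this as a way to see the mechanism and not as a measurement.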
Will scientists and journals take this straightforward step to improve health and medical science (and of course nutrition) research? We’ll see. A couple of journals have already taken the step. And some specialties already use much more rigorous standards. Population genomics, for example, uses a cutoff of 0.00000005.
Repeat After Me: I Don’t Care
Now, I’m pretty sure that the 72 statistical luminaries meant their advice for scientists and scientific journals. But until those folks get with the program, we might as well use it ourselves.
What would that mean for us? Well, let’s look at the studies we began with:
- Prunes preventing the loss of bone mineral density in elderly women with osteopenia. P < 0.05. Suggestive, but not significant. Not interested.
- Chocolate for your brainwaves? They looked at various brain waves, and the strongest statistical association had a P of 0.01. And they only tested four people. Not interested.
- Feeding infants puréed pork and increased body length? Trick question. P = 0.001—so something really seems to be happening. But does greater body length in infants actually matter? We’ll let the scientists pursue this one on their own until they come up with something that’s not just statistically significant but meaningful.
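The sorting rule used in the list above can be written down as a tiny sketch. The function name `label_finding` and the exact handling of values that land right on a cutoff are assumptions made for illustration. The rule looks only at the P-value, so it cannot catch the other objections raised above, such as a sample of four people or an outcome that may not matter.

```python
SIGNIFICANT_CUTOFF = 0.005  # proposed threshold for new discoveries
SUGGESTIVE_CUTOFF = 0.05    # the conventional threshold


def label_finding(p_value: float) -> str:
    """Label a reported P-value using the proposed 0.005 rule."""
    if p_value < SIGNIFICANT_CUTOFF:
        return "significant"
    if p_value < SUGGESTIVE_CUTOFF:
        return "suggestive"
    return "neither"


# P-values as reported in the list above; the prunes study is left out
# because only "P < 0.05" is given for it, not an exact value.
reported = [
    ("chocolate and brainwaves", 0.01),
    ("puréed pork and body length", 0.001),
]

for study, p in reported:
    print(f"{study}: P = {p} -> {label_finding(p)}")
```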
Let’s be clear: I’m not saying that all research with a P greater than 0.005 is false. It’s not. And as time goes along, studies are going to be published with much more sophisticated statistical analyses that will render the P < 0.005 strategy irrelevant. (Check back at that point and we’ll try to catch you up.)
Meanwhile, of course, you can eat whatever you please. The point isn’t to prevent you from snacking on prunes and chocolate while you shovel puréed pork into the baby. Do it if you want to, and given the marvelous powers of the placebo effect, you’ll probably be happy you did. But stop treating studies like these as if they contain the truth. They may be a step on the path, but in many cases, the total voyage will be long and uncertain.
Your job here, should you choose to accept it, is to ignore a huge percentage of the food research you read about. You want to send the message that journalists need to find better cheap, cute stories to lure you in, that university press departments need to find better subjects for press releases, and that you’re not going to put up with any hanky-panky from food bloggers and TV doctors. Until you see that 0.005, you need to be a cold, immovable lump of stone.
Sure it sounds harsh, but it’s for science. Are you with me?