Psychology May Regain Some Of Its Lost Credibility
Claims in 2015 that half of all psychology papers are not reproducible may have been premature, new study shows
Psychology may have just regained some of its lost credibility. In a new study in the journal Science, researchers evaluate the 2015 claim that half of all psychology studies are so flawed they cannot be replicated—and beg to differ.
“They never had any chance of estimating the reproducibility of psychological science,” said Daniel Gilbert, professor of psychology at Harvard and coauthor on the new study that thrashes the 2015 paper’s methodology, in a press statement. “If they had used these same methods to sample people instead of studies, no reputable scientific journal would have published their findings.”
The disappointing story of the replication study that couldn’t be replicated begins in the summer of 2015, when psychologist Brian Nosek and his team at the Open Science Collaboration published a controversial paper, claiming that they had surveyed 100 prominent psychology papers and found they could replicate the findings of a mere 39 percent. The results rocked the research world.
Everyone covered it. Vocativ covered it twice. The implications were astounding. Scientists demanded stricter protocols for vetting studies and, incredibly, a few top journals actually responded by slapping new safeguards in place to prevent bad science from creeping into the mainstream. The kerfuffle even started a conversation about the integrity of statistical analyses and the ever-present specter of publication bias. More than one publication named Nosek’s study one of the most important science stories of 2015.
Unfortunately, the narrative began to crumble once other psychologists tried to replicate Nosek’s own findings, and couldn’t. “We show that this article contains three statistical errors and provides no support for such a conclusion,” the authors of the new study write. “Indeed, the data are consistent with the opposite conclusion, namely, that the reproducibility of psychological science is quite high.”
The authors of the new study stress that the nascent field of meta-science—the so-called “study of studies” that Nosek employed to test psychological reproducibility—has many of the same pitfalls as any other type of research. “Meta-science does not get a pass. It is not exempt,” said Gary King, psychologist at Harvard and coauthor on the study, in a press statement. “If you violate the basic rules of science, you get the wrong answer, and that’s what happened here.”
So, what exactly happened here? The authors spill a lot of ink on the statistical and methodological problems with Nosek’s study, but they have two central complaints:
Poor Replication By The Replicators
Most of us who read Nosek’s original study assumed that the researchers looked into the Methods section of each paper in question and carefully replicated the exact study, with as few deviations as possible, to see whether they’d arrive at the same conclusion. That’s how replication usually works.
But Gilbert claims that in the 2015 paper, “many of the replication studies differed in truly astounding ways—ways that make it hard to understand how they could even be called replications.”
For instance, one of the original studies that Nosek evaluated had assessed how white students at Stanford University respond to racism during an in-person discussion of affirmative action. When replicating the study, Nosek’s team asked Dutch students, some of whom barely spoke English, to watch a video recording of Stanford students speaking in English about affirmative action policies that had no relevance to them.
It gets worse. “The replicators realized that doing this study in the Netherlands might have been a problem, so they wisely decided to run another version of it in the U.S. And when they did, they basically replicated the original result,” Gilbert says. “And yet…they excluded the successful replication and included only the one from the University of Amsterdam that failed.”
Inexcusably Bad Statistics
When you run a scientific study, occasionally you arrive at the wrong conclusion. Any number of confounding factors or minor mistakes can cause these outcomes—that’s why we run studies multiple times and, in fact, that’s why replication is so important to science. It helps us prevent random results from becoming the basis for medications, theories or, in this case, psychological protocols.
“If you are going to replicate a hundred studies, some will fail by chance alone. That’s basic sampling theory,” King explains. “So you have to use statistics to estimate how many of the studies are expected to fail by chance alone, because otherwise the number that actually do fail is meaningless.”
As you may have guessed, Nosek and his team did not do that. Instead, they replicated each study only once—which means that, whenever Nosek’s results contradicted the original study, there was no way of knowing whether that happened because the original study was bad or because Nosek’s replication wasn’t up to snuff.
“So we did the calculation the right way and then applied it to their data. And guess what? The number of failures they observed was just about what you should expect to observe by chance alone—even if all one hundred of the original findings were true,” King said. “The failure of the replication studies to match the original studies was a failure of the replications, not of the originals.”
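King’s point about chance failures can be illustrated with a quick simulation. This is a hedged sketch, not the authors’ actual calculation: it assumes a hypothetical 80 percent statistical power per replication attempt (a figure chosen for demonstration, not taken from either paper) and shows that even if all 100 original findings were true, a single replication per study would still fail a sizable number of times by chance alone.

```python
import random

def expected_failures(n_studies=100, power=0.8, n_trials=10_000, seed=42):
    """Simulate replicating n_studies TRUE findings once each.

    Each replication detects the (real) effect with probability
    `power`, so it fails by chance with probability 1 - power.
    Returns the average number of failed replications across
    n_trials simulated runs of the whole project.
    """
    random.seed(seed)
    total_failures = 0
    for _ in range(n_trials):
        # Count studies whose single replication misses by chance.
        total_failures += sum(
            1 for _ in range(n_studies) if random.random() > power
        )
    return total_failures / n_trials

# With 80% power, roughly 20 of 100 true findings fail a single
# replication attempt purely by chance.
print(expected_failures())
```

The exact expected count is simply `n_studies * (1 - power)`, but the simulation makes the underlying intuition concrete: without an estimate like this as a baseline, a raw failure rate says nothing about whether the original findings were false.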
In response, Brian Nosek, author of the original 2015 paper, maintains that his statistics are sound and that many—if not most—psychological studies still cannot be replicated. He explains some of his thinking in a formal response, also published in Science, but the general takeaway is that, while Gilbert, King and colleagues may not have hit the nail on the head with every one of their sundry statistical complaints, Nosek seems to concede that his study could have been more robust.
“Both optimistic and pessimistic conclusions about reproducibility are possible, and neither are yet warranted,” write Nosek and his team at the Open Science Collaboration.
Gilbert and colleagues are also willing to meet Nosek halfway—they concede that it is unlikely the Open Science Collaboration intentionally fudged their data to make psychology look bad. “Let’s be clear: no one involved in this study was trying to deceive anyone,” Gilbert said. “This is not a personal attack, this is a scientific critique … We were glad to see that in their response to our comment, the OSC quibbled about a number of minor issues but conceded the major one, which is that their paper does not provide evidence for the pessimistic conclusions that most people have drawn from it.”
Still, Gilbert worries that Nosek and his team have unwittingly spread misconceptions about the field, and hopes that they will try to undo some of that harm. “This paper has had extraordinary impact,” Gilbert said. “So it is not enough now, in the sober light of retrospect, to say that mistakes were made. These mistakes had very serious repercussions. We hope the OSC will now work as hard to correct the public misperceptions of their findings as they did to produce the findings themselves.”