They barely even begin to address the problem
Last month the journal Science published an article titled “Fraud, so much fraud”. It discusses the magazine's exposé of Eliezer Masliah, an Alzheimer's scientist and US government official who has been accused of systematically faking over 25 years' worth of research. Such scandals are now common, like this one at a big cancer research center and this one at another Alzheimer's lab.
The problem was discussed on Hacker News, where people exchanged ideas for how to improve the situation. In these debates someone will usually suggest an intuitive-sounding fix: the government should fund scientists to repeat other people's experiments, just like it funds any other kind of study. For example,
“We can solve this at the grant level. Stipulate that for every new paper a group publishes from a grant, that group must also publish a replication of an existing finding.”
This sounds like it should work; the science crisis is usually called the replication crisis, after all. It stands to reason that having an independent team do the same experiment and get the same results should prove the absence of problems with that study.
This idea is reasonable. Unfortunately, science isn’t.
In the past eight years or so I’ve read a lot of junk-tier scientific papers. It wasn’t intentional — there’s just so much out there that if you dig into claims found in mass media you’re going to encounter pseudoscience very quickly. I’ve written on this blog about the so-called “questionable research practices” in medicine, epidemiology (twice), “twitter bot studies” (also twice), archaeology and PCR testing. The weird selection of fields just reflects my personal interests — I don’t actually go looking for this stuff.
Of the bad claims I've seen, almost none would have been prevented by a replication study, and some would have been made worse by one. This is a very counter-intuitive claim. How can replication not help, let alone make things worse? Well, there's a bunch of reasons. Don't get me wrong: replication isn't entirely useless, but it only helps rarely, and doing more of it could actually pour fuel on the fire.
Replicating wrongness = more wrongness
The first and biggest problem with replication is that it only helps in one very specific scenario. You need to have a paper which:
- Makes an important, interesting claim that would be Big If True.
- Has a logical, detailed, and scientific methodology clearly derived from the claim.
- Is free of other obvious flaws.
This is a low bar. Some fields are better at clearing it than others. Computer science usually produces papers that pole-vault over it, thank goodness. But in many other fields the median paper doesn’t even reach the point where replicating it would make sense. Instead these papers:
- Make trivial claims, e.g. that the average man would like to be more muscular or that people choose their clothes based on how warm it is (“It is evident that further studies are needed in this field” 🤮).
- Have nonsensical methodologies, like defining anyone who tweets five times after midnight as a “bot”, or ad-hoc criteria presented “just so” instead of a description of how they were derived.
- Are full of obvious errors that render replication pointless. This COVID paper has 19 authors yet the very first sentence is a false claim about public statistics.
Replicating the definition of replicating
The second problem is that people don’t agree on what the word replication means. It might seem obvious: surely it’s “we did the same things and got the same results”? But when careers are on the line, people find ways to disagree.
One case arises when scientists make a claim of the form “we did this thing and that other thing changed in a big way”. Someone tries to replicate this and they find the other thing changed but only by a bit, or they discover the effect fades with time (this is especially common in education studies for some reason).
Is this a successful replication? Scientists are often tempted to argue it is — after all, the same effect appeared and the natural world is inherently noisy. You’ll never get exactly the same numbers. But often the reason a paper was interesting was its claim of a big and important effect. If it shrinks to the point nobody would care, it’s not really informative to give the paper the “it replicates” stamp of approval. Doing that would just incentivise the exaggeration of real but small effects.
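To see why a lenient “same direction counts” standard rewards exaggeration, here is a toy simulation (my own illustrative numbers, not drawn from any of the studies discussed). If journals only publish effects that cross the significance threshold, the published estimates are badly inflated, and an honest replication will reliably come in much smaller even though the effect is real:

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1        # a small but real effect, in standard-error units
n_studies = 100_000

# Each original study estimates the effect with unit standard error.
original = rng.normal(true_effect, 1.0, n_studies)

# Journals only publish "significant" positive results (z > 1.96).
published = original[original > 1.96]

# Honest replications of the published studies: same design, fresh noise.
replication = rng.normal(true_effect, 1.0, published.size)

print(f"true effect:           {true_effect:.2f}")
print(f"mean published effect: {published.mean():.2f}")    # far above 0.1
print(f"mean replication:      {replication.mean():.2f}")  # back near 0.1
```

Every one of those replications “finds the same effect” in the loose sense, yet waving them through as successes would certify published estimates that are more than an order of magnitude too large.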
Another frequent case is where a study fails to replicate but the replicators changed some aspect of the method they felt was unimportant. The original authors point to the difference and say that’s why it didn’t work.
But the most common replicability problem I've seen is that the methodology isn't actually derived from the hypothesis. The study design itself isn't replicable. Examples litter the field of misinformation research. Academics in this field like proving claims of the form, “conservatives believe more conspiracy theories than liberals”. But what exactly is a conspiracy theory? Invariably these studies define conspiracy theory as anything in an ad hoc list of items. There's no explanation of how the list was created, and such lists don't include things believed by liberals, so the whole exercise becomes circular: conservatives believe conspiracy theories because conspiracy theories are defined as things conservatives believe. It's just a form of scientific fraud, so a replication that followed the written methodology would simply perpetuate it, and a replication that didn't would be rejected by the original researchers as illegitimate.
The eternal sunshine of the spotless mind
A less common but more damning situation occurs when scientists pretend something replicated even though it didn’t.
In 2020 Neil Ferguson shocked the world with a paper claiming COVID would spiral out of control unless there was an immediate and draconian lockdown. The prediction of viral spread wasn’t from a model formally written down in some published paper: it turned out to exist only in a single unpublished program. He helpfully explained that he never revealed the code before because it was “all in my head, completely undocumented. Nobody would be able to use it”.
Perhaps another explanation is that the program simply did not work. I wrote an article laying out the gruesome details. Bugs created unintentional non-determinism: running the simulation on a different computer would calculate totally different results for identical scenarios. It was as if two people opened the same spreadsheet in Excel and saw totally different numbers.
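For readers who haven't met this class of bug: floating-point addition isn't associative, so anything that perturbs the order of operations (thread scheduling, uninitialized memory, iteration over unordered containers) changes the answer. A minimal sketch of the failure mode, not CovidSim's actual code:

```python
import random

# The same multiset of inputs, but the order of summation varies,
# standing in for a thread-scheduling or memory-layout bug.
values = [1e16, 1.0, -1e16, 1.0] * 1000

def simulate(ordering_seed):
    shuffled = values[:]
    random.Random(ordering_seed).shuffle(shuffled)
    # Running totals near 1e16 absorb the 1.0s added to them, so the
    # final sum depends on the order in which the terms arrive.
    return sum(shuffled)

print(simulate(1))
print(simulate(2))  # same inputs, different total
```

A correct deterministic simulation would print the same total twice; here the answer is an artifact of ordering, which is the kind of divergence running CovidSim on different machines exposed.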
This case is interesting because it’s a slam-dunk failure to replicate. There is quite simply no wriggle room here: two people solving the same equations with the same inputs should get the same results. If they don’t then something has gone horribly wrong.
It should have been a massive scandal. It wasn't, because Imperial College London published a press release saying a replication study was done and then sent it to some pet journalists, who promptly reported that the concerns were a false alarm. The supposed replication starts by saying “I was able to reproduce the results from Report 9”, but then the very same paragraph admits that every number the author got was different, some by up to 25%. He even admitted that one reason for the differences was that “the CovidSim codebase is now deterministic”, i.e. he didn't try to replicate the version he was supposed to be replicating, because that version was inherently non-replicable to begin with.
In fields like public health the whole concept of replication seems to be barely understood. Circular logic runs rampant. It’s common to conflate replication and validation by claiming that a model is valid if its outputs roughly match the output of other models. Things taken for granted in machine learning, like withholding data to create a test set, are considered controversial in public health (see the section “Comparisons with external data” in that last link).
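For contrast, here is the kind of held-out validation that is table stakes in machine learning, shown on toy data I made up for illustration: fit the model on one part of the record and score it only on observations it never saw, rather than against the output of another model built from the same assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical time series: a linear trend plus noise.
t = np.arange(100)
cases = 50 + 3 * t + rng.normal(0, 10, t.size)

# Withhold the last 20 observations as a test set.
train_t, test_t = t[:80], t[80:]
train_cases, test_cases = cases[:80], cases[80:]

# Fit a simple trend model on the training window only.
coeffs = np.polyfit(train_t, train_cases, deg=1)
pred = np.polyval(coeffs, test_t)

# Score against data the model never touched.
rmse = np.sqrt(np.mean((pred - test_cases) ** 2))
print(f"held-out RMSE: {rmse:.1f}")
```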
Pushing on wet string
The Ferguson case reveals the core problem with funding replication studies. The idea takes as axiomatic that bad scientists are rare, and so the chance of a replicator being an honest scientist is high. Yet in reality there are no consequences for anything, ever, and so the dishonest rise to the top. You just assert you followed the rules and everyone believes you!
A fascinating, bizarre, stupid and frustrating example of this problem is the recently published paper, “High replicability of newly discovered social-behavioural findings is achievable”. They claimed they fixed the replication crisis by using better methods! Big If True. The paper has 17 authors, some of whom are involved with the science reform movement or have even accused other scientists of fraud, so this sounded very promising. Surely it will be a seminal paper worth rea….
Oh. Why was it retracted? Ah, y’know, because they did every single unscientific thing that’s known to create non-replicable papers. They p-hacked the study by coming up with a different hypothesis after collecting their data, a practice that’s supposed to be stopped by pre-registering their plans, which they did … but then they “mis-stated” the contents of their pre-registration. Of course nobody at the journal checked.
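Why does picking your hypothesis after seeing the data matter so much? A quick simulation of the generic failure mode (my own illustration, not their actual analysis): give yourself 20 candidate outcomes with no real effect on any of them, and “something significant” turns up most of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_subjects, n_outcomes, trials = 50, 20, 2_000
false_positives = 0

for _ in range(trials):
    # Pure noise: the "treatment" does nothing to any outcome.
    treatment = rng.normal(0, 1, (n_subjects, n_outcomes))
    control = rng.normal(0, 1, (n_subjects, n_outcomes))
    # The p-hack: test every outcome, then report whichever looks best.
    pvalues = stats.ttest_ind(treatment, control).pvalue
    if pvalues.min() < 0.05:
        false_positives += 1

print(f"rate of 'significant' findings: {false_positives / trials:.0%}")
# roughly 64%, not the nominal 5%
```

Pre-registration exists precisely to close this loophole, which is why mis-stating a pre-registration defeats the entire point of having one.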
So this is a study arguing for rigorous methods that breaks the very rules it's supposedly fighting for. And yet somehow, it goes downhill from there. If you have a weak constitution stop reading now, because the original study they planned was an attempt to test the existence of paranormal forces that screw with scientists' experiments. Yes, Virginia. These wannabe Mulders hypothesized that maybe social science has poor replicability not because lots of them suck at science but because the act of studying a phenomenon somehow makes it disappear.
Perhaps you think this part is so absurd I must be making it up, so here’s an explanation by some of the authors:
“Other presenters offered unconventional explanations [for observing weaker evidence in replications], such as the act of observation making effects decline over time, possibilities that Schooler believed worthy of empirical investigation . . . and others have dismissed as inconsistent with the present understanding of physics.”
The researchers helpfully explain that they knew this was stupid but went along with it to get money from Schooler:
“While the unconventional explanations were not considered plausible (or even possible) to most of the team, they agreed on an approach that included tests of those possibilities”
The guy who discovered all of this wrote a long blog post describing how hard it was to bring this “utterly batshit supernatural framing” to light, and how everyone seemed to minimize the problems or pretend it was all an accident (which he does not agree it was). Andrew Gelman also has a long writeup of this disaster.
Conclusion
Any proposal for science reform that involves paying for replication studies needs to first address at least these problems:
- Many papers don't make sense to replicate because they're pointless, can be proven wrong just by reading them, or have methodologies whose derivations are themselves non-replicable.
- Some fields use weak, nonsensical or circular definitions of “replicate”.
- Some academics are willing to claim things replicate even if they don’t in order to preserve the influence of their field.
- Scientists who make a big deal out of their high standards might actually not have any.
What's at the root of all this? At some point we must look to the incentives as the cause. Science exists in a strange Soviet-esque system in which philanthropists and taxpayers pour funds into a planned economy. Perhaps the odd inability to get rid of paranormal research is unsurprising in that context, as the Soviets were also big into parapsychology, funding it to the tune of half a billion dollars per year.
Without practical technological or commercial goals to ground it, scientific exploration can easily enter a downward spiral of ever-lower standards, until those standards eventually disappear entirely. There's no One Weird Trick to fix bad incentives; the only fix is to change them. Although it's tempting to look around for ways to let governments continue spraying universities with cash, it's worth remembering that we've only been mass-funding academic science since the end of World War 2. Some of the most productive and innovative periods of history predate the expansion of universities — the Industrial Revolution was powered almost entirely by individual inventors and companies protected by patents. Is it time to revisit this model?