The corpus of scientific literature needs a drastic clean-up
Like every other practising scientist, I have a rather fraught relationship with the literature. I go to the vast collection of chemistry and biology papers for information, for instruction, and for answers, because where else is there? But the whole time I’m searching through it, I find myself wondering first what important things I might be missing, and second, how much of what I’m reading is incorrect (or at best incomplete).
We’ve all come across papers that are wrong. And of those wrong ones, a few go beyond the ‘honest mistake’ category into outright fraud. That mass of false or unreliable data makes the task of extracting useful guidance even more complicated. Scientists have been telling themselves for a long time that the literature is self-correcting. And while that statement is not completely wrong, it isn’t completely right, either. Not by a long way.
To begin with, there are naturally a huge number of older, outmoded papers that most of us don’t even seriously consider. If the only report of some molecule is from 1928, I am unlikely to track it down to see how the authors tried to characterise it by melting point, appearance, and probably by taste. I would assume that the machine learning algorithms are given a cutoff to make the really old literature vanish. But there are more recent difficulties. For example, a great deal of the kinase selectivity literature from the 1990s and early 2000s is quite unreliable, because there weren’t enough selective inhibitors to really trust those numbers. I would not want my Grand Kinase AnswerBot either to be trained on those numbers or to try to extract any sense from them.
But there are larger and more subtle problems than wrong or outmoded papers. I wrote here last year about the problems with negative data (not enough of it, because we pretend that we don’t have much) and biases in selection of reactions and reagents. The molecular and cell biology literature has the same issues, and on top of that those fields honestly have many more variables than organic chemistry. Two papers can look as if they’re talking about the same thing, but differences in cell lines, culture conditions, protein purification and refolding, assay buffers, orders of addition and many more can make their results non-comparable.
So what’s to be done? Perhaps some idle billionaire could throw some money at the idea of a Revised Scientific Corpus, a literature collection stripped of faked Western blots, non-reproducible reaction conditions, outdated assays, and control compounds that we now know don’t control for much of anything. As appealing as that idea is (from some angles, anyway) it would become unworkable rather quickly: after what time period or number of citations, for example, do we consider a paper reliable? Curation would, by definition, be a continuous task. This would be as far from a one-and-done cleanup as one could imagine. I mean, the dumper trucks are rumbling in and depositing piles of new results every day of the week.
A more lasting solution would be to address some of the underlying reasons that the literature has so many problematic regions. I would argue that there are too many papers being published overall, and they are being written too quickly to fill the pages of too many repetitious journal titles. And the reason for all this is that too many institutions make publishing – widely and often – a major criterion for hiring and advancement. This is particularly a problem in academia, although it’s not limited to that area, and it’s particularly a problem in some countries that have made explicit paper-counting a part of their system. I cannot blame people for trying to get a job, keep a job, or get promoted at said job, but I can blame the near-mindless use of publications as a proxy for evaluation. You want papers? You insist on papers? Papers, then, are what you will get, and plenty of them. They may be ground out by AI-assisted ‘paper-mill’ operations, with authorship sold to the highest bidders and favourable referee reports organised in advance for an additional fee, but by gosh, they are papers.
That’s how we have put ourselves in the situation we’re in. It would be better for science, better for human knowledge, and also better for all the people devoting their time to such efforts, if we found another way to do it all.