As I write this, we have recently concluded the Year of Machine Learning Nobel Prizes. Who knows, perhaps there will be another in 20 years, but this was the first. As someone who has a reputation for being an AI/ML skeptic (which I sometimes think is the default category for anyone who actually reads the press releases all the way to the end), I should say immediately that I think that all of the prizes this year were well-deserved.

Machine learning can produce great insights when trained on high-quality data, but was unleashing it on the messy world of language a step too far?

That’s because machine learning truly is useful. The quality of the protein structures produced by modern software is extremely impressive, and having predicted structures available for almost any protein of interest has in turn made all sorts of new research possible. And there are many other examples in materials science and other fields where these techniques are of great use.

It’s important to remember, though, that there is a key feature shared by all of these marquee applications: the machine learning algorithms have been turned loose on large, diverse, and extremely well-curated data sets. That’s where the protein prediction results came from, by absorbing the experimental information gathered over many years in the Protein Data Bank. Large, high-quality data sets in crystallography, metallurgy, metal complexes, framework materials and more are equally good candidates for such an approach.

Sadly, not every area offers such well-suited material. The organic chemistry literature is probably one of the poor candidates: it has too many problems with reproducibility, too many missing negative results and too many subtle biases in reaction choice and reaction conditions to make it (as it stands) a really solid prospect for ML. Another terrible candidate (if you ask me) is human language, which brings us to the Large Language Models (LLMs) such as ChatGPT and its successors and competitors.

That may sound like a perverse thing to say, given the immense popularity of these systems and their growing pervasiveness. In the same way that the protein models picked up on how this amino acid followed by that one and then that one over there tend to make a turn that goes like this, LLMs have learned that this phrase tends to use nouns like these over here, and is often followed by a statement of this form with one of these verbs here, and so on. So what’s the difficulty?
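
To make that picture concrete, here is a deliberately crude sketch of pattern-matching at work: a bigram Markov chain trained on a few invented lines of text. Everything in it, corpus included, is made up for illustration, and no real LLM works at anything like this level, but the basic move of predicting the next token from the preceding ones is the same in spirit. It strings words together purely because they have followed one another before, which is exactly how a system can be locally fluent and globally wrong.

```python
# A toy bigram Markov chain: the crudest possible next-word pattern-matcher.
# The corpus is invented for illustration only.
import random
from collections import defaultdict

corpus = (
    "the french word for popover is not recorded here "
    "the yorkshire pudding is a popover by another name "
    "the french word for pudding is recorded elsewhere"
).split()

# Record which words follow each word in the training text.
following = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word].append(next_word)

def babble(start: str, length: int = 12) -> str:
    """Generate text by repeatedly sampling a word that genuinely
    followed the previous word somewhere in the corpus."""
    words = [start]
    for _ in range(length - 1):
        choices = following.get(words[-1])
        if not choices:
            break
        words.append(random.choice(choices))
    return " ".join(words)

print(babble("the"))
# Possible output: "the french word for pudding is a popover by another name"
# Locally plausible at every step, wrong as a whole.
```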

Well, language is a problem with far more degrees of freedom than protein structure. A protein sequence’s main purpose is to produce that structure, after all, but language serves many more subtle ends and has a huge array of vocabulary and grammatical forms in which to pursue them. Trying to pattern-match your way through this thicket produces superficially correct nonsense. For example, as I write, a Google search on the phrase ‘French word popover’ returns an AI summary that informs me confidently that the French word for ‘popover’ is in fact ‘Yorkshire pudding’. I’m sure that will be fixed by the time you read this – it had better be – but the underlying problem will still be there.

Perhaps it cannot be fixed at all. Human sentences have far more than twenty-one building blocks to choose from, unlike amino acid sequences, and words are not merely swappable modules. And there is a sort of conservation law at work: machine learning does not produce new knowledge so much as reveal knowledge that was hidden too subtly for us to grasp. No LLM can tell you something new except by accident. They are rearrangers of text that actual humans have written, and the degree to which they produce plausible sentences probably says something complimentary about their programmers and something uncomplimentary about the rest of us.

So long as ML algorithms are turned loose on well-bounded problems with a great deal of hard and accurate data to work with, they can produce plenty of worthwhile results. But we probably should have resisted the temptation to set them loose on language itself. There used to be a team that monitored internet content to track word frequencies in English and other languages, looking for changes over time. They have given up: too much of the content they were taking in had become chatbot-generated. We have poured so much sludge into that pool that it will never be clean again. Will we keep doing that?