Researchers at UK universities have adapted the text-based machine learning system GPT to predict crystal structures of inorganic materials, even though it can’t directly represent three-dimensional structures. Such an artificial intelligence (AI) model could eventually be faster than traditional prediction methods and doesn’t need to specifically learn physics or chemistry rules.

Image: Programming and crystal structure. Source: © Ella Maru Studio

The team’s tool, CrystaLLM, is a large language model (LLM) crystal structure predictor, initially described in a preprint in July 2023. It could help speed up material discovery, though it requires further training for specific tasks. Other scientists have already provided such training, adapting the model to predict how molecules adsorb on catalysts, explains Ricardo Grau-Crespo from the University of Reading, UK.

Grau-Crespo and University College London’s Keith Butler supervised PhD student Luis Antunes, who was working on materials chemistry problems in late 2022 when OpenAI released ChatGPT. He wondered how it would handle data representing how atoms are arranged in inorganic crystals. ‘My supervisors were sceptical it would work and so was I, to be honest,’ Antunes tells Chemistry World. ‘But after some promising initial results, we found that we were on to something.’

Antunes first converted the words, digits and punctuation marks of 2.2 million text-based crystallographic information files (CIFs) into numerical tokens. He then trained GPT-2 by giving it those tokens, asking it what should come next, and revising the model based on prediction inaccuracies. The revisions change the model’s parameters, values that reflect the strength of connections between ‘neurons’ in the AI system.
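To make the tokenisation step concrete, here is a minimal Python sketch of how CIF-like text might be split into tokens, mapped to integers and turned into next-token training examples. The fragment, the splitting rule and the function names (`tokenize`, `build_vocab`, `next_token_pairs`) are illustrative assumptions, not the CrystaLLM implementation.

```python
import re

# Made-up CIF-like fragment for illustration; not taken from the training set.
CIF_TEXT = "data_NaCl\n_cell_length_a 5.64\nNa 0.0 0.0 0.0\nCl 0.5 0.5 0.5\n"


def tokenize(text):
    """Split text into words/tags, single digits, punctuation and newlines."""
    # Assumed splitting rule: spaces are dropped, newlines are kept as tokens.
    return re.findall(r"[A-Za-z_]+|\d|\n|[^\sA-Za-z_\d]", text)


def build_vocab(tokens):
    """Assign each distinct token an integer id."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def next_token_pairs(ids, context_size=4):
    """Pair each token id with the (up to context_size) ids that precede it."""
    return [(ids[max(0, i - context_size):i], ids[i]) for i in range(1, len(ids))]


tokens = tokenize(CIF_TEXT)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
pairs = next_token_pairs(ids)

print(tokens[:6])
print(pairs[:3])
```

Each pair is one training example: the model sees the context and is corrected according to how poorly it predicted the token that actually follows.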

‘Over the course of training, the model gets better at predicting which token comes next,’ Antunes tells Chemistry World. ‘As a consequence, the model develops the ability to generate full CIFs when prompted with just a few tokens representing a chemical formula of interest.’
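The generation loop Antunes describes can be sketched in a few lines of Python. The example below is a toy stand-in: instead of a neural network, it uses a lookup table of which token followed each prefix in two made-up, heavily abbreviated files. The names `fit_prefix_counts` and `generate` and the `<end>` marker are assumptions for illustration only.

```python
from collections import Counter, defaultdict

END = "<end>"  # hypothetical end-of-file marker

# Made-up, heavily abbreviated "files"; values are illustrative only.
TRAINING_FILES = [
    ["data_NaCl", "_cell_length_a", "5.64", "_space_group", "Fm-3m", END],
    ["data_KCl", "_cell_length_a", "6.29", "_space_group", "Fm-3m", END],
]


def fit_prefix_counts(sequences):
    """Count which token followed each full prefix (a stand-in for training)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(1, len(seq)):
            counts[tuple(seq[:i])][seq[i]] += 1
    return counts


def generate(counts, prompt, max_new_tokens=20):
    """Repeatedly append the most likely next token until END or no prediction."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(tuple(tokens))
        if not followers:
            break  # this prefix was never seen, so the toy model has no guess
        next_token = followers.most_common(1)[0][0]
        tokens.append(next_token)
        if next_token == END:
            break
    return tokens


counts = fit_prefix_counts(TRAINING_FILES)
print(generate(counts, ["data_KCl"]))
```

Unlike this lookup table, which can only repeat what it has seen, a trained LLM generalises from its parameters and can propose files for formulas that were not in its training data.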

Antunes devised two CrystaLLM systems: a small one with 25 million parameters and a large one with 200 million parameters. Training the small model took about three days on a single graphics processing unit (GPU), while training the large model took eight days.

Both CrystaLLM versions perform similarly to DiffCSP, an existing AI tool for inorganic structure prediction that was also first published as a preprint in July 2023. DiffCSP is based on diffusion models, which are closer to the image-generating AI DALL-E than to GPT. Both correctly predicted crystal structures of almost 19,000 perovskite materials after 20 tries. Likewise, both predicted around a third of a more complex set of 40,476 inorganic materials correctly after 20 tries.
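Scoring a model ‘after 20 tries’ means a material counts as solved if any of its first 20 generated candidates matches the known structure. A minimal Python sketch of that tally follows. The function name `match_rate_within_n`, the toy data and the use of simple label equality as the matching test are assumptions; real benchmarks compare lattices and atomic positions with a dedicated structure matcher.

```python
def match_rate_within_n(results, is_match, n=20):
    """Fraction of materials with at least one match among the first n candidates.

    results: list of (reference, candidates) pairs, one per material.
    is_match: function deciding whether a candidate reproduces the reference.
    """
    if not results:
        raise ValueError("results must contain at least one material")
    hits = 0
    for reference, candidates in results:
        if any(is_match(reference, candidate) for candidate in candidates[:n]):
            hits += 1
    return hits / len(results)


def same_label(reference, candidate):
    """Toy matching test: structures are plain labels compared for equality."""
    return reference == candidate


# Toy data: each material has a reference label and an ordered list of guesses.
toy_results = [
    ("A", ["B", "A"]),  # matched on the second try
    ("C", ["D", "E"]),  # never matched
    ("F", ["F"]),       # matched on the first try
]

print(match_rate_within_n(toy_results, same_label, n=20))
print(match_rate_within_n(toy_results, same_label, n=1))
```

Allowing more tries can only raise the score, which is why the number of tries is always quoted alongside the success rate.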

Antunes says that CrystaLLM, DiffCSP and another diffusion model, CDVAE, can all run on one GPU. The small CrystaLLM model is very fast and can run ‘on a good laptop’, he explains, although DiffCSP and CDVAE ‘can be a little faster’.

Grau-Crespo warns that ‘while the CrystaLLM model can generate sensible structures, this does not by itself make it suitable, as is, for crystal structure prediction’. It requires further training to be used in specific tasks, which is known as fine-tuning. However, he notes researchers are already working on this task.

Chemists use LLMs in many ways, notes Aron Walsh from Imperial College London, UK, but generating new crystallographic information files wasn’t an obvious application. ‘CrystaLLM is a breath of fresh air in crystal structure prediction,’ he says. ‘While such an approach is not guaranteed to give the best solution, they can provide interesting and plausible structures very quickly.’ Structure searches that would previously have taken months can now be started in seconds, Walsh adds.

This team is far from the only one working on adapting LLMs to crystal structure generation, Grau-Crespo continues. ‘A few competing models have been now announced, including both LLMs and diffusion-based models,’ he says. ‘A lot of the work in this area is only available as preprints, as the peer-reviewing system is having a hard time catching up with the speed of the field.’