Chemical structures will be harvested from graduate work and analysed for promising drug and materials leads

The new project aims to ensure that organic chemistry theses don't lie forgotten at the back of the cupboard

The feeling that the vast majority of information contained within your painstakingly prepared thesis is destined to be forgotten will be familiar to many a PhD student. But this may be about to change for synthetic chemists, thanks to a team of researchers who have amassed a digital collection of more than 75,000 compounds from PhD theses that might otherwise have mouldered in obscurity.

The effort is part of a pilot project, funded by the Royal Society of Chemistry (RSC), to build up a national compound collection. In its current form the collection exists in silico, with structures stored on the RSC’s chemical database ChemSpider. The collection is not yet accessible, but will be made into an open resource under a Creative Commons licence once the evaluation of the pilot is complete.

‘This should be a resource that is available to anybody whose work depends on using molecules,’ says Tim Gallagher, a chemistry professor at the University of Bristol, UK, who is one of the project’s coordinators. He explains the project grew from a much smaller initiative run within the university to try and ‘get the molecule makers talking to the molecule users’ by gathering data from synthetic chemistry PhD theses.

‘[Theses are] published documents, so they’re in the public domain already and they’re largely IP-free,’ says Gallagher. He says that past efforts to set up a physical collection of compounds that could be used as a resource have run into trouble because of concerns with intellectual property, as well as the huge amount of work that would be involved in synthesising and curating samples. Keeping the collection virtual and using data from published theses gets around a lot of these problems. ‘Here’s a way that we can actually demonstrate the value of [publicly funded] research,’ he says. ‘All this stuff has been sitting on shelves gathering dust basically – why not put it to good use?’

Collecting structures

Gathering the data required collaboration with 15 UK university chemistry departments, who donated 750 synthetic chemistry PhD theses. A group of 11 ‘data collectors’, led by University of Bristol chemist Laura Broad, went through each thesis, manually extracting and re-drawing new chemical structures, before entering them into ChemSpider. The process took just over four months, running from February to June last year.

‘During this period they managed to collect 45,000 compounds,’ says David Andrews, an industry associate who is leading the RSC’s side of the project. Using separate entries for different enantiomers and diastereomers, he explains, brought the total to 75,000 entries, 70% of which were completely new to ChemSpider.

Just as important as collecting new structures was evaluating the potential ‘usefulness’ of the collection, so the project also included a virtual screening phase. ‘You can take a virtual database and screen that against interesting and important proteins, so you can effectively see if your molecules bind to [them],’ says Gallagher.

The collection was screened against 32 protein binding sites donated by partners across academia and industry. Several ‘hits’ against protein targets were identified, which are now being made available to the ‘owners’ of those proteins. In this case the team chose to focus on the biological activity of the molecules, but Gallagher explains they could also be screened for favourable properties in other areas such as novel materials.

The team also collaborated with chemical informatics firm NQuiX to assess chemical ‘diversity’ – the uniqueness of structures within the collection – which was found to be relatively high when compared with other collections of compounds, with around 2000 highly novel structures identified. Again this suggests the collection has the potential to be a useful resource.

Looking ahead

Now that the pilot has been completed, the team are keen to continue the project and expand the collection. There is even the potential to start a physical collection of compounds to accompany the database, but this route remains fraught with issues.

‘You’ve got to find a place to store them all, to curate them, to put them into a format that people can use to distribute them, so there are all sorts of downstream issues,’ says Gallagher. Regardless of whether this can ever go ahead, he adds, maintaining a virtual database will still have value. ‘Any physical collection would still need to exist in silico. But the in silico [collection] doesn’t require the physical counterpart, at least not immediately. They could go on in parallel.’

But not everybody is convinced that expanding the collection would be worthwhile. ‘It’s a great idea bringing all this together but how does it really work?’ says Paul Wyatt, a chemist at the University of Dundee, UK, who was not involved in the project. ‘People spend a lot of time collecting this information – what are you going to do with it, what’s it adding? Is it the best way to spend the money?’

But Gallagher thinks that expanding the in silico collection is a valuable opportunity that should not be missed. ‘It’s true that screening programmes carried out by pharma require physical samples,’ he says, ‘but our ability to predict using computers is only going to get better, so I don’t think you’ll ever diminish the value of the in silico resource.’ In future, he adds, PhD students could be encouraged to enter their own structural data into the collection as they write up their theses. ‘Then you’re constantly adding to the collection.’

‘The pilot was a proof-of-concept, and the outcome of the screening was positive. We have confidence that we can take this work forward,’ says Andrews. The group have produced a report on the pilot which will be published this year. Then they can begin planning the next stages.

‘We’ve barely scratched the surface of the number of theses that are sitting out there,’ says Gallagher. ‘This is a huge opportunity to demonstrate the impact of UK chemistry.’