Is DNA the future of digital data storage?

No comments

DNA is being explored as a long-term solution to preserving digital information for future generations. Nina Notman reports

A portrait of Rosalind Franklin with a hidden twist hangs on the wall of the Bill and Melinda Gates Center for Computer Science and Engineering at the University of Washington in Seattle, US.

The portrait is five years old and is a black acrylic ink painting of Franklin over a collage of nearly 2000 photographs. These images are all snapshots of precious memories submitted by the public to Luis Ceze, a professor of computer science and engineering. But the real surprise lies in the medium used to paint Franklin. The acrylic ink contains synthetic DNA encoded with all the digital information needed to reproduce each photograph in the collage. ‘The pictures were encoded in DNA which was subsequently applied as ink to create a portrait of Rosalind Franklin, one of the pioneers in DNA research,’ says Karin Strauss, senior principal research manager at Microsoft Research in Washington, US.

A portrait of Rosalind Franklin made up of a collage of modern photos

Source: © Kate Thompson/Dennis Wise/University of Washington

Each picture in this composite portrait of Rosalind Franklin is also encoded into the ink using DNA

The idea of storing digital information in the pattern of adenines (As), thymines (Ts), cytosines (Cs) and guanines (Gs) in synthetic DNA has been floating around for decades. It offers a more compact and long-lasting alternative to binary code (the strings of zeroes and ones) used in traditional computing. The last dozen or so years has seen a flurry of robust examples of the storage of DNA data in action. Other demonstration projects include storing Shakespeare’s 154 sonnets, part of an audio file of Martin Luther King’s 1963 ‘I have a dream’ speech and the first episode of the Netflix series Biohackers.

‘This thought of storing digital data in DNA is not a fundamentally new concept, but it is becoming more and more viable,’ says Ceze. A major step forward was the formation of the DNA Data Storage Alliance in 2020. This large industry and academic collaboration is enabling the development of an interoperable storage ecosystem – where the technologies for each stage of data storage and collection process will be compatible. This will avoid a repeat of the videotape format war that saw the incompatible Betamax and VHS system go head-to-head in the late 1970s and 1980s.

A data solution

The logistics of DNA data storage and retrieval varies between demonstration projects but the main steps are the same. First, the data is encoded into a pattern of nucleotide bases, just as it is currently encoded into zeroes and ones. Multiple copies of DNA strands with this pattern of bases are then synthesised in the lab. Next, the DNA is stored for a period of time. To retrieve the information, the pattern of bases in the DNA is read using sequencing technologies (that were originally developed for genomic and medical research purposes). The DNA can be recovered from Ceze’s portrait of Franklin’s, for example, by scraping some paint off.

Storing information in DNA offers many advantages over current methods, with longevity being the most important. Information stored by our ancestors provides us with a window into their world. It is still possible to see cave paintings from prehistoric times, view hieroglyphs carved into rocks, and read 11th century books. In contrast, today’s storage mediums are not designed to last. The materials degrade fairly quickly and playback technology is rapidly outdated, making it difficult to retrieve data more than a decade or so old. How many of use still have the means to access data stored on vinyl records, cassette tapes, videotapes, floppy disks or zip disks? Most new laptops don’t even have a CD or DVD drive anymore. ‘Museums and big data storage companies are aware that there is this problem that we don’t really know how to store the data for a long time,’ says Robert Grass, a professor of functional materials at ETH Zurich in Switzerland.

A scheme showing how digital data is encoded into DNA

Source: © Robert Grass, ETH Zurich

The basic steps of encoding and decoding data using DNA are broadly the same for the various techniques

DNA is what nature has evolved to use to store the information that all living organisms need to grow, reproduce and function. Stored correctly, it will last many thousands of years. During the past few years, scientists have read DNA from one-million-year-old mammoth teeth and from found evidence for horseshoe crabs and mastodons in two-million-year-old environmental DNA. ‘DNA lasts a very, very long time, especially if it’s stored without oxygen and water and in the dark,’ says Emily Leproust, chief executive officer at Twist Bioscience in San Francisco, US.

It is also extremely likely that future generations will retain the ability to read DNA. ‘DNA is so important to human health that you’ll always be able to read DNA. In 100 years, maybe we won’t use Illumina or PacBio anymore, it will be different sequencing technologies, but we’ll always be able to sequence it,’ Leproust says.

Then there is the issue of density and energy consumption. Data stored ‘in the cloud’ is actually held in huge data centres scattered across the globe. The data centres on the Cardiff Data Center Campus, for instance, have a footprint of about 140,000m² and consume 270MW of power – enough to support a small town. By contrast, ‘DNA is extremely dense,’ says Leproust, ‘you can put dozens of data centres in the size of a sugar cube.’ It requires no energy to store it and will last for thousands of years in a sealed container, as long as it is desiccated and kept in a reasonably cool location.

A time capsule

Before DNA data storage can become mainstream, two major hurdles must be overcome: the cost of DNA synthesis and sequencing needs to come down and the speed needs to go up. Significant effort is ongoing towards achieving these goals. It is, however, unlikely that DNA data storage will ever be cheap or fast enough to replace electronic data storage wholesale. Instead, it is expected to fill niche gaps in the market for purposes such as archiving data that needs to be stored for a longer period of time without needing to be read very often. Data in this category includes culturally important data, legal documents and vital governmental information. ‘I’ve been talking with the National Archives here in the UK and with the British Library,’ says Thomas Heinis, a reader in computing at Imperial College London, UK.

Because we use less regents, it’s cheaper

Phosphoramidite chemistry is the predominant approach used today to synthesise DNA in the lab. Synthetic DNA is stitched together one nucleotide at a time by forming covalent bonds between the 3′ phosphite ester groups and the 5′ hydroxyl groups on adjacent deoxyribose sugar units. Because the nucleotides are added one at a time and each addition requires protection and deprotection steps, building synthetic DNA is a laborious and expensive process.

Miniaturising DNA synthesis is one way to reduce costs. DNA is typically made in 96-well plates with one piece of DNA made in each well. Twist Bioscience has developed an ink-jet printer-based platform that builds 1 million pieces of DNA simultaneously. It uses silicon chips (of the type used by the semiconductor industry) that are micropatterned with tiny wells in which the chemistry takes place. This platform ‘uses 99.8% less chemicals’ for each piece of DNA built, says Leproust. ‘Because we use less regents, it’s cheaper,’ she adds. Twist Bioscience’s technology is already used to make bespoke synthetic DNA strands for vaccine development, drug discovery, diagnostic design and other biotech applications. The early access of its services for DNA data storage is planned for 2025, Leproust says.

An error-filled future

Another approach being used to cut synthesis costs (and also speed up sequencing) is the use of error correction codes. These extra pieces of DNA correct for any errors so that the information can still be read. Data strings in electronic storage technologies also contain redundant code, which can be used to correct any errors if something goes wrong. Being able to correct errors in the data read out at the end of the process opens the door to less accurate but cheaper and faster synthesis and sequencing tools being used. ‘We can do things at the encoding level, in the DNA information itself, to handle errors,’ explains Jeff Nivala, an assistant professor of computer science and engineering at the University of Washington. ‘I can then deal with a very high error rate with my [synthesis or] sequencing device as it’s easy for me to correct for that.’ Error correction codes can also handle error introduced during storage. CDs with light scratches on their surface can still be played, for example, thanks to error correction codes.

Encoding scheme for storing 100 kB of music data in 16,383 DNA sequences

Source: © Antkowiak et al. Nat. Commun. 2020

A variety of techniques can be used to correct for potential errors

Lower-accuracy DNA synthesis methods being explored for data storage include massively parallel light-directed synthesis with added error correction codes. Grass and collaborators including Mark Somoza, a professor of chemistry at the University of Vienna in Austria, are pioneering this approach. ‘We can synthesise roughly 2 million sequences in parallel,’ Somoza explains. The process removes the protecting groups on the 5′ hydroxyl using UV light in a flow cell system into which the necessary reagents for each step are added cyclically. ‘There is [normally] an acid label protecting group on the 5’ site and we replaced that with a photolabel group,’ says Somoza. During the deprotection, an array of micromirrors is used to direct UV light very precisely onto DNA’s surface. The rest of the chemistry used is very similar to traditional DNA synthesis approaches. Light-directed synthesis is significantly cheaper and faster than conventional DNA synthesis. Using this approach, the team has demonstrated flawless data recovery of a file containing sheet music by Mozart.

AI to the rescue

Olgica Milenkovic, a professor of data processing at the University of Illinois at Urbana–Champaign in the US is exploring another approach to handling errors: artificial intelligence (AI). ‘Synthetic DNA is so expensive [that] using error correction coding may add a large overhead,’ she explains. ‘We use the collection of [already developed] techniques for machine learning and artificial intelligence to make the images encoded in DNA look better in the presence of errors, rather than to try to fix the errors.’ This approach isn’t suitable for data that needs to be highly accurate, but it does work well for images where AI tools already exist to ‘fix’ damages to old photographs so that they are no longer visible to the naked eye.

Phosphoramidite chemistry is very dirty, very toxic, very hard and expensive

Milenkovic has also developed a different writing approach with the information stored in the DNA in two ways. The image information is placed in nucleotide patterns in the synthetic DNA using traditional DNA synthesis techniques. Copyright details and watermarks are then added as a pattern of nicks in the DNA backbone. These are a binary code and made by nicking enzymes. ‘If you have a nick it means it’s a one, if you have no nick it means it’s a zero,’ Milenkovic explains. Using two layers of coding means more information can be stored in the same space. Milenkovic and her group have used their approach to store and reproduce eight Marlon Brando movie stills.

Enzymatic synthesis is also being explored for building synthetic DNA. This technology is less well developed than phosphoramidite chemistry but has the potential to provide a faster and cheaper alternative that doesn’t require toxic chemicals. ‘Phosphoramidite chemistry is very dirty, very toxic, very hard and expensive,’ says Ceze. ‘We need to truly home in on enzymatic synthesis [to make it] controllable and high throughput as well.’ Kern Systems, a spin-out from George Church’s Harvard lab, and France-based DNA Script are among the companies that are advancing enzymatic synthesis of DNA for data storage.

Speed reading

Sequencing by synthesis methods such as Illumina’s sequencing platforms are the current gold standard for reading back data stored in DNA. Nanopore sequencing is gaining traction due to its ability to sequence single molecules of DNA without the need for amplification. These devices use molecular motors to ratchet DNA strands through a pore in a polymer membrane that contains a detector. The ions in the surrounding solution flow through the pore generating an electric current and as each base passes through the pore it creates a different (measurable) distortion of this current. The technology is ‘a really great way to [achieve] highly accurate sequencing’, explains Nivala.

When we do a DNA data storage experiment at the moment, we need lots of PhD students pipetting stuff

Neither sequencing by synthesis or today’s nanopore sequencing devices are fast enough for DNA data storage applications. Oxford Nanopore’s commercial devices, for example, have a maximum speed of 400 bases per second. Work to develop faster and cheaper nanopore sequencers is ongoing in academic and commercial labs around the globe. ‘If we can do away with these molecular motors and use electrophoretic power or the voltage power within the device itself, you can push these DNA strands through these nanopores orders of magnitude faster,’ Nivala adds. Compromising sequencing accuracy will also allow the price per gigabyte of data retrieved to go down. For context, Oxford Nanopore’s disposable dongle-sized capsules (the Flongle) cost $90 (£71) each and can sequence up to 2.6Gb of data in 16 hours. This is about the amount of data needed to store a film like Star Wars: The Last Jedi in standard definition.

Automation of the entire storage process is another area of focus. The synthesis and sequencing technologies are now mostly automated, but the intermediate steps are still largely manual. ‘When we do a DNA data storage experiment, I have lots of PhD students moving around the lab pipetting stuff,’ Grass explains.

It will be necessary to fully automate the write-to-store-to-read cycle for DNA data storage to become mainstream and be viable for applications beyond archival data that is rarely accessed, wrote Strauss and Ceze in an article in 2019 that outlined an automated end-to-end example of DNA data storage. Their benchtop-size set up first converts the data (with additional error correction code) from zeroes and ones into As, Ts, Cs and Gs. These bases are then pumped in order onto a column where they are stitched together using phosphoramidite chemistry. Once the strands are complete, they are washed off the solid supports on the column and into a storage bottle. To retrieve the data, the liquid is pumped into an Oxford Nanopore’s MinION device where the DNA is sequenced. Finally, this code of As, Ts, Cs and Gs is decoded back into zeroes and ones. In their first demonstration, the scientists sent the word ‘hello’ in 21 hours. ‘Although we have demonstrated that it is possible to fully automate an end-to-end DNA data storage system at low cost, that system was not high throughput,’ says Strauss. Work is ongoing to scale up and speed up this automated approach.

There is clearly still much work to be done before DNA data storage becomes mainstream. Those working in the field believe that it’s only a matter of time before some of our massive energy hungry data centres will start to be replaced by tiny capsules of DNA that our ancestors in thousands of years’ time will be able to access. And that poses a question: what message would you most like to leave those that follow in your footsteps?

Nina Notman is a science writer based in Salisbury, UK