Today sees the publication in Nature of “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA”, a paper spearheaded by my colleague Nick Goldman and in which I played a major part, in particular in the germination of the idea.
This is one of the moments in science that I love: an idea over a beer that ends up as a letter to Nature.
Preserving the scientific record
About three years ago, I was spending a lot of time working out how the EBI would cope with the onslaught of DNA data, which has been growing at exponential rates. Worryingly, that rate has been much greater than the (also exponential) growth of disk space. Without a solution to this problem, our DNA archive could become unsustainable.
An immediately practicable solution we developed was efficient, data-specific compression – something I’ve blogged about extensively. But in the long run, more dramatic measures might be needed to sustain the life science archives. Which got us thinking…
Where would science be without a pub?
At the end of a particularly long day, Nick and I were having a beer and talking about the need for dense, low-cost storage. We joked that of course the densest, most stable way to store digital information was in DNA, so we should just store sequenced DNA information in … DNA. We realised that this would only work if the method could handle arbitrary digital information – but was it possible? Still in a playful frame of mind, we got another drink and started to sketch out a solution.
There are two challenges when trying to use DNA as a general digital medium. First, you can only synthesise small fragments of DNA. We realised that making a long “idealised” string of DNA from fragments is just a question of getting the assembly right, and because we design the fragments ourselves it doesn’t even have to be an assembly “problem”: each fragment can carry an index (encoded as bases) that says exactly where it sits relative to the surrounding fragments.
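The indexing idea can be sketched in a few lines of Python. This is an illustrative toy, not the paper’s actual layout (the real design encodes the index in bases within each fragment): split the data into chunks, tag each chunk with its position, and reassembly works no matter what order the fragments come back in.

```python
import random

def fragment(data: bytes, chunk_size: int = 4):
    """Split data into chunks, each carrying its position index."""
    n_chunks = -(-len(data) // chunk_size)  # ceiling division
    return [(i, data[i * chunk_size:(i + 1) * chunk_size])
            for i in range(n_chunks)]

def reassemble(fragments):
    """Sort by index and concatenate: arrival order is irrelevant."""
    return b"".join(chunk for _, chunk in sorted(fragments))

frags = fragment(b"hello, DNA storage")
random.shuffle(frags)                 # simulate unordered sequencing reads
assert reassemble(frags) == b"hello, DNA storage"
```

The point is that sequencing returns fragments in arbitrary order and quantity, so the index turns “assembly” into a simple sort.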
The second challenge is that both writing (synthesis) and reading (sequencing) DNA are prone to errors, particularly when there are long stretches of repeating letters. Creating an error-tolerant design is absolutely essential. There are a lot of error-correcting codes available in signal-processing technologies, but we needed one that could handle common DNA errors (homopolymers cause a real headache for both synthesis and sequencing).
We realised that we could fairly easily create a codec (jargon for “coder-decoder”) that guarantees the elimination of homopolymers. We were also aware that synthesis (writing) errors were going to be far more damaging than sequencing (reading) errors, as a writing error is likely to affect a large proportion of the molecules of a particular design.
So the codec we developed involves translating everything into base 3, with a transition rule that turns each base-3 digit into a DNA letter different from the one before it – so homopolymers simply cannot occur. Each base of the message also appears in four different synthesised fragment designs, staggered in a tile-path fashion, and we would be generating millions of molecules per design. As an extra precaution against errors, we made sure that alternate fragments ran in opposite directions (strands), because when things go wrong it’s often in a strand-specific manner.
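The homopolymer-avoiding transition rule is the neatest part, and a minimal sketch (not the paper’s full codec – the real scheme also handles the base-3 translation of bytes, indexing and redundancy) makes it concrete: each base-3 digit (“trit”) is written as one of the three bases that differ from the previous base, so no letter can ever repeat. The starting base “A” here is an arbitrary convention of the sketch.

```python
# For each previous base, the three bases we may write next (never itself).
NEXT = {p: [b for b in "ACGT" if b != p] for p in "ACGT"}

def trits_to_dna(trits, prev="A"):
    """Encode base-3 digits as DNA with no repeated letters."""
    out = []
    for t in trits:
        prev = NEXT[prev][t]     # pick one of the 3 bases != previous base
        out.append(prev)
    return "".join(out)

def dna_to_trits(dna, prev="A"):
    """Invert the encoding: recover each trit from the base transition."""
    out = []
    for b in dna:
        out.append(NEXT[prev].index(b))
        prev = b
    return out

trits = [0, 2, 1, 1, 0, 2, 2]
dna = trits_to_dna(trits)
assert dna_to_trits(dna) == trits                       # round trip works
assert all(a != b for a, b in zip(dna, dna[1:]))        # no homopolymers
```

Because the alphabet has four letters and each step only needs three choices, the no-repeat constraint costs nothing in base 3 – which is why the codec works in trits rather than bits.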
The next round
Another beer. A bit more serious now, we were determined to be very “belt and braces” about our new code. We called for more napkins and a new pen and tried to see how far we could push the idea. “Why not do it for real?” one of us asked. “Because it’s too expensive,” the other replied, naturally.
So in the bright light of day, we looked for an efficient (read: cheap) synthesis mechanism, and managed to talk to the research group at Agilent in California, headed up by Emily Leproust. Excited by the idea, Emily’s group agreed to let us use their latest synthesis machine for this project and asked us to send them some stuff to store.
What to pick? We wanted to show off the fact that our codec could be used on anything at all, so we picked the following items to send over to Agilent:
- a .jpg file of a photograph of the EBI
- an .mp3 file of a portion of Martin Luther King’s speech, “I have a dream”
- an .ascii file of all 154 of Shakespeare’s sonnets
- a .pdf file of the Watson and Crick paper describing the structure of DNA
- a self-referential pseudo-code of the codec used for the DNA encoding.
Nick made the DNA designs. To double check, he simulated reads and tested to make sure he could recover all the files (all went fine). Then he checked again. Then he ftp’ed it all to Agilent.
Data in a speck of dust
About a month later a small box arrived at the EBI with six tubes. Nick – being a mathematician – had to be persuaded there was actually stuff in the tubes (DNA is very, very small). I assured him that the little speck in the tube must be bona fide DNA. He remained sceptical.
We brought in some experimental colleagues, including Paul Bertone, who helped sequence the speck. We even have a picture of Nick in a lab coat, pipetting (shocking, believe me). We did manage to recover the actual DNA sequence for all six files: five with absolutely no trouble at all, and the last with one “gotcha”. We didn’t fully think everything through (despite all our pains) and a tiny amount of data was missing – but we were able to recover the entire file with some detective work (check out the supplementary information).
We had done it! We had encoded arbitrary digital information in DNA, manufactured it, and read it back. But we had to wonder whether our result was actually useful. DNA synthesis at this scale is still more of a research endeavour: volumes are going up but the price is still very high (certainly much higher than hard disk or tape).
If you can read wooly mammoth DNA…
DNA has a wonderful property: it can be stored stably without electricity, and needs no management beyond keeping it cold and dark. It is remarkably dense, even with the rather insane over-production of molecules (we calculated that we could easily have gotten away with using a tenth of the DNA). Even with all the design redundancy, we calculated that one gram of DNA would (easily) store one petabyte of information.
We wrote a letter to Nature describing our codec and exploring some of the salient characteristics of DNA as a storage medium. Our encoding scheme could in principle scale to a zettabyte of information, although at current prices this would be prohibitively expensive. More interestingly, because it costs almost nothing to maintain DNA storage (beyond the physical security of a dark, dry and cold facility like the Global Seed Vault in Svalbard), at some point DNA storage becomes the cheaper option.
But can you afford it?
Using tape storage for comparison, we estimate that it is currently cheaper to store digital information as DNA only if you plan to keep it for a very long time (in our model, between 600 and 5000 years). If the cost of DNA synthesis falls by two orders of magnitude (which is roughly what happened over the past decade), it will become a sensible way to store data for the medium term (under 50 years). Which is not so mad.
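The shape of that comparison is easy to see with a toy model (the numbers below are purely hypothetical, not the figures from our paper): tape is cheap to write but carries a recurring yearly cost for media refresh, migration and power, while DNA has a large one-off synthesis cost and near-zero upkeep. The break-even horizon is just the extra write cost divided by the yearly saving.

```python
def breakeven_years(dna_write, tape_write, tape_per_year):
    """Years of storage after which total DNA cost drops below total
    tape cost, assuming DNA upkeep is effectively free."""
    return (dna_write - tape_write) / tape_per_year

# Hypothetical costs for some fixed amount of data (arbitrary units):
horizon = breakeven_years(dna_write=1000.0, tape_write=10.0, tape_per_year=1.0)
# Plan to keep the data longer than `horizon` years? DNA wins.
```

Cutting the synthesis cost (`dna_write`) by a factor of 100 shrinks the break-even horizon by roughly the same factor, which is why falling synthesis prices move DNA from millennium-scale archiving towards decade-scale.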
There are some codas to this story. Zeitgeist will be zeitgeist, and since that fateful first beer a DNA-based digital storage method has also been proposed by George Church and his colleagues. They used a similar indexing trick, but their method does not address error correction (indeed, they comment that most of their errors occurred in homopolymer runs). They submitted their paper around the same time as we did, and it was published as a Brevia in Science in 2012 (shucks – another one of those science moments).
The 10,000-year archive
Nick and I have one more thought experiment to play out. Could we build an archive that stored a substantial amount of information for a future civilisation to read?