The 10,000 year archive

The task: store a substantial amount of digital information for a future civilization to access.
DNA has a good chance of lasting 10 000 (or more) years, so long as it is kept cold, dark and dry. And of course, DNA is incredibly dense: at least 1 petabyte can be stored in 1 gram of DNA, and that includes a lot of built-in redundancy. It’s a very good information storage molecule, and Nature has been pretty clever in choosing it.
Ten thousand years takes us way back into human prehistory, before the earliest recorded writing appears on the scene (around 6000 BC to 3000 BC). If we could use DNA to create a digital record that’s good for 10 000 years, I’d say that should take care of most of our archiving needs. 
First, a patron
So imagine some billionaire – let’s call her Patricia – who wants her name to live on until the end of time. She is a regular on Sir Richard’s space shuttle, likes to fund new vaccines and lives in a House of the Future™. As her parting gift to the world, Patricia would like to capture much of the wisdom and culture, as well as the follies and foibles, of modern civilization and preserve them in perpetuity.
There are groups, such as the Long Now group, that think explicitly about this, so perhaps it is as simple as handing the technology over to one of them. But Patricia didn’t get her billions without reading the business plan, so let’s think through the details, just for fun (and, of course, for Pat).
The right stuff
We know we can’t just take a Nature paper, a tar-ball of Wikipedia and an Agilent synthesis machine and preserve all of history and contemporary culture in a minivan-sized piece of DNA for all time. We need to tackle quite a few practical issues first.
What to use for the storage? Glass? Metal? Ceramics? Or what? And should the DNA be double-stranded or single? Perhaps we should try to mimic the environments around known, recoverable ancient DNA? There is plenty of research available on synthesis and storage, but details that don’t matter on a timescale of a few years might make a huge difference over 10 000 years. It is a bit tricky to test something out on that kind of timescale; Pat won’t be having it.
We should probably work out how to simulate 5000 or 10 000 years of damage (e.g. from increasing cosmic rays), using the 10 000-year-old Cave Bear sequence as a comparator. (An aside: Cave Bears are extinct, so if we want an accurate view of what ancient DNA error profiles look like in detail, we’d need to sequence some of the current bear species in order to infer a good ancestral Cave Bear sequence.)
This is actually a well-explored area for scientists involved in ancient DNA. Although one could never be sure of the precise physical storage method, I’m pretty confident we’d have a “reasonably good” approach.
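To give a feel for what a damage simulation might look like, here is a minimal Python sketch. It models only one thing: cytosine deamination, a well-documented form of ancient DNA damage that shows up as C read as T. The function names, the single flat `rate` parameter and the fixed random seed are all assumptions made for illustration; a real error profile would have to be fitted to something like the Cave Bear comparison.

```python
import random


def simulate_deamination(dna, rate, seed=0):
    """Return a copy of `dna` where each C becomes T with probability `rate`.

    Hypothetical model: one flat rate for every position, nothing else damaged.
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated exactly
    return "".join(
        "T" if base == "C" and rng.random() < rate else base
        for base in dna
    )


def mismatch_rate(original, damaged):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(original) != len(damaged):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(original, damaged)) / len(original)


if __name__ == "__main__":
    fresh = "ACCGTTCAGC" * 10
    aged = simulate_deamination(fresh, rate=0.05)
    print(f"observed mismatch rate: {mismatch_rate(fresh, aged):.3f}")
```

Comparing the simulated mismatch rate with what is measured in real ancient samples would be one way to decide how much redundancy the encoding needs.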
Location, location, location
Where are we going to put the archive? How can we be sure that future civilizations will know that it’s there, if it gets buried? How can we be sure someone won’t just drill right through it, or bulldoze it for a new spacepad? (Nuclear waste storage researchers have been grappling with these questions for a while, so we should probably ask them for some input.)
If we can make one archive, we can probably make six. We could distribute them in a few places, same as we do with our data centres now. Let’s pick some cold, dark, dry mountain areas that aren’t going to melt and slide off into a pile of rubble, and which have no immediate geological plans to relocate. 
All we need is for one of these mountains to stand the test of time, and for a few lucky things to happen so the archive can be discovered. It should work; after all, one of the best-preserved bits of ancient DNA (the tip of the little finger from a Denisovan) was found in a cave in the Altai Mountains of Siberia.
Read the instruction manual – in Maths
Now for the fun part: designing the bootstrapping procedure. Someone 5000 or 10 000 years from now is going to unearth our precious archive, and deserves a reasonable chance of retrieving and understanding what’s in it.
We can’t just have the DNA minivan lined with instructions in English and Chinese – languages are too fluid for us to bank on their being intact thousands of years in the future. What’s more, there could well be a scenario in which all of history has been lost, and our explorer (let’s call him Joe-Ug) might not have the cultural references needed to deal with a set of ancient languages.
We’ll need to explain the DNA storage technique well enough that Joe-Ug and his friends can output a bitstream that describes the system in a universally comprehensible way. Later on, we’ll also need to provide the necessary cross-referencing language so that they can interpret the full archive. But first, we need to make a diagram to help them translate the DNA to a bitstream.
What’s DNA, anyway?
We will need to imprint the diagram on something very robust: a material that can remain intact for thousands of years. We should probably go with nickel, which the Long Now group has settled on as a good idea, or gold, as was used for the Pioneer plaques and Voyager records (it might be too expensive, though; I’m not sure how many billions Pat made). To make sure we get the process right we will need to consult materials scientists, which is a well-populated field.
For the diagram itself, we have to show the molecular structure of DNA – there is no guarantee that people will even know what that is anymore. We can use symbols alongside the relevant number of protons (the atomic number) – for example C6 for carbon and N7 for nitrogen. (Joe-Ug had better have clever friends. We should probably throw in a periodic table of the elements as well.)
The next part of the diagram could be an atomic diagram of the four DNA monomers mapped to the symbols. We’ll need to describe the codec, probably writing out a complete example in longhand: a binary message, the resulting theoretical complete DNA string, every fragment we synthesised, etc.
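To make the longhand example a bit more concrete, here is a toy Python sketch of the three things it would show: a binary message, the full DNA string, and the overlapping fragments. The two-bits-per-base mapping, the fragment length and the step size are all assumptions chosen for readability; this is not the codec the archive would really use, just the shape of the worked example.

```python
# Toy codec for illustration only: two bits per base.
# The mapping below is an assumption of this sketch, not the real scheme.
BASES = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T


def text_to_bits(text):
    """Turn ASCII text into a string of 0s and 1s, eight bits per character."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


def bits_to_dna(bits):
    """Map each pair of bits to one base."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


def dna_to_bits(dna):
    """Inverse of bits_to_dna."""
    return "".join(format(BASES.index(base), "02b") for base in dna)


def make_fragments(dna, length=4, step=2):
    """Cut the full string into overlapping (offset, fragment) pairs."""
    last = max(len(dna) - length, 0)
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:  # make sure the tail is always covered
        starts.append(last)
    return [(start, dna[start:start + length]) for start in starts]


def reassemble(fragments):
    """Rebuild the full string, checking that overlapping bases agree."""
    total = max(start + len(seq) for start, seq in fragments)
    out = [None] * total
    for start, seq in fragments:
        for i, base in enumerate(seq):
            if out[start + i] not in (None, base):
                raise ValueError(f"fragments disagree at position {start + i}")
            out[start + i] = base
    return "".join(out)


if __name__ == "__main__":
    bits = text_to_bits("Hi")
    dna = bits_to_dna(bits)
    print(bits)                 # 0100100001101001
    print(dna)                  # CAGACGGC
    print(make_fragments(dna))  # [(0, 'CAGA'), (2, 'GACG'), (4, 'CGGC')]
```

Because every base appears in more than one fragment, a reader who gets one fragment wrong has a second chance to notice the disagreement, which is the point of writing out every fragment on the panel.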
Testing, testing, one, two…
But we need to be careful here: if one detail that we have thought about is left out, a future scholar will be hosed. So it’s going to be important to provide lots of cross-referencing information, rather like the message sent in Contact, with considerable extra detail.
If we have enough space (and why not?) we can do the same. How about writing out the first 100 numbers in binary, and circling the prime numbers? If Joe-Ug and his friends have any mathematical nous, they’ll be able to see that they are on the right track. Internal checks on the chemical symbols might include the orbital structure of electrons, and perhaps also highlighting nickel (or whatever material we choose) in the periodic table. The more cross references we provide, the harder it will be to leave out a key concept.
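The binary-numbers-with-primes panel is easy to generate. A minimal Python sketch follows; using square brackets to stand in for the engraved circles is my own choice for a plain-text rendering.

```python
def is_prime(n):
    """Trial division; plenty fast for numbers up to 100."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def prime_table(limit=100):
    """One row per number from 1 to limit, in fixed-width binary.

    Primes are wrapped in [brackets] as a stand-in for circling them.
    """
    width = limit.bit_length()  # 7 binary digits are enough for 100
    rows = []
    for n in range(1, limit + 1):
        binary = format(n, f"0{width}b")
        rows.append(f"[{binary}]" if is_prime(n) else f" {binary} ")
    return rows


if __name__ == "__main__":
    print("\n".join(prime_table()))
```

The first few rows come out as ` 0000001 `, `[0000010]`, `[0000011]`, ` 0000100 `, `[0000101]`, which is exactly the kind of pattern a mathematically minded finder could check by hand.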
We would of course consult experts in pedagogy, but it seems reasonable to include a series of test DNA phials for future explorers to decode. These phials could contain multiple copies of the same message, and would be nestled securely in the engraving. Instructions would include the input message, the longer encoded message as DNA and the short fragments in longhand, such that every decoding step could be double-checked physically and mathematically.
I would include four or five test messages, some longer than others. 
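Having multiple copies of each test message pays off at decoding time, because disagreeing reads can be settled by a vote. Here is a small Python sketch of a per-position majority vote; it assumes the reads are already aligned and of equal length, which is a simplification, and the function name is mine.

```python
from collections import Counter


def consensus(reads):
    """Majority vote at each position across equal-length reads.

    Assumes the reads are already aligned. On a tie, the base seen
    first at that position wins.
    """
    if not reads:
        raise ValueError("need at least one read")
    if len({len(read) for read in reads}) != 1:
        raise ValueError("reads must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


if __name__ == "__main__":
    reads = ["CAGACGGC", "CAGATGGC", "CAGACGGC", "TAGACGGC"]
    print(consensus(reads))  # CAGACGGC
```

The finder can then compare the voted sequence against the longhand version engraved on the panel, which is the physical double check described above.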
Testing, testing…
It would be fun to test this. We could take a bunch of enthusiastic students who haven’t read the study in detail, give them the nickel diagrams (perhaps on paper for the test) and so forth, with non-current symbols substituted in all positions, and a few phials of the DNA.
We’d set up a competition between groups. They could suggest experiments (e.g., can we put the material in the phial into a mass spec machine?) and hopefully will ultimately work out that it’s DNA in there, and ask for it to be sequenced. Patricia would bestow the prize, of course.
Don’t forget the Rosetta Stone
So, that’s the first task of bootstrapping sorted. The second bit is more complex, as it involves creating a sort of Rosetta Stone for future civilizations that may lack any good knowledge of current languages or notation, though hopefully they will still have records of some ancient texts (Ancient Greek was the key to using the Rosetta Stone to understand Egyptian hieroglyphics). Following that model, we’d need to create the same message in multiple languages (the Rosetta Stone carried its text in three scripts, but there is no need to be limited by that). Note that we would also need to have the mapping from symbols to bits (the Unicode mapping) engraved somewhere.
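As a rough idea of what an engraved symbols-to-bits table could contain, here is a short Python sketch that lists each distinct symbol in a text with its Unicode code point and its bit pattern. Choosing UTF-8 as the byte encoding is my assumption; the plan above only says that a Unicode mapping is needed.

```python
def symbol_table(text):
    """List (symbol, code point, bits) for each distinct symbol, in order.

    Bits are the UTF-8 bytes of the symbol; UTF-8 is an assumption here.
    """
    rows = []
    for symbol in dict.fromkeys(text):  # keeps first-seen order, drops repeats
        bits = " ".join(format(byte, "08b") for byte in symbol.encode("utf-8"))
        rows.append((symbol, f"U+{ord(symbol):04X}", bits))
    return rows


if __name__ == "__main__":
    for symbol, code_point, bits in symbol_table("A\u03c0"):
        print(symbol, code_point, bits)
    # A U+0041 01000001
    # π U+03C0 11001111 10000000
```

Note that a multi-byte symbol such as π needs the UTF-8 packing rules explained as well, so a fixed-width encoding might turn out to be easier to teach from a metal panel.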
The archive could include, say, the UN’s Universal Declaration of Human Rights, or Magna Carta, or the first chapter of The Art of War in multiple languages, with careful mapping of the words. Since we’re not too worried about the size of the archive anymore, it shouldn’t be such a big deal to overlap several texts digitally.
I’m also going to vote for having some key messages re-encoded in modern languages every century or so, though I’d be surprised if this happened more than 10 times (and even that would be a stretch for modern civilisation) – hopefully Pat can set up some sort of endowment. If it worked, it would increase the chances that Joe-Ug’s clever friends will be able to find a starting point.
UNIX, naturally
Now to the structure of the archive itself: specifically, the layout of the files, directories and other formats. I think it’s safe to say that thousands of years from now people may need a description of the UNIX file system structure conventions, tar, and image and audio formats. Since one can, in theory, describe images and sounds mathematically without reference to a language, image and audio formats can be specialised aspects of the more generic bitstream process.
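One reason tar is a reasonable thing to describe on a panel is that its layout is simple: fixed 512-byte headers followed by the file contents. The Python sketch below builds a tiny archive in memory with the standard `tarfile` module and then reads the first header by hand, using only byte offsets from the ustar layout (name in bytes 0 to 99, size as octal text in bytes 124 to 135). The file name and contents are made up for the example.

```python
import io
import tarfile


def build_archive(files):
    """Pack {name: bytes} into an uncompressed ustar archive held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_first_header(archive):
    """Decode the first file's name and size without using tarfile."""
    header = archive[:512]
    name = header[0:100].rstrip(b"\0").decode("ascii")
    size = int(header[124:136].rstrip(b"\0 ").decode("ascii"), 8)  # octal text
    return name, size


if __name__ == "__main__":
    archive = build_archive({"readme.txt": b"hello future"})
    name, size = read_first_header(archive)
    print(name, size)                # readme.txt 12
    print(archive[512:512 + size])   # b'hello future'
```

Everything a future reader needs here is two offsets and the idea of octal digits, which is about the level of detail the bootstrap panels could realistically carry.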
Wrapping up
I hope Patricia would be satisfied with our plan, which can be described in three parts: DNA to bitstream, language bootstrap and final archive. 
I visualise this rather like the tombs of ancient Egypt, with strong symbolism that you need to resolve before moving from one room to the next. I imagine Joe-Ug, 45 000 years from now, stepping through these dusty rooms and being confronted with all this ancient symbolism on some painstakingly engraved metal. I see him bringing in his clever friend, who invites his clever friends, and then a decade or so of multinational (assuming they have nations) efforts to decode the panels. Then, the big breakthrough that takes them into the second room and, finally, the delight and scholarship of future generations reading texts – some funny, some sad, some ingenious – from the distant past.
Who knows? Maybe this blog post will make it into the archive (I’d better keep Pat happy) and it will make a future scholar smile.


9 thoughts on “The 10,000 year archive”

  1. Interesting post. I’ve heard of data stored on glass or on pottery, but it’s quite an interesting thought that data could be stored on “a speck of dust.” It’s really mind boggling stuff but is it even plausible to extract such huge amount of data from DNA, let alone store it for long periods? My, the possibilities that one could do with that. Really made my imagination and fascination run.

    Ruby Badcoe

  2. So DNA can essentially be used as the perfect medium for data storage. Interesting. Just a thought though… is it possible that there might already be retrievable data already stored in human DNA? Something that could be audio text or even video already there?

