Last week a positioning paper by Guy Cochrane, Chuck Cook and myself finally came out in GigaScience. Its premise is rather simple: we are all going to have to get used to lossy compression of DNA sequence, and as lossy compression is variable (you can set it at a variety of levels), we will have to reach a community consensus on how much one compresses different data. This is really part of our 2 year process of making efficient compression a practical reality for DNA sequencing, which I’ve blogged on before, numerous times.
I encourage you to read the paper, but in this blog I want to explore further the analogies between imaging and DNA sequencing – which are now numerous. I believe that at the core of biology (of all sorts) the majority of data gathering will be either optical or “nucleic” (the third machine type probably being mass spectroscopy, if you are interested). As a colleague once said about molecular biology – the game in town is now to get your method to output a set of DNA molecules. If you can do that – you can scale.
The first question to ask is why these two technologies are so dominant. The first reason is that one is fundamentally trying to capture information – information about distributions of photons or information about the make-up of molecules. One is not trying to do something large and physical. This means that the mechanisms for detection can be excruciatingly sensitive. In the case of imaging, single photon sensitivity is almost routine on modern instruments, and things like super-resolution imaging, which is a whole bunch of tricks to (in effect) convert multiple time-separated images of a static sample into better spatial resolution, go further still (I saw a deeply impressive talk by a postdoc in Jan Ellenberg’s group showing a remarkable resolution of the nuclear pore). But one does not need fancy tricks to make great use of imaging – rather mundane, cheap imaging is the mainstay of all sorts of molecular biology – Drosophila embryo development would be a different world without it. In the case of DNA, the ability to sequence at scale has been the focus for the last 20 years, and will probably remain so until a human genome is in the ~$1,000 or $100 zone – as expensive as a serious X-ray or MRI. But at the same time the other shift is towards more real time systems (the “latency” of the big sequencers is probably the biggest drag – 3 weeks is your best case on the old HiSeqs) and towards single molecule systems. People talk about real time as critical to the clinic, and certainly the difference between even 12 hours and 2 weeks is night and day (at 12 hours or less one can do the cycle within a 1 day stay, and start to impact in-time diagnosis), but faster cycle times will really change research as well. Going back to the information aspect of these two technologies: as one is only trying to get information out of these things, the physical limits of the technologies are remarkably far away.
Imaging is hitting some of these limits (though there is still plenty of space for innovation); 3rd generation DNA sequencers will get closer to some limits in DNA processing as well, but again, we have some way to go. The future is bright for both of these technologies.
The second similarity is just the mundane business of storing the output of these technologies – they are high density information streams, and therefore have a lot of inherent entropy. Some of that entropy one wants to utilise – that’s the whole point – but there is also quite a bit of extra “field” (in imaging terms) or “other bits of the genome” (in DNA terms) which one often knows is going to be there, but is less interested in. Imaging has long led the area of data-specific compression, using at first a variety of techniques to transform the data from a straightforward x,y layout of pixel intensities into forms which inherently capture the correlation between pixels, allowing for efficient lossless compression. But the real breakthroughs came with lossy compression: the understanding that, for a lot of the pixels, a transformation which lost some information for a large gain in compressibility was appropriate for many uses. Although the tendency is to think about lossy compression in terms of “visual” display-to-user uses, in fact many technical groups use a variety of lossy forms for their storage, mainly choosing the amount of loss appropriately (I’d be interested in experiences on this, and in particular whether people deliberately choose other lossy algorithms away from the JPEG family). Video compression has really taken lossy compression in new directions, with complex between-frame transformations followed by lossy steps, in particular adaptive modelling.
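To make the two ideas in this paragraph concrete, here is a minimal Python sketch, not taken from any real image codec. It assumes a synthetic 8-bit image (a smooth gradient with a little noise), uses a simple left-neighbour difference as the "transformation that captures correlation between pixels", and uses plain quantisation as the lossy step. All function names, the image itself and the quantisation step of 8 are illustrative assumptions; real formats such as JPEG use far more sophisticated transforms.

```python
import random
import zlib


def make_image(width=256, height=256, seed=0):
    """Synthetic 8-bit image: a smooth diagonal gradient plus small random noise."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, (x + y) // 2 + rng.randint(-3, 3))) for x in range(width)]
        for y in range(height)
    ]


def flatten(image):
    """Plain x,y layout: pixel intensities row by row."""
    return bytes(p for row in image for p in row)


def delta_rows(image):
    """Lossless transform: each pixel becomes its difference from the left neighbour (mod 256)."""
    out = bytearray()
    for row in image:
        prev = 0
        for p in row:
            out.append((p - prev) % 256)
            prev = p
    return bytes(out)


def undo_delta_rows(data, width):
    """Inverse of delta_rows, to show that no information was lost."""
    image = []
    for start in range(0, len(data), width):
        prev, row = 0, []
        for d in data[start:start + width]:
            prev = (prev + d) % 256
            row.append(prev)
        image.append(row)
    return image


def quantise(image, step):
    """Lossy step: snap each intensity down to a multiple of `step` (hypothetical choice)."""
    return [[(p // step) * step for p in row] for row in image]


def compare_sizes(step=8):
    """Compressed size in bytes of the same image under the three treatments."""
    image = make_image()
    return {
        "raw": len(zlib.compress(flatten(image), 9)),
        "delta": len(zlib.compress(delta_rows(image), 9)),
        "quantised+delta": len(zlib.compress(delta_rows(quantise(image, step)), 9)),
    }


if __name__ == "__main__":
    for name, size in compare_sizes().items():
        print(f"{name:>16}: {size} bytes")
```

The generic compressor (`zlib`) is the same in all three cases; only the representation fed to it changes. The quantised version can no longer be restored exactly, which is the trade being made.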
When we started in DNA compression many people critiqued it on the grounds that we “couldn’t beat established generic compression” or that certain compression forms were “already optimal”. This totally misses the point – generic and optimal compression schemes are only generic and optimal for a particular data model, and to be generic, that data model involves a byte-stream. One doesn’t hear people saying about video compression “oh well, that problem has been solved generically and there are optimal compression methods” – putting a set of raw TIFFs straight into a byte-based compressor would not do very well. The first key thing is a data transformation that makes the correlation in the data explicit for standard generic methods to compress (in the case of DNA, reference based alignment provides a sensible realisation of the redundancy between bases in a high coverage sample, and for a low coverage sample realises the redundancy with respect to previous samples). The second thing is the insight that not all the precision of the data is needed for interpretation. Interestingly, lossy compression makes you think about the problem as the inverse of the normal thought process – often you ask “what information am I looking for” for some biological process – SNPs, or structural variants. Lossy compression methods invert the problem to ask “what information are you pretty sure you don’t need”. For example, when you know your photon detector will generate some random noise in particular patterns, having a lossy compression remove that entropy is highly unlikely to affect downstream analysis. Similarly, when we can confidently recognise an isolated sequencing error, degrading the entropy of the quality score of the base is unlikely to change downstream analysis.
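The reference based transformation can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the method from the paper or any real format: the reference is a random string, the reads are simulated with a 1% substitution rate, the alignment position of every read is taken as already known, and the record layout (`position:offsetBase,offsetBase`) is made up for the example. The reference itself is assumed to be stored once and shared, so it is not counted in the sizes.

```python
import random
import zlib

BASES = "ACGT"


def simulate(ref_len=50_000, n_reads=2_000, read_len=100, error_rate=0.01, seed=1):
    """Random reference plus reads sampled from it with occasional substitutions."""
    rng = random.Random(seed)
    reference = "".join(rng.choice(BASES) for _ in range(ref_len))
    reads = []
    for _ in range(n_reads):
        pos = rng.randrange(ref_len - read_len + 1)
        bases = list(reference[pos:pos + read_len])
        for i, b in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([c for c in BASES if c != b])
        reads.append((pos, "".join(bases)))
    return reference, reads


def encode_read(reference, pos, read):
    """Store only where the read aligns and where it differs from the reference."""
    diffs = [f"{i}{b}" for i, b in enumerate(read) if reference[pos + i] != b]
    return f"{pos}:{','.join(diffs)}"


def decode_read(reference, record, read_len):
    """Rebuild the read exactly from the reference and its record."""
    pos_text, _, diff_text = record.partition(":")
    pos = int(pos_text)
    bases = list(reference[pos:pos + read_len])
    for diff in filter(None, diff_text.split(",")):
        bases[int(diff[:-1])] = diff[-1]
    return "".join(bases)


def compare_sizes():
    """Compressed size in bytes of raw reads versus reference based records."""
    reference, reads = simulate()
    raw = "\n".join(read for _, read in reads).encode()
    records = [encode_read(reference, pos, read) for pos, read in reads]
    encoded = "\n".join(records).encode()
    return {
        "raw reads + zlib": len(zlib.compress(raw, 9)),
        "reference based + zlib": len(zlib.compress(encoded, 9)),
    }


if __name__ == "__main__":
    for name, size in compare_sizes().items():
        print(f"{name:>24}: {size} bytes")
```

The transformation here is lossless for the bases (every read decodes back exactly); it simply hands the byte-based compressor a representation in which the redundancy is already explicit.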
I’ve enjoyed learning more about image compression, and I think we’ve only just started in DNA compression – at the moment we can get 2 to 4 fold compression compared to standard methods with a clearly acceptable lossy mode (acceptable because the machine manufacturers sort of know that they are generating a little too much precision in their quality scores). With more aggressive techniques we can already think about 50 to 100 fold compression – though this is getting quite lossy. And this is not the end of the road – I reckon we could be at 1,000 fold more compressed in the future.
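One simple way to picture "too much precision in the quality scores" is to bin them. The Python sketch below is an assumption-laden toy: the quality values are simulated, and the bin boundaries in `BINS` are hypothetical ones chosen for illustration, not those of any manufacturer or of our paper. It only shows that collapsing scores into a handful of levels lowers their entropy, so that a generic compressor produces a smaller output.

```python
import random
import zlib

# Hypothetical bins: (highest Phred score in the bin, representative score).
# These boundaries are illustrative only.
BINS = [(9, 6), (19, 15), (24, 22), (29, 27), (34, 33), (39, 37), (93, 40)]


def simulate_qualities(n_reads=2_000, read_len=100, seed=2):
    """Toy Phred scores that drift downwards along each read, with noise."""
    rng = random.Random(seed)
    return [
        [min(41, max(2, 38 - i // 10 + rng.randint(-4, 4))) for i in range(read_len)]
        for _ in range(n_reads)
    ]


def bin_quality(q):
    """Replace a score with the representative of the first bin that contains it."""
    for upper, representative in BINS:
        if q <= upper:
            return representative
    raise ValueError(f"quality score out of range: {q}")


def to_quality_line(scores):
    """Phred+33 ASCII encoding, as used in FASTQ quality lines."""
    return "".join(chr(q + 33) for q in scores)


def compare_sizes():
    """Compressed size in bytes of full-precision versus binned quality lines."""
    reads = simulate_qualities()
    original = "\n".join(to_quality_line(r) for r in reads).encode()
    binned = "\n".join(
        to_quality_line([bin_quality(q) for q in r]) for r in reads
    ).encode()
    return {
        "original + zlib": len(zlib.compress(original, 9)),
        "binned + zlib": len(zlib.compress(binned, 9)),
    }


if __name__ == "__main__":
    for name, size in compare_sizes().items():
        print(f"{name:>16}: {size} bytes")
```

How much is gained on real data, and whether a given binning is acceptable for a given analysis, depends on the instrument and the experiment; the sketch does not attempt to reproduce the fold changes quoted above.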
The third similarity is how informatics-intensive the processing is. Both for image analysis and DNA analysis there are some standard tools (segmentation, hull finding, texture characterisation in imaging; alignment, assembly, variant calling in DNA sequence analysis) but how these tools are deployed is very experiment specific. There is not some “generic image analysis pipeline” any more than there is a “generic DNA analysis pipeline”. One has to choose particular analysis routes, driven mainly by the experiment that was performed and then to some extent by the output you want to see. This means that the bioinformatician must have a good mastery of the techniques. I have to admit, although I live and breathe DNA analysis, often developing new tools, I am pretty naive about image analysis – not that that’s stopping me diving in with my students in using (but not developing…) image analysis. I think we’re not making image analysis enough of a mainstream skill set in bioinformatics, and this needs to change.
Finally, the cheapness and ubiquity of imaging has meant that from the start image-based techniques had to think carefully about which images one would store and at what compression. Clearly DNA sequencing is heading the same way, and this is the argument of the paper that Guy and myself put forward. As with imaging, the key question is the overall cost of replacing the experiment, not the details of how much the image itself cost. So – for a rare sample (such as a Neanderthal bone) it is very hard to repeat the experiment – you need to store that information at high fidelity. But a routine mouse ChIP-seq experiment is far more reproducible and one can be far more aggressive on compression. I actually think it has been to the detriment of biological imaging that there has not been a good reference archive – probably because of this problem of knowing which things are worth archiving, coupled with the awesome diversity of uses for imaging – but projects like EuroBioImaging I think will provide the first (in this case federated) archiving process.
Over the next decade, then, I see ‘imaging’ and ‘DNA sequencing’ converging more and more. Time to learn some image analysis basics (does anyone know a good book on the topic that is geeky and detailed but starts at the basics?).