The first step was developing a routine way to determine the order of the chemicals in the DNA polymer: sequencing. Fred Sanger, a gifted scientist and one of the very few people with two Nobel prizes in the same field under his belt, developed dideoxy sequencing (a.k.a. “Sanger sequencing”) at the LMB in the 1970s. His laboratory, along with neighbouring LMB labs including Sydney Brenner’s, produced a new generation of scientists: John Sulston, Bart Barrell, Roger Staden and Alan Coulson, who forged ahead towards the seemingly unobtainable goal of sequencing whole organisms – with human in their sights. First, they sequenced various bacteriophages (see my First Genome of Christmas). Then, in the 1980s, John Sulston and colleagues started on mapping and then sequencing the worm (see the Second Genome of Christmas).
Of course this was not just a UK effort; many US scientists were involved in genomics. A scientist and technology developer, Lee Hood, looked at how to remove the radioactivity that came with Sanger sequencing, and created fluorophore-based terminators. These were far safer and, importantly, amenable to automation. This led to the ABI company’s production of automated sequencers, which featured a scanning laser-based readout. Back in the UK, Alec Jeffreys made a serendipitous discovery: microsatellites – highly variable regions in the human genome that provided easy-to-determine genetic markers. This led to the rise of forensic DNA typing (first used in a criminal case near Alec’s native Leicester, to provide evidence in a double murder). A group of enterprising geneticists in France, led by Jean Weissenbach, used these microsatellites to generate the first genome-wide genetic map, based around Mormon families in Utah, who had kept impeccable family records. Clinician scientists were starting to use genetics actively: the first genetic diseases to be characterised molecularly were a set of haemoglobinopathies (blood disorders such as sickle cell anaemia). In these cases, the clinicians were lucky that it was easy to track the protein itself as a genetic marker. A landmark breakthrough, by Francis Collins and colleagues, was the cloning of the gene for cystic fibrosis, using only DNA-based “positional” techniques, without knowing the actual defective protein. This was, at last, a clear, practical application of genomics.
From 1985 through the first part of the 1990s, all of these technologies and uses of DNA were improving, and it became increasingly clear that it was at least possible to consider sequencing the entire genome. However, this was still more of a sheer cliff than a gentle slope to climb. The human genome has three billion letters, a million-fold larger than bacteriophages and 30 times larger than the worm. If the human genome was going to be tackled, it was going to take a substantial, coordinated effort. Debates raged about the best technologies and approaches, the right time to invest in production vs developing better technology, and who, worldwide, would do what.
By the mid-90s things had settled down. The step-by-step approach used in the worm was clearly going to succeed, and there was no reason not to see the same approach working in human. The approach of mapping first, then sequencing, was also compatible with international coordination, whereby each chromosome could be worked on separately without people treading on each other’s toes. There was some jostling about which groups should do which chromosomes (the small ones were claimed first, unsurprisingly), and some grumbling about people reaching beyond their actual capacity, but it was all on track to deliver around 2010. Five major centres were set to do most of the work:
- The Sanger Centre (now the Sanger Institute), led by John Sulston with Jane Rogers and David Bentley as key scientists, funded by the Wellcome Trust, a UK charity;
- US Department of Energy (DOE)-funded groups around the Bay Area in California (now the Joint Genome Institute, JGI), with Rick Myers in the early stages and Eddy Rubin pulling the configuration together;
- Three US National Institutes of Health (NIH) centres, with oversight from Francis Collins, director of the NIH’s National Human Genome Research Institute:
  - The Washington University genome centre in St Louis, led by Bob Waterston with Richard Wilson and Elaine Mardis as key scientists (this was the Sanger’s sister group on the worm as well);
  - The Whitehead genome centre at MIT (now the Broad Institute), formed by mathematician-turned-geneticist (and part-time entrepreneur) Eric Lander;
  - The Baylor genome centre in Texas, led by Richard Gibbs, an Australian transplanted into Texas.
Then, the sequencing world was turned upside down.
Craig Venter, a scientist/businessman, had been around the academic genomic world for some time, and realised perhaps better than anyone else the potential impact of automation. He had already published the first whole-genome shotgun sequence of a bacterium and, inspired by a paper from Gene Myers (a computer scientist working on text analysis, and converting to biology), realised that a similar approach could work on human. Craig assembled an excellent set of scientists – Gene Myers, Granger Sutton and Mark Adams among others – and persuaded leading technology company ABI to set up a new venture to sequence the human genome – privately. This was at the end of the 1990s, at the start of the dotcom boom, when it was anyone’s guess what a viable business model would be. Certainly, holding a key piece of information for biomedical research 10 years before the public domain effort looked a pretty good bet. Celera was born, raised a substantial amount of money on the US stock market and purchased a massive fleet of sequencers and computers.
The academic project responded to the new, higher-pressure timeline. Rather than keeping with the map-first, sequence-second approach, people switched to sequencing and mapping as one scheme, but still with mid-size pieces (BACs – regions of around 100,000 letters) rather than individual reads (only 500 letters at a time). This was a half-way point towards whole-genome shotgun and, critically, allowed the five major centres to accelerate their production rate. The nice map with flags across the genome basically disappeared (though each chromosome would then be mapped and finished) and the five centres ploughed onwards, leaving footprints all over the nice, tidy, well-laid plan.
But this acceleration of rate caused another problem: bottlenecks in the downstream informatics. Celera started to crow a bit about their depth of human talent in computer science and the size of their computer farm. This became a real issue. The public project was facing a very real headache of having thousands of fragments of the genome without any real way to put them together. My supervisor, Richard Durbin, was the lead computational person at Sanger and stepped up along with other academic groups, notably the creative, enthusiastic computer scientist David Haussler in Santa Cruz. David and Richard had worked on and off on all sorts of things, bringing in parts of computer science methods into biology, and they – with us, their groups – began to try and crack this problem.
The first problem was assembly. Previously, we were guided by a “physical map” and assembly was effectively done by hand on a computer-based workbench. This needed to change. David was joined by ex-computer-gaming programmer Jim Kent, who felt he could do this. I remember discussing the details of assembly methods and concepts on a phone call, with Jim enthusiastically claiming it was doable and everyone agreeing that Jim should come to the Sanger Centre for a while to absorb the details of overlaps, dispersed repeats and other Sanger genome lore. He packed his bags and left that day, appearing 12 hours later in Hinxton: a jovial, very definitely west-coast American, ready to get to work. Jim then worked solidly for about six months (back in Santa Cruz) to create the “golden path assembler”, which provided the sequence for the public project. Jim also created the UCSC Browser, which remains one of the premier ways to access the human genome (though of course I am partial to a different, leading browser…).
And it didn’t stop there. The public project and the private Celera project were now really swapping insults in public, and Celera said that even if the public project could assemble their genome, they wouldn’t be able to find the genes in this sequence. Thankfully, three of us – Michele Clamp, Tim Hubbard and myself – had already started a sort of ‘skunk-works’ project at Sanger to annotate the genome automatically. The algorithmic core was a program I had written, GeneWise, which was accurate and error-tolerant but insanely computationally expensive. Tim had an (in retrospect, bonkers) cascading file system to try to match the raw computation with the arrival of data in real time. Michele was the key integrator. She was able to take Tim’s raw computes, craft the right approximation (described as “Mini-seq”) and pass it into GeneWise. This started to work, and we made a website around it: the Ensembl project, which provided another way to look at the genome. (Mini-seqs and GeneWise still hum away in the middle of Ensembl gene builds, and are responsible for the majority of vertebrate and many other gene sets.)
Even more surreally for me, the corresponding Celera annotation project was also using GeneWise (I had released it open source, as I would with everything), so I would have a list of bugs and issues from Michele and Ensembl during the day, and then a list of bugs and issues from Mark Yandell and colleagues at Celera overnight. The friendliness and openness of the Celera scientists – Gene, Mark Adams and Mark Yandell – was completely at odds with the increasingly bitter public stance between the two groups.
It was an intense but fun time. Michele and I worked around the clock to provide a sensible model of the genome and features (using – radically at the time – an SQL backend), and there were constant improvements to how we computed, stored and displayed information. We’d often work all day, flat out, and then head back to Cambridge, often to Michele’s house, where we’d snatch a quick bite and watch the latest set of compute jobs fan out across the new, shiny compute farm bought to beef up Ensembl’s computational muscle. Michele’s partner (now husband) James ran the high-end computers, so if anything went wrong – from system through algorithm to integration – one of us was on hand to fix it. As the first jobs came back successfully, we would slowly relax, and eventually reward ourselves with a gin and tonic as we continued to keep one eye on the compute farm.
Eventually it became clear that both projects were going to get there – pretty much – in a dead heat. Given that the public project’s data could be integrated into the private version, Celera switched data production efforts to mouse, much to Gene Myers’ annoyance as he wanted to show that he could make a clean, good assembly from a pure whole-genome shotgun. There was a brokering of a joint statement between Celera and the public project, and this led to a live announcement from the White House by Bill Clinton, flanked by Craig Venter (private) and Francis Collins (public), with a TV link to Tony Blair and John Sulston in the UK.
One figure in this announcement came from our work: the number of human genes in the genome. This is a fun story in itself – I can’t do justice to it now – involving wild over-estimation for over two decades followed by extensive soul-searching as the first human chromosomes came out. I ended up running a sweepstake for the number whereby, in effect, we showed that in the absence of good data, even 200 scientists can be completely wrong. For the press release, it was our job to come up with an estimate of the number of human genes, so Michele launched our best-recipe-at-the-time compute. Bugs were found and squashed, and I remember hanging around, providing coffee and chocolate to Michele as needed (there is no point really in trying to debug someone else’s code in a pressurised environment). Eventually an estimate popped out: around 26,000 protein-coding genes.
We looked at each other and shook our heads – clearly too low, we thought – and went into the global phone conference, where the good and the great of genomics said “too low” as well. So we went back and calculated all sorts of other ways there could be more protein-coding genes (after all, a biotech called Incyte had been selling access to 100,000 human genes for over five years). We ended up with the rather clumsy phrase, “We have strong evidence for around 25,000 protein-coding genes, and there may be up to 35,000.”
In retrospect, Michele and I would have been better sticking to our guns, and going with the data. In fact, we now know there are around 20,000 protein-coding genes (though there are enough complex edge cases not to have a final number, even today).
The human genome was done in a rush, with enthusiasm, twice, in both cases in such a complex way that no other genome would be done like this again. In fact, Gene Myers was right. Whole-genome shotgun was “pretty good” (though purists would always point out that if you wanted the whole thing, it wouldn’t be adequate). The public project, John Sulston above all, was right that this information was for all of humanity, and should not be controlled by any one organisation.
I was very lucky to be at the right place at the right time to be a part of this game-changing time for human biology. Crazy days.