A long time ago, in the late 1990s, I was on the edge of the debate about SNPs: whether there should be an investment in first discovering, then characterising and then using many, many biallelic markers to track down diseases. As is now obvious, this decision was taken (first the SNP Consortium, then the HapMap project and its successor, 1000 Genomes, and then many genome-wide association studies). I was quite young at the time (in my mid to late twenties; I even had an earring at the start of this as I was a young, rebellious man) and came from a background of sequence analysis – so I remember it being quite confusing getting my head around all the different terminology and subtleties of the argument. I think it was Lon Cardon who patiently explained the concepts to me yet again, and he finished by saying that the real headache was that there were just so many low frequency alleles that were going to be hidden and that was going to be a missed opportunity. I nodded at the time, adding yet one more confused concept in my head to discussions about r² behaviours, population structure, effect size and recombination hotspots, none of which sat totally comfortably in my head at the time.
That debate is worth someone trying to reconstruct and write up (I wonder if those meetings are recorded somewhere) because, as in many scientific debates, everyone was right at some level. For the proponents of the SNP approach, it definitely “worked” – statistically strong, reproducible loci were found for many diseases. Although these days people complain about the lack of predictions from GWAS, at the time the concern was not whether there would be some missing heritability issue, but (as I remember) whether it would work at all. It did, and in spades – just open an issue of Nature Genetics. However, the people who were cautioning that there would be a lot more complexity to disease – allelic heterogeneity, complex relationships between SNPs (both locally and globally) and then this curse of allele frequencies, let alone anything more complex, such as gene/environment, parent-of-origin or even epigenetic trans-generational inheritance (I list these in the rough order of my own assessment of impact; feel free to order to taste) – have also definitely been proved right by our current scenario.
Remembering that young man in his mid twenties, confused by all the terms spinning around, each of these pieces of complexity deserves unpicking. Allelic heterogeneity is when a locus is involved in a disease (for example, the LDL receptor – the LDLR gene – with familial hypercholesterolaemia), but there are many different alleles (each a different mutation), often with different effects, involved in the disease. This means the disease is definitely genetic, and that a particular gene is definitely involved, but that no particular SNP is found at a high level associated with the disease, as there are hundreds or so of different (probably) causative alleles. The complex relationship between SNPs, epistasis, operates both at a local (“haplotype”) level, where there might be a particular combination that’s critical, and globally. A good example of this local complexity is the study by Daniel MacArthur and colleagues where they found that a number of apparent frameshift mutations, predicted to be null alleles, were “corrected” back in frame, making (in effect) a series of protein substitution changes, presumably with a far milder effect, if any. If you try to model each variant alone here, you make very different inferences from modelling the haplotype; in theory one should try to model the whole, global genotype.
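To make the frameshift example concrete, here is a minimal sketch of why judging each variant alone and judging the haplotype can disagree. It is only an illustration, not the method used in the study: the function names and the toy indel lengths are hypothetical, and it deliberately ignores everything except the net change in reading frame (for instance, it does not check for stop codons between the two indels).

```python
def net_frame_shift(indel_lengths):
    """Net reading-frame offset for indels carried on one haplotype.

    Insertions are positive lengths, deletions negative (in base pairs).
    A result of 0 means the reading frame is restored downstream.
    """
    return sum(indel_lengths) % 3


def is_frame_preserved(indel_lengths):
    """True if the combined indels leave the downstream frame intact."""
    return net_frame_shift(indel_lengths) == 0


# Hypothetical haplotype: a 1 bp deletion followed by a 1 bp insertion.
haplotype = [-1, +1]

# Judged one at a time, each indel looks like a frameshift...
each_alone = [is_frame_preserved([indel]) for indel in haplotype]

# ...but together the frame is restored after the second indel.
together = is_frame_preserved(haplotype)

print("each variant alone keeps frame:", each_alone)
print("haplotype keeps frame:", together)
```

In this toy case the per-variant view predicts two null alleles, while the haplotype view predicts a short stretch of altered amino acids followed by normal protein.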
And only recently have I really come to appreciate the headaches that Lon was trying to explain to me around allele frequency. One of the early and robust predictions of population genetics, which is pretty obvious when you think about it, is that one expects a steep decay in the number of alleles as their frequency in the population increases – i.e., lots and lots more rare alleles than common alleles. This is because when a mutation happens, it must start at a ratio of 1 to “the whole size of the population” and can only grow bigger generation by generation. If the allele doesn’t affect anything you can model this process very elegantly as a random walk. For starters, this random walk tends to stay at pretty low frequencies just because it is random, and in fact the most likely thing is that it randomly disappears from the population. Now if this allele has a deleterious effect – which is basically what we expect for disease-associated alleles – then it is even more likely to stay at a low frequency. I visualise this as the genome having a sort of series of little bubbles (variants) coming off it, and these bubbles nearly always popping straight away (variant going to zero); only rarely does a bubble get big (grow in frequency in the population). A disease effect is always pushing those bubbles associated with disease to be smaller. And – often – you can’t even see the small bubbles. At the limit, every locus will have complex allelic heterogeneity; the only question is how big – in both frequency and effect – some of the alleles are.
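The bubble picture can be played with in a few lines of code. What follows is a rough sketch under assumptions of my own choosing: a haploid Wright-Fisher style model with a constant population size, where `pop_size`, `selection`, the number of trials and the seed are all illustrative values rather than anything estimated from data. Each trial follows one new mutation from a single copy until it is lost or fixed.

```python
import random


def simulate_new_mutation(pop_size, selection=0.0, max_generations=10_000, rng=None):
    """Follow one new mutation in a haploid Wright-Fisher style population.

    Returns (peak frequency reached, whether the allele was lost).
    A negative `selection` makes the allele deleterious.
    """
    if rng is None:
        rng = random.Random()
    count = 1  # a new mutation starts as one copy in the population
    peak = count
    for _ in range(max_generations):
        if count == 0 or count == pop_size:
            break  # the bubble has popped, or taken over entirely
        freq = count / pop_size
        weighted = freq * (1 + selection)
        # Chance that each individual in the next generation carries the allele
        p = weighted / (weighted + (1 - freq))
        count = sum(1 for _ in range(pop_size) if rng.random() < p)
        peak = max(peak, count)
    return peak / pop_size, count == 0


def summarise(pop_size, selection, trials, seed):
    """Fraction of new mutations lost, and their mean peak frequency."""
    rng = random.Random(seed)
    results = [simulate_new_mutation(pop_size, selection, rng=rng) for _ in range(trials)]
    lost_fraction = sum(lost for _, lost in results) / trials
    mean_peak = sum(peak for peak, _ in results) / trials
    return lost_fraction, mean_peak


if __name__ == "__main__":
    for label, s in [("neutral", 0.0), ("deleterious", -0.05)]:
        lost, peak = summarise(pop_size=200, selection=s, trials=300, seed=1)
        print(f"{label}: lost in {lost:.1%} of trials, mean peak frequency {peak:.3f}")
```

Running it with different values of `selection` gives a feel for how strongly even a mild deleterious effect keeps the bubbles small, though real populations (diploid, growing, structured) will behave differently from this toy model.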
Having appreciated this at a far deeper level now (partly from looking at a lot more data myself), I am even more impressed that GWAS works. For GWAS one not only needs a variant tagging the disease variant, but that variant has to be at a reasonable enough frequency to detect something statistically – one or two individuals will not cut it. This is one of the big drivers for the large sample sizes in genome-wide association studies – large sample sizes are needed just to capture enough copies of the minor allele of rare variants – and remember that the majority of variation is in this “rare” scenario.
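A back-of-the-envelope calculation shows why sample size bites so hard for rare variants. This sketch assumes each of the 2N chromosomes in a sample of N diploid individuals carries the minor allele independently with probability equal to the minor allele frequency; the function names, the frequency of 0.1% and the threshold of 20 copies are arbitrary illustrations, not a power calculation for any real study.

```python
from math import comb


def expected_minor_allele_copies(maf, n_individuals):
    """Expected copies of the minor allele among 2N chromosomes."""
    return 2 * n_individuals * maf


def prob_at_least_copies(maf, n_individuals, min_copies):
    """Binomial probability of sampling at least `min_copies` minor alleles.

    Assumes the 2N chromosomes are independent draws at frequency `maf`.
    """
    n_chrom = 2 * n_individuals
    p_fewer = sum(
        comb(n_chrom, k) * maf**k * (1 - maf) ** (n_chrom - k)
        for k in range(min_copies)
    )
    return 1 - p_fewer


if __name__ == "__main__":
    maf = 0.001  # illustrative rare variant, 0.1% frequency
    for n in (1_000, 5_000, 20_000):
        copies = expected_minor_allele_copies(maf, n)
        chance = prob_at_least_copies(maf, n, min_copies=20)
        print(f"N={n}: expect {copies:.0f} copies, P(at least 20 copies) = {chance:.3f}")
```

Seeing enough copies is only a precondition for detecting an association, so the sample sizes needed for real statistical power will be larger still.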
But the other place where we can improve our ability to understand things was illustrated by a talk by Samuli Ripatti, working with other colleagues worldwide (including my new collaborator, Nicole Soranzo) on lipid measurements. They took a far larger set of lipid measurements than is normally done in a clinical setting, with an alphabet soup of HDL and LDL subtypes, along with all sorts of amino acids. From this, not only did they recapitulate all the existing HDL and LDL associations, but very often the specific subtypes of LDL or HDL showed far stronger effects than the composite measurements. In some sense this is no surprise – the closer one gets to measuring a biological end point of genes, the bigger the effect you will see from variants, whereas more composite measurements must have more sources of variation by their very nature. And this is where all the molecular measurement techniques of ChIP-seq, RNA-seq, etc. (exploited and explored in projects like ENCODE and others) are going to be very interesting, though we won’t be able to do everything on every cell type.
So – the moral of this story is twofold. Firstly, we will need large sample sizes to understand the full set of genetic effects – despite many people telling me this over the last three or four years, it only really “clicked” in my head in the last six months. Secondly, we need to raise our (collective) game in phenotyping, and not just molecular phenotyping, or cellular, or endo, or disease – but all types of phenotyping, as the closer we can get to the genotype from the phenotype end, the better powered we are.
And many, many groups worldwide are getting stuck into this, telling me that we have at least another decade’s worth of discovery coming from relatively “straightforward” (in concepts, though not in practice, logistics, sequencing or analysis!) human genetics.