Today sees the publication of the paper by Zhihao Ding, YunYun Ni, Sander Timmer and colleagues (including myself) on local sequence effects and different modes of X-chromosome association as revealed by the quantitative genetics of CTCF binding. This paper represents the joint work of three main groups: Richard Durbin’s at the Sanger Institute, Vishy Iyer’s at U. Texas, Austin and my own at EMBL-EBI. I’m delighted that this work from Zhihao, YunYun and Sander (the three co-first authors) has finally come out, and I want to share some aspects of it that were particularly interesting to me.
Stepping back into quantitative genetics
This is the second time I’ve shifted research direction. Even though it’s still broadly in the area of bioinformatics, quantitative genetics is a discipline I hadn’t explored very deeply before embarking on this paper. Quantitative genetics is a very old field – arguably one that lit the spark that set off molecular biology as a science – and the birthplace of statistical methods for life science.
[Image: RNA FISH for X-chromosome inactivation]
Legends of frequentist statistics – Pearson (of Pearson’s correlation) and Fisher (of Fisher’s exact test) – were motivated by biological problems around variation and genetics from the 1890s through to the 1930s and 1940s. This was also the time of the rediscovery of Mendel, the fusing of Mendel’s laws with evolution, and a whole host of fundamental discoveries: for example, Morgan’s discovery that chromosomes carry the units of hereditary information. My personal favourite is a 1921 paper on X–Y crossovers in my favourite fish, medaka, highlighted by a great editor’s comment. Reading some of these papers can be spine-tingling. These scientists did not know about DNA, DNA polymorphisms as we understand them, or genes as RNA transcription units. Still, they could figure out many features of how genetics affected traits.
Quantitative genetics fell out of fashion in the 1970s and 1980s as the new, wonderful world of molecular biology sprang up. The new capability to isolate, manipulate and use nucleic acids seemed to bypass the need to infer idealised things statistically from whole organismal systems. There was also a push to use forward genetics (i.e. large genetic screens) in model organisms – in fact, some of the best work was based on the same organism (Drosophila) on which the original genetics was worked out.
In humans, the discovery of single-locus genetic diseases – Mendelian disorders – led to epic, systematic “mapping” expeditions for these genes (CFTR being perhaps the most iconic). Only the plant and animal breeders kept the flame of quantitative genetics alive during this time.
Human genome + Dolly the sheep + transparent pigs = great time to be in genetics
The process of mapping human disease-causing genes was a big part of the motivation for the human genome project. At some point it became clear that this mapping would be far better done as a systematic effort rather than as a collection of individual laboratory projects. John Sulston and Bob Waterston surprised everyone by showing that sequencing a large complex genome (in this case, C. elegans) could be done by scaling up the process of sequencing in a rather factory-like way – just in time for the larger-scale human genome project.
(At the time all this was getting underway, I was an undergraduate transitioning to a PhD student at the Sanger Institute. I came into it during a rather exciting three-year period when the public project grappled with the entry of a privately funded competitor, Celera. It was a fascinating time but I should perhaps write about it in another blog post. Back to quantitative genetics!)
I was introduced properly to quantitative genetics when I became part of the governance board of the Roslin Institute. (I still don’t know who suggested me as a Board member – Alan Archibald? Dave Burt?) Scientifically and politically, it was a real eye-opener. I learned (rather too much) about how BBSRC Institutes are run and constituted, and started to appreciate the tremendous power of genetics.
The Roslin was in the news for the successful nuclear transfer from a somatic cell to make a new embryo, Dolly the sheep, and this was great work. (They also made transgenic chickens and pigs – a glow-in-the-dark pig is quite a thing to see, and surprisingly useful). Together with the Genetics department in Edinburgh, the Roslin was one of the few places that still took quantitative genetics seriously. From talks with Chris Haley and Alan Archibald (and beers/scotch) I started to realise the awesome end-run that quantitative genetics does around problems.
The great thing about genetics is that if you can measure something reliably – and thus find the variance between individuals that is due to genetics – you can identify genetic components with the usual triad of good sample size, good study design and good analysis. How well you can do this requires understanding the architecture of the trait (e.g. the number of contributing loci and the distribution of effect sizes) – itself something non-trivial to work out – but in theory all measurable things with some genetic variance are discoverable, given enough samples.
No guesswork required
Let me stress: you don’t need to guess where to look in the genome to associate genetic variants with something you measure, so long as you test the entire genome sensibly. So you needn’t have a profound mechanistic understanding of what you want to measure – you just need the confidence that there is some genetic component there, and the ability to measure it well. Step back for a moment and think about all the things you might want to know about an organism at the genetic level that you could measure. It could be something simple to measure, like height (which hides a lot of complexity), or something more complex, like metabolites (which are often driven by far simpler genetics), or something really complex, like mathematical ability in humans.
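To make this concrete: a genome-wide association scan is, at its simplest, one association test per variant, with no prior guess about where in the genome the signal sits. Here is a minimal sketch on simulated data (the variable names, effect size and the use of a plain per-variant linear regression are my illustrative assumptions, not the paper’s actual pipeline):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n_samples, n_variants = 500, 200

# Simulated genotype dosages: 0/1/2 copies of the alternate allele per person
genotypes = rng.binomial(2, 0.3, size=(n_samples, n_variants))

# Phenotype driven by a single causal variant plus noise (purely illustrative)
causal = 42
phenotype = 0.8 * genotypes[:, causal] + rng.normal(0.0, 1.0, n_samples)

# Test every variant in turn: no need to guess where to look
pvals = np.array([linregress(genotypes[:, j], phenotype).pvalue
                  for j in range(n_variants)])

print(pvals.argmin())  # index of the strongest association
```

Real analyses add covariates, mixed models for relatedness, and multiple-testing control, but the “measure and associate” core is exactly this loop.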
Chris, Alan and others really knew that quantitative genetics worked. For their own part, they were often paid by animal breeders (companies breeding chickens, pigs and cows) to improve rather complex traits such as milk yield or growth rate. These companies may have appreciated that good science was nice to have, but they were pragmatic and ultimately driven by the bottom line. If something did work, they would come back to Chris and Alan. Sure enough, Roslin had a lot of repeat business.
During this time I started to realise that a lot of the quantitative genetics methods developed at the end of the 20th century were designed to circumvent the fact that genomes are big and determining markers in many individuals is expensive. As next-generation sequencing came online, I could see we were going to have a completely inverted situation: genotyping was going to get stupidly cheap and dense, and accordingly any set of phenotypes (things you can measure) would become amenable to these techniques (again, with the caveat of appropriate sample size). And I really mean any phenotypes.
What to measure?
At some level, this is ridiculous. There are so many things you want to know about, on many scales – molecular, cellular, organism – and to have one technique (measure and associate) seems just a bit too conceptually simple. Of course, it’s not so simple – there must be some genetic variance to get a handle on the problem (sometimes there are just no variants that have an effect in a particular gene), and the genetic architecture might be very complex, but these assumptions are reasonable for many, many interesting things – in particular the more “basic” things about living organisms. The big problem is choosing what to do. So, about five years ago I started to think more about what might be the most interesting things to measure, and in which systems.
For measurements, it seems crazy to capture a single variable each time. Why measure just one thing? You want a high-density phenotyping method so you can look at lots of variables at the same time. This leads you to two readout technologies: next-generation sequencing (all gene expression, or all binding sites) for molecular events, or imaging for cellular and organismal ones.
I knew I wanted to stay away from human disease as a system. This is a hotly contested field with a few big beasts controlling the drive towards different endpoints. I like working with these big beasts (they are usually very friendly and clever) but know it would be foolish to try and break into their area (far better to collaborate!). Talking to Chris and Alan (and later with epidemiologists George Davey-Smith and John Danesh), I came to realise that disease brings in all sorts of complications — such as confounders of diagnosis and treatment.
I also knew I wanted to work on things where you could close the loop from initial discovery to specific mechanism/result, and I wanted to work with experimental systems we could manipulate many times over.
Back to the paper…
And this, finally, leads us back to the new paper. We worked on a molecular phenotype: CTCF binding. CTCF is an interesting transcription factor with a great ChIP antibody, which helped the feasibility of our experiments. CTCF can be measured in a high-content manner (i.e. ChIP-seq) in an experimentally reproducible system (human LCLs; 50 of the 1000 Genomes cell lines, so we had near-complete genotypes). It fits both criteria – it’s a high-dimensional phenotype (there are around 50,000 good CTCF sites in the genome) and we did it in an experimental system that we could go back to – LCLs. It is, arguably, my lab’s first “proper” quantitative genetics paper. As ever when you switch fields, the project brought with it a multitude of details to sort out and ponder.
For example, we had to learn to love QQ-plots more than box-plots. QQ (quantile-quantile) plots make a big difference in judging whether your associations make sense, given the multiple testing problem. Because the genome is a big place, you are going to test a lot of things; you are guaranteed to find “visually interesting” associations (the boxplot looks good, the P-value looks good), and you will not have any idea whether they are interesting enough given the number of tests. And – rather more subtly – you need to know whether your test is well behaved. A QQ plot summarises both of these neatly.
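What a QQ plot summarises can be sketched in a few lines: sort the observed P-values, plot them against their expected uniform quantiles, and compute an inflation factor to check the test is well behaved. The helper names and the use of the median-based lambda_GC statistic below are my illustrative choices, not the paper’s code:

```python
import numpy as np
from scipy.stats import chi2

def qq_points(pvals):
    """Expected vs observed -log10(P) pairs for a QQ plot."""
    p = np.sort(np.asarray(pvals))
    n = len(p)
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(p)
    return expected, observed

def genomic_inflation(pvals):
    """Lambda_GC: median 1-df chi-square statistic over its null median (~1 if well behaved)."""
    stats = chi2.isf(pvals, df=1)  # convert P-values back to chi-square statistics
    return np.median(stats) / chi2.isf(0.5, df=1)

rng = np.random.default_rng(1)
null_p = rng.uniform(size=100_000)  # a well-behaved null test: points hug the diagonal
expected, observed = qq_points(null_p)
print(round(genomic_inflation(null_p), 2))
```

Points lifting off the diagonal only at the smallest P-values suggest real signal; a wholesale shift (lambda well above 1) suggests stratification or a mis-calibrated test.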
There are deeper gotchas – notably population stratification – and getting a good feel for linkage disequilibrium does take a while. There are, of course, all sorts of measurement/technical issues, and in this case my experience with ENCODE gave me a good grounding in the practicalities of ChIP-seq.
Much of what we discovered was to be expected. Indeed, variants do affect CTCF binding, particularly when they are in the CTCF binding motif (I can see a number of CTCF/transcription factor molecular biologists rolling their eyes at the obviousness of this). But there is a sizeable number of variants that are not in the motif and, presumably, affect binding via some other, indirect mechanism (LD presents some complications here, but we’re in a good position to assess this). We also saw a large number of allele-specific signals. Interestingly, we can show that when there is a between-individual genetic difference in binding, it relates linearly to allele-specific levels in heterozygous individuals – but not the other way around: there are allele-specific sites that do not show between-individual differences. This is perhaps not that surprising, but it was good to see, and good to quantify.
And then biology just does something unexpected. As my student Sander Timmer was looking to find individual CTCF-to-CTCF “clean” correlations, he came upon a stubborn set of sites that formed a big hairball of correlations. This is not uncommon for this sort of data (there are all sorts of reasons why you get correlations: antibody efficiency, growth conditions, weird unplaced bits of genome present in one set but not another…). We were digging through all these options (blacklisting, weird samples, weird bits of genome) and this hairball just was not going away. I was almost at the stage of just acknowledging and accepting the hairball, assuming it was some artefact and moving on to the more individual site correlations, when Sander came in showing that nearly all the sites were on the X chromosome.
It’s easier to describe this from the perspective of what we believe is going on than to recount feeling around the data for three months or so, trying to work it all out. The X chromosome is a sex chromosome: males have one copy whereas females have two. This causes a headache for mammals in that if the X chromosome behaved “as normal”, females would consistently show twice the expression of every X-chromosome gene compared with males.
In mammals this is “solved” by X-chromosome inactivation, where one X chromosome in females is switched off by quite an extreme molecular process called (unsurprisingly) X-inactivation. (Biology, in its weird and wonderful way, does this completely differently in other animal clades, e.g. C. elegans or Drosophila. Go figure.) This leads to a visibly compacted X chromosome in females (called the Barr body, after the first person to characterise it), and this random inactivation underlies some classic phenotypes: tortoiseshell (or calico, in US-speak) cat coat colouring is due to it, and multi-coloured eyes (sometimes in the same iris) in women are due to the random choice of which X chromosome to inactivate.
When you look at RNA levels in female cell lines, the vast majority of X-chromosome genes have a similar expression level to males. There are exceptions (called “X-chromosome inactivation escape”), and one famous, female-specific RNA – Xist – is a key molecule in X inactivation (see my previous post about the wonders of RNA).
What is CTCF doing?
All this is well established, but what did we expect to see for CTCF? CTCF is a very structural chromatin protein involved in chromatin loops. Sometimes these loops are very important for gene regulation, giving rise to CTCF’s role in insulators; sometimes there is some other looping mechanism. And there are CTCF binding sites everywhere – the X chromosome included. So we thought there were three basic options for each CTCF site (there are ~1,000 CTCF sites on X):
- CTCF site is rather like RNA – mainly suppressed, present only on the active X. If so, we’d expect males and females to have similar levels of CTCF.
- CTCF site is involved in X-inactivation (perhaps a bit, perhaps a lot) – If so, we’d expect there to be female-specific CTCF sites.
- CTCF site is oblivious to X-inactivation – in this case we’d expect a 1:2 male:female ratio.
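These three options translate into expected female:male signal ratios (roughly 1:1, female-biased, and 2:1 respectively), so in principle each site can be binned by that ratio alone. A toy classifier along those lines (the function name, thresholds and hard-cutoff logic are my illustrative choices, not the paper’s method):

```python
import numpy as np

def classify_x_site(male_signal, female_signal, tol=0.15):
    """Bin an X-chromosome CTCF site by its mean female/male ChIP signal ratio."""
    ratio = np.mean(female_signal) / np.mean(male_signal)
    if abs(ratio - 1.0) < tol:
        return "single active"   # option 1: similar levels in males and females
    if abs(ratio - 2.0) < 2 * tol:
        return "both active"     # option 3: 1:2 male:female, oblivious to inactivation
    return "female specific"     # option 2 candidate: involved in X-inactivation

print(classify_x_site([10, 11, 9], [10, 10, 11]))
print(classify_x_site([10, 11, 9], [21, 19, 20]))
```

In practice you would work with normalised read counts and uncertainty on the ratio rather than hard thresholds.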
Before I go any further, it’s worth being sure of this result, because a lot of things can happen and drive signal in these ChIP-seq – or any genomics – experiments. The clincher for me was when we looked at individuals who were heterozygous for alleles at option 1 vs option 3 sites. For sites classified as “single active” (option 1) we see one predominant allele – consistent with just one chromosome being bound. For sites classified as “both active” (option 3) we see mainly a 50/50 split of alleles – consistent with both chromosomes being bound.
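The allelic check above amounts to a simple binomial question at each heterozygous site: is the split of ChIP-seq reads between the two alleles consistent with 50/50? A minimal sketch (the helper name and significance threshold are illustrative; the paper’s actual statistics were more involved):

```python
from scipy.stats import binomtest

def allelic_class(ref_reads, alt_reads, alpha=0.01):
    """Classify a heterozygous site by its allelic read split.

    A ~50/50 split is consistent with both X copies bound ('both active');
    a strongly skewed split with only one copy bound ('single active')."""
    result = binomtest(ref_reads, ref_reads + alt_reads, p=0.5)
    return "single active" if result.pvalue < alpha else "both active"

print(allelic_class(48, 52))  # balanced alleles
print(allelic_class(95, 5))   # one predominant allele
```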
Here, the process of X-chromosome inactivation allows us to pull apart at least two classes (probably more things are happening within these classes as well), and thus assign a large number of CTCF sites to two classes. And indeed, they look very different. One (single active) is alive with activating histone modifications and the other (both active) is pretty silent (don’t know your histone modifications? Here’s a cheat’s guide). But the “both active” set has just as strong an overlap with conserved elements as the “single active” set and, if anything, shows stronger nucleosome phasing around it (so it’s definitely there).
More questions than answers
We didn’t take on the study to find out whether female-specific CTCF sites are wrapping up RNA on the inactive X, but in this respect our work raises more questions than it answers. Can we parse CTCF sites further with bioinformatics? (There is a great story about CTCF site deposition by repeats – is this linked?) How conserved is CTCF site status between closely related mammals and, given that we know many of these sites are in different locations in mouse, is there something common to them that we should be looking at? Does anyone want to… knock some of these sites out? Do we know how to phenotype them? Our work showed that CTCF is not the molecule involved in Barr body compaction, so… what is?
So – this was a great paper with which to start my group in quantitative genetics. In many ways, the HipSci project (which I am part of) is the really well-powered (many more samples) and better-constructed (iPSCs rather than LCLs) successor – this paper is a sort of training ground for what to expect from genetic effects on chromatin, and it joins a long line of RNA-seq QTLs (eQTLs) and DNase I or methylation QTLs from other groups.