As more and more research into human health switches to human subject research, including a considerable amount of molecular work, molecular biologists and bioinformaticians are going to have to get far more comfortable with epidemiology than ever before. There are some big, elephant-trap-like mistakes you can make, and indeed I see people making them right now (and, embarrassingly, I’ve stepped in one or two of these traps myself; one lives and learns). One reason why there are so many traps for the unwary is that our first collective foothold in epidemiology is, in effect, genome-wide association studies. These have the rare property that one of the key variables being measured (genotype) both (a) does not change over time and (b) is relatively uniformly distributed with respect to the other things we can measure about people. I’ll return to this below.
First, a quick crash course in epidemiology (and apologies if I misstate things; please comment below if you are a professional epidemiologist and you think I’ve missed something). Epidemiology is about achieving an understanding of (usually human) health and behaviour using observation; one draws inferences about the world. The key thing being measured is the correlation between two observed phenomena: height and weight, or heavy metals in the water and birth defects, or distance to a water pump and the incidence of disease in a city. However, you rarely know whether you have measured all the important variables (maybe it should be height, weight and eye colour?), and there might be a variable which is better correlated. Even if you have measured them all, correlation shows some linkage between the variables, but it tells you neither the direction of causation nor whether there is some third factor (a confounder) which is correlated with both of your variables while the two measured variables are not otherwise directly linked. A whole bunch of things in life are correlated. Per capita tea consumption is reasonably well correlated with how many people play cricket in a country, but this isn’t a causal link between tea drinking and cricket aptitude; both are confounded by how the British Empire left traces of its culture across the world. Far more complex cases occur. The protein CRP is higher in people who have had heart attacks, and the amount of CRP varies across the population. For a while a feasible hypothesis was that CRP levels could be used as a biomarker of heart attack risk (and might even be directly on a causal pathway). In fact, clever epidemiology (with the use of some genetic variables) managed to clearly refute this hypothesis, implying that the correlation arose because heart attacks changed the level of CRP (not the other way around).
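To make the tea and cricket point concrete, here is a minimal toy simulation. Everything in it is my own invention for illustration: the `empire` variable, the effect sizes and the number of "countries" are assumptions, not real data. Tea and cricket each depend on the confounder and never on each other, yet they come out correlated until the confounder is regressed out.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, zs):
    """What is left of ys after removing a least-squares line fitted on zs."""
    n = len(ys)
    my, mz = sum(ys) / n, sum(zs) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_countries = 5000      # far more "countries" than exist; this is a toy

# Hypothetical confounder: strength of historical British influence.
empire = [rng.gauss(0, 1) for _ in range(n_countries)]
# Tea and cricket each depend on the confounder, never on each other.
tea = [e + rng.gauss(0, 1) for e in empire]
cricket = [e + rng.gauss(0, 1) for e in empire]

raw = pearson(tea, cricket)
adjusted = pearson(residuals(tea, empire), residuals(cricket, empire))
print(f"tea vs cricket, raw:                {raw:.2f}")
print(f"tea vs cricket, confounder removed: {adjusted:.2f}")
```

With these made-up settings the raw correlation should sit somewhere around 0.5 and the adjusted one close to zero. The catch, of course, is that the adjustment only works because I "measured" the confounder; in real life that is exactly the thing you may not have.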
Epidemiologists have a variety of increasingly advanced statistical tools, but the bedrock is the set of observations they are run over. How these observations are collected is called “study design”. Crudely there are three types of study design.
Case Series. This is the simplest design and the closest to anecdote: a series of similar scenarios is described. It is best thought of as the “hypothesis generating” part of a formal process. It is not just a hunch, but a hunch by someone with good data (observation) to back it up, although this data has not been collected in a manner which withstands considerable statistical scrutiny.
Case Control. This is where you select two contrasting groups, one with a disease and one without, often trying to match (“control for”) as many as possible of the other variables which might lead to confounding; for example, you might match cases and controls for age and sex, and for the location in the country they were collected from. Having done this you can look at measurements in the two groups and contrast them, often with sophisticated models which try to explicitly model as many of the confounders as you can. The major drawback of this approach is that you are selecting on the very criterion you are interested in, and therefore measuring after you have made that selection. So even when measurement X differs between your “cases” and “controls”, this could be because X is causal (or, at the least, observable before the selection) or because X is in fact a consequence of whatever you’ve split by. CRP plasma levels being correlated with heart attacks in a case control study might mean that CRP plasma levels are causal (or correlated with a causal event) and present before the heart attack, or that CRP levels change because of a heart attack. This problem is sometimes described as “reverse causation”. The other major headache is whether there is some unmeasured confounding variable that would in fact “explain away” the correlation that you see. Because you have selected people in some way, however you have done it, you might have dragged in some other confounders that, unless you also measure them, you won’t know about.
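Reverse causation is easy to reproduce in a toy model. The sketch below is purely illustrative: the marker is only "CRP-like", and the units, event rate and size of the post-event jump are numbers I made up. The marker has no effect at all on who has an event, yet a case-control comparison made after the event shows a clear gap, while the same comparison on baseline values (which only a cohort study would have) shows none.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
n_people = 20000

# Hypothetical CRP-like marker, arbitrary units, measured before any event.
baseline = [rng.gauss(2.0, 0.5) for _ in range(n_people)]
# Events occur independently of the baseline level (no causal effect).
had_event = [rng.random() < 0.1 for _ in range(n_people)]
# The event itself pushes the marker up by one unit (reverse causation).
after = [b + (1.0 if e else 0.0) + rng.gauss(0, 0.2)
         for b, e in zip(baseline, had_event)]

def case_control_gap(values, is_case):
    """Mean of the cases minus mean of the controls."""
    cases = [v for v, c in zip(values, is_case) if c]
    controls = [v for v, c in zip(values, is_case) if not c]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

gap_after = case_control_gap(after, had_event)      # what a case-control study sees
gap_before = case_control_gap(baseline, had_event)  # what a cohort could have seen
print(f"gap measured after the event:  {gap_after:.2f}")
print(f"gap measured before the event: {gap_before:.2f}")
```

The two gaps differ only in when the sample was taken, which is the whole point: the case-control design cannot distinguish them.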
Cohort Studies. This is the best scenario. Here a cohort (ie, a set of people) is collected under some broad criteria and then monitored over time; the time “arrow” helps disambiguate reverse causation. As one measures things in the future, these are often called a “prospective study” or “prospective cohort”, and if you regularly measure things at a variety of time points, this is called a “longitudinal cohort”. Some of the best studies have as their collection criterion simply the birth date of people in some geographic area, which is almost the most unbiased you can get; these are called Birth Cohorts. Even better still is when, in effect, one’s entire population becomes the study, such as in Iceland, some areas of Finland, or as proposed in Tayside, Scotland. The great thing about this sort of very broad, unbiased selection is that any confounders at least should not be in the selection process. The two headaches which remain are choosing the appropriate variables to measure, and the frustration of having to wait a long time to gather both early and late life measurements (Birth Cohorts are impressive in their time reach; read this paper that taps into 1946 Birth Cohort in Scotland). This broad selection, coupled with the time dimension to help constrain which correlations are consistent with being causal, means that epidemiologists love birth cohorts. The consequence of having broad criteria is that for things that are even only moderately uncommon one needs a large initial cohort to get high enough numbers. The other cost is simply the length of time and the expense of running these studies. Nevertheless, this is as good as it gets in epidemiology.
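The "large initial cohort" point is just arithmetic, but it is worth seeing the numbers. A back-of-the-envelope sketch, where the function name and the incidence figures are my own hypothetical choices (and drop-out, competing risks and the like are ignored):

```python
def cohort_size_needed(target_cases, cases_per_100k):
    """Smallest cohort expected to yield target_cases, given the cumulative
    incidence over the follow-up period in cases per 100,000 people."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (target_cases * 100_000 + cases_per_100k - 1) // cases_per_100k

# Hypothetical incidences: 1 in 100 people, and 1 in 2,000 people.
for incidence in (1000, 50):
    size = cohort_size_needed(1000, incidence)
    print(f"{incidence} per 100k -> recruit about {size:,} people for 1,000 cases")
```

So for a condition affecting 1 in 2,000 people over the follow-up period, a cohort would need around two million participants to expect 1,000 cases, which is why case-control recruitment is so tempting.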
As I mentioned, genome-wide association studies mainly use Case-Control designs. This is totally pragmatic: the need for a reasonably high sample size to reach statistical significance (given how big the genome is) means that one has to have the numbers, and only for the most common diseases can the numbers be supplied from prospective studies (and even then it can be challenging). Recruiting 1,000 patients with some specific disease is often logistically challenging, but usually totally feasible given money and effort. Two properties of genotypes really help us here. The first is that your genotype does not change over time. This means that measuring someone’s genotype after they have got the disease is (basically) the same as measuring the genotype beforehand (give or take the odd Ig locus and perhaps a very small chance of funky somatic CNVs; let’s not worry about these edge cases). This is a total gift for the reverse causation problem, as it means we can trust the genotype as something that can only cause differences, and cannot be caused by them. Some people therefore describe this variable as being “anchored”. The other, more subtle, gift of genetics is that there are traditionally not too many confounders with genetics (in a given well mixed population, the genotype variables are seemingly assigned at random, complying with predicted behaviours such as Hardy-Weinberg equilibrium). When there is a confounder with genetics, it is nearly always due to population structure (basically, incomplete mixing of a population in some way), and this has a very characteristic signal in the genotypes, as population structure gives a pretty homogeneous signal across the genome. One can easily recognise it, and then one has two options.
The crude (but effective) one is to drop the minority of individuals that contributes the majority of the signal (this only works if your cohort predominantly comes from one population). More sophisticated schemes use these global signals as additional explanatory variables in more complex statistical models. This ability to spot these confounders relies on the fact that the genome is a very big place, and that chromosomal assortment and recombination are random; therefore when there is a non-locus-specific confounder with genotypes, one has a lot of data and a good random model of the mixing process across the genome with which to discover it and correct for it.
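Here is a toy version of the population structure problem. To keep it short I cheat: each person carries a known population label and I simply analyse each population separately, whereas real studies have to infer the structure from genome-wide signals and model it. The allele frequencies and disease prevalences are invented. The disease ignores genotype entirely, but the two populations differ in both allele frequency and prevalence.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

def simulate(n, allele_freq, prevalence):
    """n people from one population; disease ignores genotype entirely."""
    people = []
    for _ in range(n):
        # Two independent allele draws, i.e. Hardy-Weinberg proportions.
        allele_count = (rng.random() < allele_freq) + (rng.random() < allele_freq)
        diseased = rng.random() < prevalence
        people.append((allele_count, diseased))
    return people

def allele_gap(people):
    """Mean allele count in cases minus mean allele count in controls."""
    cases = [g for g, d in people if d]
    controls = [g for g, d in people if not d]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

# Hypothetical populations: B has more of the allele and more disease.
pop_a = simulate(20000, allele_freq=0.2, prevalence=0.05)
pop_b = simulate(20000, allele_freq=0.6, prevalence=0.20)

pooled_gap = allele_gap(pop_a + pop_b)  # structure ignored
gap_a = allele_gap(pop_a)               # within population A only
gap_b = allele_gap(pop_b)               # within population B only
print(f"pooled: {pooled_gap:.2f}  within A: {gap_a:.2f}  within B: {gap_b:.2f}")
```

Pooled, the allele looks enriched in cases; within either population the gap should be close to zero. A single locus cannot tell you this on its own, which is why having the whole genome to estimate the structure from matters so much.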
The key thing is that these properties of genotypes remove the two major drawbacks of Case-Control studies: there is no reverse causation, and it is harder to be confused by confounders. Indeed, epidemiologists are making really good use of both of these properties in the context of longitudinal studies, where even weak genetic effects on a particular trait allow the disentangling of the other, more complex variables; the broad, random assignment of genotypes, coupled with their immunity to reverse causation, is a total gift to epidemiologists. These techniques are sometimes called Mendelian randomisation (the idea being that you use the random assignment of alleles at birth in the same way you use random assignment in a randomised trial).
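For the curious, the simplest Mendelian randomisation estimate is a ratio of two regression slopes (often called the Wald ratio): the effect of genotype on the outcome divided by the effect of genotype on the exposure. The sketch below is a toy under strong assumptions that I have built into the simulation: the genotype affects the outcome only through the exposure, and it is independent of the confounder. All effect sizes are made up.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

rng = random.Random(3)  # fixed seed so the run is repeatable
n_people = 50000
true_effect = 0.3       # assumed causal effect of exposure on outcome

genotype, exposure, outcome = [], [], []
for _ in range(n_people):
    g = (rng.random() < 0.3) + (rng.random() < 0.3)  # allele count 0, 1 or 2
    u = rng.gauss(0, 1)                              # unmeasured confounder
    x = 0.5 * g + u + rng.gauss(0, 1)                # exposure, e.g. a biomarker
    y = true_effect * x + u + rng.gauss(0, 1)        # outcome
    genotype.append(g)
    exposure.append(x)
    outcome.append(y)

naive = slope(exposure, outcome)  # inflated by the confounder
wald = slope(genotype, outcome) / slope(genotype, exposure)
print(f"true effect {true_effect}, naive {naive:.2f}, Wald ratio {wald:.2f}")
```

The naive regression overshoots because the confounder pushes exposure and outcome in the same direction; the genotype-based ratio should land near the true value. Note how large the sample has to be for a modest genetic effect, which is the price of the method.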
But back to the main theme: as molecular biology interfaces with more of medicine, it’s obvious that we shouldn’t just be measuring genomic sequence (whether by genotyping or by sequencing); why not measure RNA? Or metabolites? Or methylation patterns of DNA? Or histone modifications? Or gut microbiota? Each of these, and many other molecular measurements, is becoming increasingly cost effective. They all throw up their own challenges, in particular getting access to the right tissue and processing the data into reliable signals. But, sadly, they all also come without these two useful features of genotypes. Because we know that all these measurements will (or at least could) change over time, one is stuck with the problem of reverse causation, and it is far harder to appeal to some sort of flat, random process governing how unwanted confounders would interact with them.
Although they are similar to genotypes in being molecular measurements, in terms of statistics these are “just another variable” that one can measure on a person. The downside is that in a Case-Control study one might be very confident about the correlation between a (non-genotype) molecular measurement and a phenotype (methylation patterns and a diagnosis of Alzheimer’s, or gut microbiota and weight), but this can never tell you the direction of causality. Even more worrying, one has to be paranoid that the selection of your cases and controls did not involve some “better” explanatory confounder that you didn’t decide to measure. I’ve seen many molecular biologists under-appreciate (or just not know about) these things, in particular the confounding issue, and very often the response is to increase the study size, as if this statistical criticism will be answered by throwing numbers at the problem. Sadly, this is not the case; one has to think about the design.
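A quick toy run shows why throwing numbers at confounding does not help. In this sketch (all quantities invented) a molecular marker has no effect on the phenotype, but an unmeasured confounder drives both. Increasing the sample size a hundredfold shrinks the standard error nicely, and the estimate settles ever more confidently on the wrong answer.

```python
import random

def naive_estimate(n, rng):
    """Slope of phenotype on marker, with its standard error, when the true
    causal effect is zero and an unmeasured confounder drives both."""
    marker, phenotype = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)                    # unmeasured confounder
        marker.append(c + rng.gauss(0, 1))
        phenotype.append(c + rng.gauss(0, 1))  # no dependence on marker
    mx, my = sum(marker) / n, sum(phenotype) / n
    sxx = sum((x - mx) ** 2 for x in marker)
    sxy = sum((x - mx) * (y - my) for x, y in zip(marker, phenotype))
    slope = sxy / sxx
    # Residual sum of squares around the fitted line.
    rss = sum(((y - my) - slope * (x - mx)) ** 2
              for x, y in zip(marker, phenotype))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    return slope, std_err

rng = random.Random(4)  # fixed seed so the run is repeatable
results = {n: naive_estimate(n, rng) for n in (200, 20000)}
for n, (slope, std_err) in results.items():
    print(f"n={n:>6}: estimated effect {slope:.3f} +/- {std_err:.3f} (truth: 0)")
```

More samples buy precision, not accuracy. The bias comes from the design, so only a change of design (or measuring the confounder) can remove it.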
So, what should one do? Clearly these other molecular measurements are at least as interesting as genotypes, if not far more so, being “downstream” of genotypes on the way to disease. I think there are three broad approaches to remove or mitigate these issues.
The first is the way an epidemiologist would do it: work with a Cohort Study. There are serious logistical aspects to this, in particular correlating early events with late ones (as one has to be around to get the right sample at the right time!), but the current cohorts are increasingly sophisticated, and even a small amount of time separation can make all the difference in understanding causation. Equal to this logistical challenge is the cultural problem of molecular biologists and genomics people working with epidemiologists. As at any interface between disciplines, there can be both misunderstandings and a sort of jockeying for position in terms of leadership and recognition. Stepping back, it’s obvious that both fields bring a lot to the table, and this sort of science will only happen in an equal, respectful collaboration which draws on the best of the traditions and behaviours from both sides. This sort of complex dance has been going on for at least the last decade, but I expect it only to increase, and we should welcome it. So, find your nearest cohort study and discuss potential ways of getting at your problem!
The second is more a mitigation strategy than a solution: make sure that, as well as the labile molecular measurements (methylation, RNA, what have you), you also genotype in a case-control study. Then one has at least three variables of interest: the end phenotype (likely to be something disease related), the molecular measurements and the genotypes. The beauty is that with the genotypes you can potentially show, independently, a causative correlation between a genotype and both the other molecular measurement and the phenotype. When this is the same genotype locus, you’ve immediately constrained the problem of reverse causation into a smaller set of possibilities, and although there is certainly the possibility of indirect rather than direct causal links (eg, through some other, unmeasured, molecular pathway) it’s going to be far more persuasive.
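A toy comparison shows what the extra genotype buys you. I simulate two invented worlds: a "forward" one where a genotype sets a molecular level which in turn drives a disease liability score, and a "reverse" one where the same genotype sets the molecular level but it is the disease that alters the molecule. The variable names and effect sizes are assumptions for illustration only, and the disease is a continuous score rather than a diagnosis to keep the code short.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate(reverse, n=20000, seed=5):
    """Correlations between genotype, molecule and disease in one toy world."""
    rng = random.Random(seed)
    genotype, molecule, disease = [], [], []
    for _ in range(n):
        g = (rng.random() < 0.3) + (rng.random() < 0.3)  # allele count
        if reverse:
            d = rng.gauss(0, 1)                      # disease arises on its own
            m = 0.5 * g + 0.5 * d + rng.gauss(0, 1)  # disease alters the molecule
        else:
            m = 0.5 * g + rng.gauss(0, 1)            # genotype sets the molecule
            d = 0.5 * m + rng.gauss(0, 1)            # molecule drives disease
        genotype.append(g)
        molecule.append(m)
        disease.append(d)
    return {"molecule~disease": pearson(molecule, disease),
            "genotype~molecule": pearson(genotype, molecule),
            "genotype~disease": pearson(genotype, disease)}

forward = simulate(reverse=False)
backward = simulate(reverse=True)
for name in forward:
    print(f"{name:18} forward {forward[name]:+.2f}   reverse {backward[name]:+.2f}")
```

The molecule and the disease are well correlated in both worlds, so that correlation alone cannot separate them. The genotype to disease correlation can: it should be clearly non-zero only in the forward world. As noted above, this narrows the possibilities rather than proving a direct link.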
The final one is to build on the second, but also include direct experimentation. If you can manipulate the system experimentally in a cell line or primary culture to show that there is a direct causal link between a genotype and a molecular measurement, then you’ve got really a perfect statistical tool to dissect out the correlation between a molecular measurement and phenotype. George Davey Smith and colleagues have been applying this more in an epidemiological context, via things like instrumental variable analysis; there is no reason these approaches can’t work in this molecular area, but they do require knowing a priori that certain causal links are true.
To reiterate the main point: Case-Control studies can mislead you due to both reverse causation and confounding. Genotype-based Case-Control studies use special properties of genomic data to strongly reduce the chance of being misled. But all other molecular measures will suffer from these problems. If you are a reviewer of a paper or a panel member, remember that study design is more than just numbers; it is also about whether you can answer these questions.
There are a lot of people out there with statistical, epidemiological and genetics backgrounds who know far more about this than me. I would be very interested in both specific and general comments on this post. (thanks for contributions/fixes).