Managing and Analysing Big Data – Part II

This is the second of three blog posts about planning, managing and delivering a ‘big biodata’ project. Here, I share some of my experience and lessons learned in management and analysis – because you can’t have one without the other.


1. Monitor progress – actively!

You need a good structure to monitor progress, particularly if you’re going to have more than 10 samples or experiments. If this is a one-off, use a program that’s well supported at your institute, like FileMaker Pro or… Excel (more on this below). If you’re going to do this a lot, think about investing in a LIMS, which is better suited to routinely handling things at a high level of detail. Whatever you use, make sure your structure invites active updating – you’ll need to stay on top of things, and you don’t want to struggle to find what you need.

2. Excel (with apologies to the bioinformaticians)

Most bioinformaticians would prefer the experimental group not to use Excel for tracking, for very good reasons: Excel provides too much freedom, has extremely annoying (sometimes dangerous) “autocorrect” schemes, fails in weird ways and is often hard to integrate into other data flows. However, it is a pragmatic choice for data entry and project management due to its ubiquity and familiar interface.

Experimental group: before you set up the project tracking in Excel, discuss it with your computational colleagues, perhaps offering a bribe to soften the blow. It will help if the Excel form template comes out of discussions with both groups, and bioinformaticians can set up drop-downs with fixed text where possible and use Excel’s data entry restrictions to (kind of) bullet proof it.

One important thing with Excel: NEVER use formatting or colour as the primary store of meaning. It is extremely hard to extract this information from Excel into other schemes. Also, two things might look the same visually (say, subtly different shades of red) but be computationally as different as red and blue. When presentation matters (often to show progress against targets), you or your colleagues can (pretty easily) knock up a pivot table/Excel formula/Visual Basic solution to turn basic information (one type in each column) into a visually appealing set of summaries.
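To illustrate why meaning belongs in columns rather than in colour: once status is a plain value in its own column, summaries fall out almost for free. This is a minimal sketch with made-up sample names and column headings, using only the standard library (a real sheet would come via an Excel export):

```python
# Sketch: when status lives in a column (not in cell colour), summarising
# progress is trivial. The sheet contents below are purely illustrative.
import csv
import io
from collections import Counter

# Toy stand-in for a tracking sheet exported from Excel as CSV.
sheet = """sample,batch,status
S001,1,sequenced
S002,1,failed_qc
S003,2,sequenced
S004,2,received
"""

rows = list(csv.DictReader(io.StringIO(sheet)))
progress = Counter(r["status"] for r in rows)  # one line replaces colour-counting
print(progress)  # e.g. Counter({'sequenced': 2, ...})
```

Had the statuses been encoded as cell shading instead, this one-liner would become a fight with cell-style internals.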

3. Remember planning?

When you planned the project (you did plan, right?), you decided on which key confounders and metadata to track. So here’s where you set things up to track them, and anything else that’s easy and potentially useful. What’s potentially useful? It’s hard to say. Even if things look trivial, they (a) might not be and (b) could be related to something complex that you can’t track. You will thank yourself later, when you come to regress things out, for having tracked them.

4. Protect your key datasets

Have a data ‘write only’ area for storing the key datasets as they come out of your sequencing core/proteomics core/microscopes. There are going to be sample swaps (have you detected them yet? They will almost certainly be there in any experimental scheme with more than 20 samples), so don’t edit the received files directly! Make sure you have a mapping file, kept elsewhere, showing the relationships between the original data and the new fixed terms.
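One cheap way to enforce this is at the filesystem level: drop the write bits on received files and keep corrections in a separate mapping file. A minimal sketch (all paths and filenames here are illustrative, and a real archive area would not live in a temp directory):

```python
# Sketch: make received raw files read-only so fixes happen via a mapping
# file, never by editing the originals. Paths are purely illustrative.
import stat
import tempfile
from pathlib import Path

raw_area = Path(tempfile.mkdtemp()) / "raw"   # stand-in for your archive area
raw_area.mkdir()

received = raw_area / "lane1_sample_A.fastq"  # hypothetical delivered file
received.write_text("@read1\nACGT\n+\nFFFF\n")
received.chmod(stat.S_IRUSR | stat.S_IRGRP)   # strip all write permission

# Sample-swap fixes live in a separate mapping file, not in the raw data.
mapping = raw_area.parent / "sample_map.csv"
mapping.write_text("delivered_name,corrected_name\nlane1_sample_A,sample_B\n")

print(oct(received.stat().st_mode & 0o777))   # no write bits remain
```

Accidental edits now fail loudly instead of silently corrupting the archive.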

5. Be meticulous about workflow

Keep track of all the individual steps and processes in your analysis. At any point, it should be possible to trace individual steps back to the original starting data and repeat the entire analysis from start to finish.

My approach is to make a new directory with soft-links for each ‘analysis prototyping’, then lock down components for a final run. Others make heavy use of IPython/Jupyter notebooks – you might well have your own tried-and-tested approach. Just make sure it’s tight.
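The soft-link approach above can be sketched in a few lines. This is one possible layout, not a prescribed one; the directory names and file contents are made up for illustration:

```python
# Sketch of the 'new directory with soft-links per analysis prototype' idea.
# Each prototype directory links back to frozen inputs rather than copying.
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / "raw").mkdir()
counts = root / "raw" / "counts.tsv"                 # hypothetical frozen input
counts.write_text("gene\ts1\ts2\nG1\t10\t12\n")

proto = root / "analysis" / "2024-01-15_prototype1"  # one dir per prototype
proto.mkdir(parents=True)
(proto / "counts.tsv").symlink_to(counts)            # link, don't copy

# Any result in the prototype dir traces straight back to the original data.
print((proto / "counts.tsv").resolve() == counts.resolve())  # True
```

Because the link resolves to the untouched original, every prototype is traceable back to the starting data, which is exactly the property you want to preserve for the final locked-down run.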

6. “Measure twice, cut once”

If you are really, really careful in the beginning, the computational team will thank you, and may even forgive you for using Excel. Try to get a test dataset out and put it all the way through as soon as possible. This will give you time to understand the major confounders in the data, and to tidy things up before the full analysis.
You may be tempted to churn out a partial (but in a more limited sense ‘complete’) dataset early, perhaps even for a part-way publication. After some experience playing this game, my lesson learned is to go for full randomisation every time, and not to have a partial, early dataset that breaks the randomisation of the samples against time or key reagents. The alternative is to commit to a separate, early pilot experiment, which explicitly will not be combined with the main analysis. It is fine for this preliminary dataset to be mostly about understanding confounders and looking at normalisation procedures.

7. Communicate

It is all too easy to gloss over the social aspect of this kind of project, but believe me, it is absolutely essential to get this right. Schedule several in-person meetings with ‘people time’ baked in (shared dinner, etc.) so people can get to know each other. Have regular phone calls involving the whole team, so people have a good understanding of where things stand at any given time. Keep a Slack channel or an email list open for all of those little exchanges that help people clarify details and build trust.
Of course there will be glitches – sometimes quite serious – in both the experimental work and the analysis. You will have to respond to these issues as a team, rather than resorting to finger-pointing. Building relationships on regular, open communication raises the empathy level and helps you weather technical storms, big and small.


1. You know what they say about ‘assume’

Computational team: Don’t assume your data is correct as received – everything that can go wrong, will go wrong. Start with unbiased clustering (heat-maps are a great visualisation) and let the data point to sample swaps or large issues. If you collect data over a longer period of time, plot key metrics v. time to see if there are unwanted batch/time effects. For sample swaps, check things like genotypes (e.g. RNAseq-implied to sample-DNA genotypes). If you have samples of both sexes, a sex-chromosome check will catch many sample swaps. Backtrack any suspected swaps with the experimental team and fail suspect samples by default. Sample swaps are the same as bugs in analysis code – be nice to the experimental team so they will be nice when you have a bug in your code.
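As one concrete version of the sex-chromosome check: compare recorded sex against expression of sex-informative genes such as XIST (high in females) and a Y-linked gene like RPS4Y1. The sketch below uses toy expression values and an illustrative cutoff; a real check would use your normalised expression matrix and calibrated thresholds:

```python
# Sketch: flag possible swaps by comparing recorded sex with expression of
# sex-informative genes. All values and the cutoff are purely illustrative.
def inferred_sex(xist, y_gene, cutoff=5.0):
    """Crude call from normalised expression: XIST high + Y-gene low => F."""
    if xist > cutoff and y_gene < cutoff:
        return "F"
    if y_gene > cutoff and xist < cutoff:
        return "M"
    return "ambiguous"

samples = {  # sample -> (recorded sex, XIST expr, RPS4Y1 expr); toy numbers
    "S1": ("F", 9.1, 0.2),
    "S2": ("M", 0.3, 8.7),
    "S3": ("M", 8.8, 0.4),   # recorded M but looks F: candidate swap
}

flags = [s for s, (rec, xist, y) in samples.items()
         if inferred_sex(xist, y) not in (rec, "ambiguous")]
print(flags)  # ['S3']
```

Flagged samples then go back to the experimental team for backtracking, and fail by default until resolved.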
Experimental team: Don’t assume the data is correct at the end of an analytical process. Even with the best will in the world, careful analysis and detailed method-testing, mistakes are inevitable, so flag results that don’t feel right to you. Repeat appropriate sample-identity checks at key time points. At absolute minimum, you should perform checks after initial data receipt and before data release.

2. One thing you can assume

You can safely assume that there are many confounders to your data. But thanks to careful planning, the analysis team will have all the metadata the experimental team has lovingly stored to work with.
Work with unsupervised methods (e.g. straight PCA; we’re also very fond of PEER in my group), and correlate the components with the known covariates. Put the big ones in the analysis, or even regress them out completely (it’s usually best to put them in as terms in the downstream analysis). Don’t be surprised by strongly structured covariates that you didn’t capture as metadata. Once you have convinced yourself that you are really not interested, move on.
(Side note on PCA and PEER: take out the means first, and scale. Otherwise, your first PCA component will be the means, and everything else will have to be orthogonal to that. PEER, in theory, can handle that non-orthogonality, but it’s a big ask, and the means in particular are best removed. This means it is all wrapped up with normalisation, below.)
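The centre-scale-then-PCA-then-correlate loop can be sketched with plain numpy. Everything here is synthetic: a toy matrix with a planted batch effect standing in for a real known covariate:

```python
# Sketch of the 'centre, scale, then PCA, then correlate PCs with known
# covariates' step, using numpy SVD; the data are entirely synthetic.
import numpy as np

rng = np.random.default_rng(0)
batch = np.array([0, 0, 0, 1, 1, 1])                      # known covariate
X = rng.normal(size=(6, 4)) * 0.1 + batch[:, None] * 3.0  # strong batch effect

# Take out the means and scale first (see side note above).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

U, S, Vt = np.linalg.svd(Z, full_matrices=False)
pc1 = U[:, 0] * S[0]                                      # first principal component

r = np.corrcoef(pc1, batch)[0, 1]
print(abs(r) > 0.9)  # PC1 is essentially the batch effect in this toy data
```

A covariate correlating this strongly with a leading component is exactly the kind you either put into the downstream model or regress out.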

3. Pay attention to your reagents

Pay careful attention to key reagents, such as antibody or oligo batches, on which your analysis will rely. If they are skewed, all sorts of bad things can happen. If you notice your reagent data is skewed, you’ll have to make some important decisions. Your carefully prepared randomisation procedure will help you here.

4. The new normal

It is highly unlikely that the raw data can just be put into downstream analysis schemes – you will need to normalise. But what is your normalisation procedure? Lately, my mantra is, “If in doubt, inverse normalise.” Rank the data, then project those ranks back onto a normal distribution. You’ll probably lose only a bit of power – the trade-off is that you can use all your normal parametric modelling without worrying (too much) about outliers. 
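The “rank, then project back onto a normal distribution” recipe is small enough to write out in full. This is a minimal sketch using only the standard library; tie handling and the choice of rank offset are deliberately naive here:

```python
# A minimal sketch of rank-based inverse normal transformation, using only
# the standard library. Tie handling is deliberately naive.
from statistics import NormalDist

def inverse_normalise(values):
    """Rank values, then map ranks onto standard-normal quantiles."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)  # offset avoids +/-inf
    return out

skewed = [0.1, 0.2, 0.3, 0.5, 120.0]   # one wild outlier
z = inverse_normalise(skewed)
print(z[-1])  # the outlier is tamed to the top normal quantile, not 120
```

Note how the 120.0 outlier ends up as just the largest normal quantile: the ordering survives, the leverage does not, which is precisely the trade-off described above.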
You need to decide on a host of things: how to correct for lane biases, GC, library complexity, cell numbers, plate effects in imaging. Even using inverse normalisation, you can do this in all sorts of ways (e.g. in a genome direction or a per-feature direction – sometimes both) so there are lots of options, and no automatic flow chart about how to select the right option.
Use an obvious technical normalisation to start with (e.g. read depth, GC, plate effects), then progress to a more aggressive normalisation (i.e. inverse normalisation). When you get to interpretation, you may want to present things in the lighter, more intuitive normalisation space, even if the underlying statistics are more aggressive.
You’ll likely end up with three or four solid choices through this flow chart. Choose the one you like on the basis of first-round analysis (see below). Don’t get hung up on regrets! But if you don’t discover anything interesting, come back to this point and choose again. Taking a more paranoid approach, using two normalisation schemes through the analysis will give you a bit of extra security – strong results will not change too much on different “reasonable” normalisation approaches.

5. Is the data good?

Do a number of ‘data is good’ analyses.

  • Can you replicate the overall gene-expression results? 
  • Is the SNP Ts/Tv ratio good? 
  • Is the number of rare variants per sample as expected? 
  • Do you see the right combination of mitotic-to-nonmitotic cells in your images? 
  • Where does your dataset sit, when compared with other previously published datasets? 

These answers can guide you to the ‘right’ normalisation strategy – so flipping between normalisation procedures and these sorts of “validation” analyses helps you make the choice of normalisation.
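One of the checks above, the SNP transition/transversion ratio, is cheap to compute once you have calls. A sketch with toy variant tuples (a real pipeline would read these from a VCF); as a rough rule of thumb, whole-genome human call sets land around 2.0-2.1, with exomes higher:

```python
# Sketch of one 'data is good' check: the SNP transition/transversion ratio.
# The variant tuples are toy data; real pipelines read these from a VCF.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")

toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ts_tv_ratio(toy_snps))  # 3.0 on this toy set
```

A ratio drifting towards 0.5 (the random-noise expectation) is a classic sign that a call set is full of artefacts.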

6. Entering the discovery phase

‘Discovery’ is a good way to describe the next phase of the analysis, whether it’s differential-expression or time-course or GWAS. This is where one needs to have quite a bit more discipline in how to handle the statistics.

First, use a small (but not too small) subset of the data to test your pipelines (in Human, I am fond of the small, distinctly un-weird chromosome 20). If you can make a QQ plot, check that the QQ plot looks good (i.e. calibrated). Then, run the whole pipeline.

7. False-discovery check

Now you’re ready to apply your carefully thought-through ‘false discovery rate’ approach, ideally without fiddling around. Hopefully your QQ plot looks good (calibrated with a kick at the end), and you can roll out false discovery control now. Aim to do this just once (and when that happens, be very proud). 

8. There is no spoon

At this point you will either have some statistically interesting findings above your false discovery rate threshold, or you won’t have anything above threshold. In neither case should you assume you are successful or unsuccessful. You are not there yet.

9. Interesting findings

You may have nominally interesting results, but don’t trust the first full analysis. Interesting results often enrich for errors and artefacts from earlier in your process. Be paranoid about the underlying variant calls, imputation quality or sample issues.
Do a QQ plot (quantile-quantile plot of the P values, expected v. observed). Is the test well calibrated (i.e. does the QQ plot start on expected == observed, with a kick at the end)? If you can’t do a straight-up QQ plot, carry out some close alternative so you can get a frequentist P value out. In my experience, a bad QQ plot is the easiest way to spot dodgy whole-genome statistics.
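A numeric stand-in for eyeballing the QQ plot: under the null, sorted P values should track their expected uniform quantiles, so the observed/expected ratio in the bulk of the distribution should sit near 1. This sketch uses simulated null P values rather than a real plot:

```python
# Sketch: a numeric stand-in for eyeballing a QQ plot. Under the null,
# sorted P values should track their expected uniform quantiles.
import random

random.seed(42)
pvals = sorted(random.random() for _ in range(10000))  # toy null P values

n = len(pvals)
expected = [(i + 0.5) / n for i in range(n)]

# Median observed/expected ratio near 1 ~= 'QQ plot starts on the diagonal'.
mid = n // 2
ratio = pvals[mid] / expected[mid]
print(round(ratio, 2))
```

A ratio well below 1 in the bulk (everything "too significant") is the numeric signature of the inflated QQ plots that flag dodgy whole-genome statistics; the "kick at the end" lives only in the smallest P values.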
Spot-check that things make sense up to here. Take one or two results all the way through a ‘manual’ analysis. Plot the final results so you can eyeball outliers and interesting cases. Plot in both normalisation spaces (i.e. ‘light’ and aggressive/inverse).

For genome-wide datasets, have an ‘old hand’ at genomes/imaging/proteomics eyeball either all results or a random subset on a browser. When weird things pop up (“oh, look, it’s always in a zinc finger!”), they might offer an alternative (and hopefully still interesting, though often not) explanation. Talk with colleagues who have done similar things, and listen to the war stories of nasty, subtle artefacts that mislead us all.

10. ‘Meh’ findings

If your results look uninteresting:

  • Double check that things have been set up right in your pipeline (you wouldn’t be the first to have a forehead-smacking moment at this point). 
  • Put dummy data that you know should be right into the discovery pipeline to test whether it works. 
  • Triple-check all the joining mechanisms (e.g. the genotype sample labels with the phenotype). 
  • Make sure incredibly stupid things have not happened – like the compute farm died silently, and with spite, making the data files look valid when they are in fact… not.
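The dummy-data check in the list above is worth making concrete: plant a signal you know is there, push it through, and confirm the pipeline can find anything at all. The sketch below uses a toy correlation in place of your real discovery code (which is the hypothetical part here):

```python
# Sketch: push dummy data with a planted signal through the pipeline to
# prove the pipeline can find anything at all. 'toy_association' is a toy
# stand-in for your real discovery code.
import random

random.seed(1)
genotype = [random.choice([0, 1, 2]) for _ in range(200)]
phenotype = [g * 2.0 + random.gauss(0, 0.5) for g in genotype]  # planted effect

def toy_association(x, y):
    """Toy stand-in: Pearson correlation instead of your full model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = toy_association(genotype, phenotype)
print(r > 0.9)  # a working pipeline must recover a signal this loud
```

If the positive control does not light up, the problem is the pipeline (or the joins, or the compute farm), not the biology.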

11. When good data goes bad

So you’ve checked everything, and confirmed that nothing obvious went wrong. At this point, I would allow myself some alternative normalisation approaches, FDR thresholding or confounder manipulation. But stay disciplined here.

Start a mental audit of your monkeying around (penalising yourself appropriately in your FDR). I normally allow four or five trips on the normalisation merry-go-round or on the “confounders-in-or-out” wheel. What I really want out of these rides is to see a P value / FDR rate that’s around five-fold better than a default threshold (of, say, 0.1 FDR, so hits at 0.02 FDR or better).

If this is a genome-wide scan, you are often struggling with the multiple-testing burden here. If you are not quite there with your FDRs, here are some tricks: examine whether just using protein-coding genes will help the denominator, and look at whether restricting by expression level/quantification helps (i.e. removing lowly expressed genes in which you couldn’t find a signal anyway).
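The denominator trick can be shown with Benjamini-Hochberg written out by hand. The gene names and numbers below are toy data; note that the filter (here, expression level) must be chosen independently of the test statistics, never on the P values themselves:

```python
# Sketch: Benjamini-Hochberg by hand, showing how shrinking the test
# denominator (dropping hopeless tests *a priori*) rescues borderline hits.
def bh_reject(pvals, alpha=0.05):
    """Return the number of rejections under Benjamini-Hochberg."""
    m = len(pvals)
    k = 0
    for i, p in enumerate(sorted(pvals), start=1):
        if p <= i * alpha / m:
            k = i  # largest rank whose P value clears its BH threshold
    return k

tests = [  # (gene, mean expression, P value); all toy numbers
    ("G1", 50.0, 0.02), ("G2", 40.0, 0.03), ("G3", 30.0, 0.04),
    ("G4", 0.1, 0.9), ("G5", 0.2, 0.9),
]

all_p = [p for _, _, p in tests]
print(bh_reject(all_p))  # 0: nothing survives with m=5

# Filter on expression (independent of the P values!) before testing.
expressed = [p for _, expr, p in tests if expr > 1.0]
print(bh_reject(expressed))  # 3: the same hits pass with m=3
```

The same three borderline P values go from zero rejections to three purely because the two unexpressed genes no longer inflate the denominator.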

You may still have nothing over threshold. So, after a drink/ice cream, open up your plan to the “Found Nothing Interesting” chapter (you did that part, right?) and follow the instructions.

Do stop monkeying around if you can’t get to that 0.02 FDR. You could spend your career chasing will-o’-the-wisps if you keep doing this. You have to be honest with yourself: be prepared to say “There’s nothing here.” If you end up here, shift to salvage mode (it’s in the plan, right?).

12. But is it a result?

Hopefully you have something above threshold, and are pretty happy as a team. But is it a good biological result? Has your FDR merry-go-round actually been a house of mirrors? You don’t want to be in any doubt when you go to pull that final trigger on a replication / validation experiment.
It may seem a bit shallow, but look at the top genes/genomic regions, and see if there is other interesting, already-published data to support what you see. I don’t, at this point, trust GO analysis (which often is “non random”), but the Ensembl phenotype-per-gene feature is really freakily useful (in particular its ‘phenotypes in orthologous genes’ section), as is the UniProt comments section. Sometimes you stumble across a completely amazing confirmation at this point, from a previously published paper.
But be warned: humans can find patterns in nearly everything – clouds, leaf patterns, histology, and Ensembl/UniProt function pages. Keep yourself honest by inverting the list of genes, and look at the least interesting genes from the discovery process. If the story is just as consistent at the bottom as at the top, I’d be sceptical that this list actually provides confirmation. Cancer poses a particular problem: half the genome has been implicated in one type of cancer or another by some study.
Sometimes, though, you just have a really clean discovery dataset with no previous literature support, and you need to run the replication without any further reassurance that your statistics are confirming something valuable.

13. Replication/Validation

Put your replication/validation strategy into effect. You might have baked it into your original discovery. Once you are happy with a clean (or as clean as you can get) discovery and biological context, it’s time to pull the trigger on the strategy. This can be surprisingly time consuming.
If you have specific follow-up experiments, sort some of them out now and get them going. You may also want to pick out some of the juiciest results to get additional experimental data to show them off. It’s hard to predict what the best experiment or analysis will be; you can only start thinking about these things when you get the list.
My goal is for the replication / validation experiments to be as unmanipulated as possible, and you should be confident that they will work. It’s a world of pain when they don’t!

14. Feed the GO

With the replication/validation strategy underway, your analysis can now move onto broader things, like the dreaded GO enrichments. Biology is very non-random, so biological datasets will nearly always give some sort of enriched GO terms. There are weird confounders both in terms of genome structure (e.g. developmental genes are often longer on the genome) and confounders in GO annotation schemes.
Controlling for all this is almost impossible, so this is more about gathering hints to chase up in a more targeted analysis. Or to satisfy the “did you do GO enrichment?” question that a reviewer might ask. Look at other things, like related datasets, or orthologous genes. If you are in a model organism, Human is a likely target. If you are in Human, go to mouse, as the genome-wide phenotyping in mouse is pretty good now. Look at other external datasets you can bring in, for example chromatin states on the genome, or lethality data in model organisms.

15. Work up examples

Work up your one or two examples, as these will help people understand the whole thing. Explore data visualisations that are both appealing and informative. Consider working up examples of interesting, expected and even boring cases.

16. Serendipity strikes!

Throughout this process, always be ready for serendipity to strike. What might look like a confounder could turn out to be a really cool piece of biology – this was certainly the case for us, when we were looking at CTCF across human individuals and found a really interesting CTCF behaviour involved in X inactivation. 
My guess is that serendipity has graced our group in one out of every ten or so projects – enough to keep us poised for it. But serendipity must be approached with caution, as it could just be an artefact in your data that simply lent itself to an interesting narrative. If you’ve made an observation and worked out what you think is going on, you almost have to create a new discovery process, as if this was your main driver. It can be frustrating, because you might now not have the ideal dataset for this biological question. In the worst case, you might have to set up an entirely new discovery experiment.
But often you are looking at a truly interesting phenomenon (not an artefact). In our CTCF paper, the very allele-specific behaviour of two types of CTCF sites we found was the clincher: this was real (Figure 5C). That was a glorious moment.

17. Confirmation

When the replication dataset comes in, slot it into place: ideally it confirms your discovery, the replication experiments fit in nicely, and your cherry-on-the-cake experiment or analysis shows off the top result.

18. Pat on the back if it is boring

The most important thing to know is that sometimes, nothing too interesting will come out of the data. Nobody can get a cool result out of every large-scale experiment. These will be sad moments for you and the team, but be proud of yourself when you don’t push a dataset too far – and for students and postdocs, this is why having two projects is often good. You can still publish an interesting approach, or a call for larger datasets. It might be less exciting, but it’s better than forcing a result.

