I was discussing with a maths-minded friend the difference between “quantitative” and “non-quantitative” science, mainly how biology had to get its quantitative mojo back, and I said that a good proxy for whether someone is “quantitative” or not is whether they are at home with logarithms – do they use them, are they comfortable with logs of numbers between 0 and 1, can they read log plots?
Logarithms are extremely useful in many scenarios. To remind readers who are a bit rusty – a logarithm is the power to which you have to raise one number (called the “base” of the logarithm) to make another number. You set the base as a constant, and for most scenarios the base actually doesn’t matter (it will change some offsets or scales, but not the shape of anything). So – if we set the base to 10 (notation is log10), then log10 of 100 is 2 (10 squared is 100), log10 of 1000 is 3 (1000 is 10 cubed, so 10 to the power of 3), etc. log10 of 1 is 0, and that’s true for every base. A slightly looking-glass world happens between 0 and 1, where the logs of these numbers are negative: log10 of 1/100 is -2, as it is 1 over 10 squared. Logs are not defined for 0 or below (at least for real numbers – one would have to use complex numbers for negative values). Notice that the “space” between 0 and 1 maps to the space between 0 and negative infinity – one of the nice illustrations of how rational numbers have an infinite number of numbers between them.
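For anyone who would rather poke at this than read it, here is a minimal sketch in Python using only the standard math module. The helper safe_log10 is a made-up name for illustration, not something from a library; it just makes the “not defined for 0 or below” point explicit.

```python
import math

# log10 answers the question: "10 to what power gives x?"
log_of_100 = math.log10(100)            # 10 squared is 100
log_of_1000 = math.log10(1000)          # 10 cubed is 1000
log_of_1 = math.log10(1)                # anything to the power 0 is 1
log_of_hundredth = math.log10(1 / 100)  # 1 over 10 squared

# The interval between 0 and 1 stretches out over all the negative numbers:
# each extra factor of 10 towards 0 moves the log one step further down.
shrinking = [10.0 ** -k for k in range(1, 7)]
logs_of_shrinking = [math.log10(v) for v in shrinking]


def safe_log10(x):
    """Hypothetical helper: log10(x), or None where the real log is undefined."""
    if x <= 0:
        return None
    return math.log10(x)


print(log_of_100, log_of_1000, log_of_1, log_of_hundredth)
print(logs_of_shrinking)
print(safe_log10(0), safe_log10(-5))
```

Calling math.log10 directly on 0 or a negative number raises a ValueError, which is Python's way of saying the same thing.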
When I was first taught about logarithms I found the base both a bit inelegant and also a bit annoying – what base should I “use”? But I’ve come to realise that the base just scales things. log2(10) is 3.3(ish) and log2(100) is 6.6(ish), while log10(10) is 1 and log10(100) is 2: in both bases the second number is double the first. There are 3 commonly used bases: base 10, which might be verbalised as “orders of magnitude”, e.g. “we are going to need two orders of magnitude more disk for that metagenomics experiment”; base 2, which is verbalised as “bits”, and is particularly useful for probabilities/information content, e.g. “the distribution of amino acids in this column has at least 8 bits of information”. The final base is “e” – one of those magical mathematical constants which pop up all over the place (along with Pi, 1, 0 and others). e can be defined in a variety of ways (I always like the simple sum 1 + 1/1 + 1/(1*2) + 1/(1*2*3)… etc. until infinity…). loge has a large number of nice mathematical properties, so it’s called the “natural” logarithm, and the “e” is usually dropped, so it is just written as log (or ln). But – don’t forget – the base doesn’t really matter – it’s the logarithm aspect that counts.
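A rough sketch of both points, again in plain Python: switching base only multiplies every log by the same constant, and the simple sum for e converges very quickly. The function names log_base and e_partial_sum are illustrative choices, not standard library functions.

```python
import math


def log_base(x, base):
    """Log of x in an arbitrary base, via the change-of-base rule."""
    return math.log(x) / math.log(base)


# Changing base rescales every logarithm by one fixed factor.
# Going from base 10 to base 2, that factor is log2(10), about 3.32.
xs = [10, 100, 12345]
scale_factors = [math.log2(x) / math.log10(x) for x in xs]


def e_partial_sum(n_terms):
    """Sum the first n_terms of 1 + 1/1 + 1/(1*2) + 1/(1*2*3) + ..."""
    total = 0.0
    term = 1.0  # this is 1/0!, the first term of the sum
    for k in range(n_terms):
        total += term
        term /= (k + 1)  # turn 1/k! into 1/(k+1)!
    return total


print(math.log2(10), math.log2(100))  # roughly 3.3 and 6.6
print(scale_factors)                  # the same constant each time
print(e_partial_sum(15), math.e)      # the partial sum is already very close
print(log_base(8, 2))                 # 2 cubed is 8
```

Because the factor is the same for every number, a plot in log2 and a plot in log10 have exactly the same shape; only the axis labels differ.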
So – why use them? Here are 5 good reasons to use logs in biology (and many other places!):
- Using the logarithm of a scale makes multiplications of that scale just change an offset – this is because log(xy) = log(x) + log(y), so multiplying something by 2 or 10 just adds a constant number. There are quite a few experimental things where properties stay constant under multiplication – doubling the experiment might double the variation in absolute terms, but it will just give an offset in log space. As technology development is often a multiplier on the previous year’s technology development, log space is a much better way to look at technology development, whether it is disk space, sequencing output or bandwidth. (As an aside, this is why many financial charts are better read in log(price) rather than price; if you are interested in volatility, for example, you are interested not in its absolute level, but rather its multiplicative level.)
- log(scale) is a very pragmatic way to compress a large range of numbers, even if you are not sure of the reason why you’ve got such a large range. This is very common in biology where, for example, one or two genes might be pumping out RNA at an extreme level, whereas all the other genes are just doing their normal thing. Plot this on a linear scale and everything has to fit around these extreme cases. A log scale is often a far nicer scheme to see everything simultaneously, even if there is not a good “reason” for using a log scale. Which leads nicely onto the next case…
- There are a whole bunch of processes which involve an exponential decay or exponential increase (due, for example, to something happening with a constant probability of changing). Here the log() of the readout might well give nice straight lines – for example, frequencies of alleles in the population are often better plotted on a log scale. Here not only does the log scale let you get everything into one plot, but it is also “the right space to work in” – complex-looking behaviours might end up looking like (for example) two straight lines in log space.
- log(probability) both (a) stretches the whole probability space out and (b) “correctly” weights the edges of the space over the centre of the space. For example, something going from 80% to 90% to 95% to 97.5% and then 98.75% accuracy is halving its error rate each time. Plotting the log of (1-accuracy) (i.e. log of the error) shows this trend far better than looking at these numbers. Without taking logs it’s easy to think that there is not much difference between 99% accurate and 99.99% accurate. In fact, there is a huge difference: the error rate is 100 times smaller. When you are dealing with likelihoods (probabilities of something happening under a model), the “raw” likelihood is very hard to have any intuition about – log(likelihood) turns these into large negative numbers (and the only gotcha is that you are looking for the smaller negative number, i.e. the one closer to 0, as the “better” one).
- Ratios are often better visualised, and sometimes better used, as log(ratio). Raw ratios of two positive things (x/y) will have everything crammed between 0 and 1 when y is greater than x, but when x is greater than y the ratio can seemingly go on forever. Plotting this looks odd. log(x/y) though is nicely balanced, with just the same amount of “space” on either side of 0 (0 being 1:1).
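Several of the reasons above are easy to check numerically. The following is a small sketch in Python with made-up numbers (the values, the starting level of 1000 and the halving rate are all assumptions for illustration, not real data): multiplication turns into an offset, exponential decay turns into even steps, halving the error rate turns into even steps, and ratios become symmetric around 0.

```python
import math

# Multiplication becomes an offset: log(x*y) = log(x) + log(y).
# Doubling any value shifts its log10 by the same amount, log10(2).
values = [3.0, 50.0, 700.0]
doubling_offsets = [math.log10(2 * v) - math.log10(v) for v in values]

# Exponential decay becomes a straight line: a quantity that halves at
# every time step drops by exactly one unit per step in log2 space.
decay = [1000.0 * 0.5 ** t for t in range(6)]
decay_steps = [math.log2(b) - math.log2(a) for a, b in zip(decay, decay[1:])]

# Accuracy creeping up from 80% to 98.75% is the error halving each time,
# which again shows up as even steps in log2(error).
accuracies = [0.80, 0.90, 0.95, 0.975, 0.9875]
log_errors = [math.log2(1 - a) for a in accuracies]
error_steps = [b - a for a, b in zip(log_errors, log_errors[1:])]

# Ratios: 4:1 and 1:4 sit at 4 and 0.25 on a raw scale, but at equal
# distances either side of 0 on a log scale, with 1:1 at exactly 0.
log_ratio_up = math.log2(4 / 1)
log_ratio_down = math.log2(1 / 4)
log_ratio_even = math.log2(1 / 1)

print(doubling_offsets)
print(decay_steps)
print(error_steps)
print(log_ratio_up, log_ratio_down, log_ratio_even)
```

The same idea carries over to plotting: putting the axis on a log scale is equivalent to plotting these logged values on an ordinary linear axis.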