Seven tips for bio-statistical analysis of gene expression data

Jo Vandesompele - Dec 11, 2013

Many scientists have a love-hate relationship with statistics. Personally, I didn’t like statistics (at all) during my master’s degree education [1]: it was too theoretical, and I didn’t see the utility of it. Only when I generated my first data during my PhD research did I start to realize the necessity and power of bio-statistics. Later, I almost fell in love with statistics after reading Intuitive Biostatistics by Harvey Motulsky. This excellent book is written by an author who graduated from medical school, which probably explains why it contains only the most pertinent formulas. I particularly appreciate the book because it really is intuitive; it almost reads like a novel, and you could read it in bed, next to the fireplace with a glass of your favorite wine, or even when you’re on holiday. If you have always felt the need to sharpen your basic bio-statistics skills, then this book may be just the thing for you.

In August 2013, Nature Methods launched a new column, ‘Points of Significance’, devoted to statistics. The corresponding article on Nature Methods’ methagora blog holds a continuously updated list of the Points of Significance articles. I find these very valuable.

Obviously, this blog post does not aim to serve as a crash course on statistics. Instead, I would like to focus on seven very simple but fundamental principles for doing bio-statistics yourself, especially when you’re handling gene expression data. Some of this information is also available in a book chapter that Jan Hellemans and I wrote for “PCR Troubleshooting and Optimization: The Essential Guide” (Caister Academic Press, 2011, ISBN: 978-1-904455-72-1): Quantitative PCR data analysis – unlocking the secret to successful results.

1. Always log transform your gene expression data [2]

Figure: gene expression levels, log vs linear scale

Relative gene expression levels are heavily skewed in linear scale: half of the data points (the lower expressed genes) lie between 0 and 1 (with 1 meaning no change), and the other half (the higher expressed genes) between 1 and positive infinity. Consider the case where the normalized expression levels are 0.1 (A), 1 (B) and 10 (C) for 3 samples (A-C) under study (figure above, left panel). Intuitively, we understand that sample A has a ten-fold lower expression compared to sample B, and that C has a ten-fold higher expression compared to B. However, in linear scale A and B are much closer (more similar) to each other than B and C (0.9 units versus 9 units). A parametric statistical test will therefore be biased and will not appreciate that A and C are equally different from B. Upon log transformation (I use base 10 here, but any base will do), the distance between A and B and the distance between B and C become equal (1 log10 unit, as the log10 values of A, B, and C are -1, 0 and 1) (figure above, right panel). Log transformation makes your data more symmetrical, and a parametric statistical test will therefore provide you with a more accurate and relevant answer.
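
The arithmetic of this example can be checked with a few lines of Python. This is only a small sketch using the three values from the text; the variable names are my own.

```python
import math

# Normalized expression levels from the example (1 means no change).
levels = {"A": 0.1, "B": 1.0, "C": 10.0}

# Distances to sample B in linear scale.
linear_ab = abs(levels["B"] - levels["A"])
linear_bc = abs(levels["C"] - levels["B"])

# The same distances after log10 transformation.
logged = {name: math.log10(value) for name, value in levels.items()}
log_ab = abs(logged["B"] - logged["A"])
log_bc = abs(logged["C"] - logged["B"])

print(f"linear: |B-A| = {linear_ab:.1f}, |C-B| = {linear_bc:.1f}")
print(f"log10:  |B-A| = {log_ab:.1f}, |C-B| = {log_bc:.1f}")
```

In linear scale the two distances differ by a factor of ten, whereas after log transformation both equal 1 log10 unit.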

2. Consider pairing

Paired information means that values in one group are related to the values in the other group. Typical examples are gene expression levels measured on different cell lines treated with a compound or vehicle control, or measurements on mice before and after an intervention. If there is pairing information in your data set, then you should use a statistical test that takes this information into account. If you ignore the pairing, there is a big chance that you will miss a significant effect, because it is the pairing that cancels out sample-specific differences.
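
To illustrate why pairing matters, here is a minimal sketch that computes the t statistic with and without the pairing information. The log10 expression values are invented for illustration (4 hypothetical cell lines with very different baseline levels and a small, consistent treatment effect), and the helper functions are my own, not taken from any package.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic of the paired t-test (computed on the per-sample differences)."""
    diffs = [a - b for b, a in zip(before, after)]
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / sem

def unpaired_t(group_x, group_y):
    """t statistic of Student's t-test with pooled variance (pairing ignored)."""
    nx, ny = len(group_x), len(group_y)
    pooled_var = ((nx - 1) * statistics.variance(group_x)
                  + (ny - 1) * statistics.variance(group_y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(group_y) - statistics.mean(group_x)) / se

# Hypothetical log10 expression levels; position i is the same cell line in both lists.
vehicle = [0.10, 0.90, 1.70, 2.50]
compound = [0.42, 1.18, 2.03, 2.77]

t_paired = paired_t(vehicle, compound)      # 3 degrees of freedom
t_unpaired = unpaired_t(vehicle, compound)  # 6 degrees of freedom

# Two-sided 5% critical values of Student's t: 3.182 (df = 3) and 2.447 (df = 6).
print(f"paired t = {t_paired:.2f}, significant: {abs(t_paired) > 3.182}")
print(f"unpaired t = {t_unpaired:.2f}, significant: {abs(t_unpaired) > 2.447}")
```

With these made-up numbers, the paired statistic is far above its critical value, while the unpaired statistic is not, because the large differences between cell lines swamp the treatment effect.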

3. Choose a proper statistical test before you start your analysis

Choosing between a rank based (non-parametric) test (e.g. Mann-Whitney for comparing two unpaired groups) and a parametric test (e.g. Student’s t-test for comparing two unpaired groups) is perhaps the most difficult task. If your sample size is large enough (opinions differ here, but 2 dozen is a safe number), it does not really matter what type of test you use, as you can rely on the central limit theorem, which states that the means of sufficiently large samples are approximately normally distributed. However, when the sample size is small, you cannot rely on this theorem. Here, a parametric test may be too optimistic (P-value too low), and a non-parametric test rather conservative (higher P-value). Being better safe than sorry, I generally recommend using a non-parametric test; if you get a significant (non-parametric) P-value when working with small sample sizes, it is more likely to reflect a true difference. Of note, there is a lower limit to the sample size when applying a non-parametric test (see tip 6). Importantly, decide on your test before you start your analysis and stick to it. Don’t be tempted to try different statistical tests and pick the one that supports your hypothesis.
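
For readers who want to see what a rank based test actually does, the following is a minimal sketch of an exact two-sided rank-sum (Mann-Whitney) test that enumerates every possible assignment of ranks to the first group. It assumes there are no tied values and is only practical for small samples; the function name and the example data are hypothetical.

```python
from itertools import combinations

def exact_rank_sum_p(group_a, group_b):
    """Exact two-sided P-value of the rank-sum test (sketch; no tied values allowed)."""
    pooled = sorted(list(group_a) + list(group_b))
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch does not handle tied values")
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_a, n_total = len(group_a), len(pooled)
    # Twice the expected rank sum of group A under the null hypothesis (kept as an integer).
    expected_x2 = n_a * (n_total + 1)
    observed_dev = abs(2 * sum(rank[v] for v in group_a) - expected_x2)
    extreme = 0
    total = 0
    # Under the null hypothesis every subset of ranks is equally likely for group A.
    for subset in combinations(range(1, n_total + 1), n_a):
        total += 1
        if abs(2 * sum(subset) - expected_x2) >= observed_dev:
            extreme += 1
    return extreme / total

# Hypothetical log10 expression values for two unpaired groups of 4 samples.
control = [-0.21, 0.05, 0.12, -0.08]
treated = [0.65, 0.91, 0.48, 0.77]
p_value = exact_rank_sum_p(control, treated)
print(f"exact two-sided P-value = {p_value:.4f}")
```

Because only the ranks are used, the result depends on the ordering of the values and not on their magnitude, which is what makes the test robust for small samples.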

4. Don’t underestimate the value of confidence intervals

It’s good practice to add error bars to calculated results. However, it seems that scientists are sometimes uncertain what type of error bars should be used. For normalized relative quantities, the compound error should reflect the variance of the measured replicates of the target of interest and the reference genes, as well as the uncertainty of the estimated PCR efficiency (and optionally the errors related to the inter-run calibration procedure), along with correct propagation of all error values (see Hellemans et al., Genome Biology, 2007 for relevant formulas for error propagation). In this scenario, and for a single biological sample, the standard error of the mean (SEM) is a much more appropriate measure of uncertainty than the standard deviation (the latter only being a measure of scatter or measurement variability). However, when calculating the average of all samples in a group, the most useful measure of uncertainty is the confidence interval. Its interpretation is very straightforward: you can be 95% confident that a 95% confidence interval around the group mean contains the true population mean.
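
As a rough sketch of how a 95% confidence interval around a group mean can be calculated, the example below works on log10 transformed values (in line with tip 1) and transforms the result back to linear scale. The relative quantities are invented, the function name is my own, and the small lookup table contains the standard two-sided 95% critical values of Student’s t distribution for 2 to 5 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t distribution, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def ci95_log10(relative_quantities):
    """Return (lower, geometric mean, upper) of a 95% CI computed in log10 scale."""
    logs = [math.log10(q) for q in relative_quantities]
    n = len(logs)
    mean = statistics.mean(logs)
    sem = statistics.stdev(logs) / math.sqrt(n)
    margin = T_CRIT_95[n - 1] * sem  # only 3 to 6 replicates are supported in this sketch
    # Back-transform to linear scale; the interval becomes asymmetric around the mean.
    return tuple(10 ** value for value in (mean - margin, mean, mean + margin))

# Hypothetical normalized relative quantities of 4 biological replicates in one group.
group = [1.8, 2.4, 3.1, 2.0]
lower, center, upper = ci95_log10(group)
print(f"geometric mean = {center:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Note that this sketch only captures the variation between biological replicates; it does not include the error propagation for reference genes and PCR efficiency described above.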

5. Correction of the P-value is needed when testing multiple hypotheses

The generally accepted threshold (alpha value or type I error) of 0.05 to designate statistical significance is only appropriate when applying a single statistical test to verify a single hypothesis. As soon as you are testing multiple hypotheses (e.g. assessing the differential expression of more than 1 target), the accepted threshold of 0.05 should be modified. In simple words, the more tests you run, the lower the threshold should be to keep the false positive rate under control. When 10 independent hypotheses are tested, there is a 40% probability (1-(1-0.05)^10) of one or more P-values being less than 0.05 by chance (false positives). According to the conservative Šidák method for multiple testing correction, a threshold of 0.0051 (1-(1-0.05)^(1/10)) should be used to identify significant changes while keeping the probability of one or more false positives at 5%. Less conservative methods (i.e. resulting in fewer false negatives) have been developed and are successfully being used (e.g. the false discovery rate method of Benjamini and Hochberg, 1995).
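
The numbers in this tip can be reproduced with a short sketch. The first part recalculates the 40% probability and the Šidák threshold; the second part is a minimal implementation of the Benjamini and Hochberg step-up procedure as I understand it. The list of P-values is made up for illustration, and the function name is my own.

```python
alpha = 0.05
n_tests = 10

# Probability of at least one false positive among 10 independent tests at alpha = 0.05.
prob_any_false_positive = 1 - (1 - alpha) ** n_tests
# Šidák-corrected threshold per test.
sidak_threshold = 1 - (1 - alpha) ** (1 / n_tests)

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking the hypotheses rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k for which the k-th smallest P-value is <= k/m * fdr.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected

# Hypothetical P-values for 10 target genes.
p_values = [0.001, 0.008, 0.012, 0.019, 0.030, 0.20, 0.35, 0.50, 0.70, 0.90]
n_sidak = sum(p <= sidak_threshold for p in p_values)
n_bh = sum(benjamini_hochberg(p_values))

print(f"P(at least one false positive) = {prob_any_false_positive:.2f}")
print(f"Šidák threshold = {sidak_threshold:.4f}")
print(f"significant: Šidák {n_sidak}, Benjamini-Hochberg {n_bh}")
```

For this invented list, the Šidák threshold retains only the smallest P-value, whereas the Benjamini and Hochberg procedure retains the four smallest, which illustrates the difference in conservativeness.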

6. Independent biological replicates are required

To draw meaningful and reliable conclusions, independent biological replicates are required. The minimum number of such biological replicates depends on the statistical test and on the power one wants to achieve (e.g. for confidence interval analysis, at least 3 replicates are needed; for a non-parametric paired test (Wilcoxon signed-rank test), at least 6 pairs are needed; for a Mann-Whitney test, at least 8 data points are needed for the 2 groups combined). It must be clear that statistics on repeated measurements (e.g. replicates of the PCR reaction) are meaningless, as only technical variation is measured.
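
The lower limits for the non-parametric tests follow from a simple counting argument: with too few data points, even the most extreme outcome cannot produce a two-sided P-value below 0.05. The sketch below, with function names of my own choosing, assumes exact two-sided tests without ties or zero differences.

```python
from math import comb

def min_p_wilcoxon_signed_rank(n_pairs):
    """Smallest possible two-sided P-value: all differences have the same sign."""
    return 2 / 2 ** n_pairs

def min_p_mann_whitney(n_a, n_b):
    """Smallest possible two-sided P-value: the two groups are completely separated."""
    return 2 / comb(n_a + n_b, n_a)

# Smallest number of pairs for which the signed-rank test can reach P < 0.05.
min_pairs = next(n for n in range(1, 50) if min_p_wilcoxon_signed_rank(n) < 0.05)

# Smallest combined number of data points for which Mann-Whitney can reach P < 0.05.
min_total = next(
    total for total in range(2, 50)
    if any(min_p_mann_whitney(n_a, total - n_a) < 0.05 for n_a in range(1, total))
)

print(f"Wilcoxon signed-rank: at least {min_pairs} pairs")
print(f"Mann-Whitney: at least {min_total} data points in total")
```

These are the bare minimum to be able to reach significance at all; they say nothing about having sufficient power to detect a realistic effect.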

7. When in doubt, consult an expert bio-statistician

A good molecular biologist should have a decent background in data analysis and bio-statistics. This will become increasingly important as biology is turning into an information science. However, one cannot be an expert in every discipline. Therefore, for more advanced analyses, or whenever in doubt, consult an expert in the field. Of note, as the saying goes, garbage in equals garbage out, so pay careful attention to experiment design and execution; statisticians often lack the domain expertise or background to judge whether the experiment was properly conducted, or whether there was pairing information in your study design (see tip 2).


To make data analysis easier for researchers using RT-qPCR, Biogazelle has developed an integrated statistical analysis wizard in its qPCR data analysis software qbase+.

Completing the wizard in a question/answer style will automatically select and execute the appropriate test. The first 5 tips in this blog post are integrated in the stat module. As such, data are log-transformed prior to doing the statistical test; the software recognizes the need to perform multiple testing correction of the P-value (using the Benjamini and Hochberg FDR method, 1995); non-parametric tests are proposed when sample size is limited; and confidence intervals are always calculated. More details on the stat module can be found in the qbase+ manual. More than 15 years of research experience and interaction with course participants indicate that the majority of gene expression questions can be addressed by the tests built into qbase+.

qbase+ statistical module: one way ANOVA

[1] I obtained a master of science in bio-engineering at Ghent University in 1997.
[2] In principle, log transformation is not required when applying a rank based statistical test. However, I always recommend doing it, as it cannot hurt for a rank based test, and is absolutely required for a parametric test.

Topics: statistics, gene expression

Jo Vandesompele

Jo Vandesompele is co-founder and CSO of Biogazelle. He is also a professor in Functional Cancer Genomics and Applied Bioinformatics at Ghent University, Belgium. Jo obtained a Master of Science in Bioscience Engineering (1997) and a PhD in Medical Genetics (2002). He is author of more than 200 scientific articles in international journals, including some pioneering publications in the domain of qPCR based nucleic acid quantification.
