In the US FDA-led SEQC (also known as MAQC-III) study, different sequencing platforms were tested across more than ten sites using well-established reference RNA samples with built-in truths, in order to assess the discovery and expression-profiling performance of platforms and analysis pipelines. The entire SEQC data set comprises over 100 billion reads (10 Tb), providing a unique resource for thorough assessment of RNA-seq performance. Biogazelle co-authored and complemented this study by defining the first human transcriptome using well-established RT-qPCR technology. Using almost 21,000 PrimePCR qPCR assays (jointly developed by Biogazelle and Bio-Rad), the mRNA expression repertoire of the four MAQC samples was established.
RNA-seq and RT-qPCR are both valuable technologies for gene expression analysis, each with its own strengths and limitations. The main benefits of RNA-seq are the broad scope of genes being interrogated, its compatibility with allele- and transcript-specific RNA quantification, and the possibility of discovering previously unknown transcripts. RT-qPCR, in turn, is the technology par excellence for sensitive RNA quantification of a targeted set of genes. Of note, the cost dynamics of the two technologies are driven by completely different properties. Whereas qPCR cost scales with the number of genes being evaluated, RNA-sequencing cost scales with the required sensitivity. It is thus important to understand the relationship between sensitivity and read depth.
Here, we take advantage of the largest datasets for both RNA-seq and RT-qPCR to assess RNA-sequencing sensitivity as a function of read depth. At 6 billion reads for the MAQC-A sample, there is good concordance (r² = 0.75) between RNA-seq (Y-axis, number of reads) and RT-qPCR (X-axis, Cq value), without obvious signs of nonlinearity or truncation (Figure 1). By subsampling reads from this large dataset, we could evaluate the impact of reduced sequence coverage (100M (million), 50M and 10M reads) on detection sensitivity. At 100M reads the truncation is noticeable but not too severe. At 50M reads, and especially at 10M reads, the truncation becomes quite pronounced.
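To give a feel for why truncation appears as depth drops, the subsampling can be sketched as independent thinning of reads. This is an illustrative model, not the procedure used for the figures: it assumes each read is kept independently with probability equal to the target depth divided by the full depth, and the function names and example read counts are hypothetical.

```python
import math


def subsampling_fraction(full_depth, target_depth):
    """Fraction of reads kept when going from full_depth to target_depth."""
    if not 0 < target_depth <= full_depth:
        raise ValueError("target_depth must be in (0, full_depth]")
    return target_depth / full_depth


def prob_detected(reads_at_full_depth, fraction):
    """Probability that at least one read of a gene survives subsampling.

    Assumption: reads are kept independently with probability `fraction`,
    so P(no read kept) = (1 - fraction) ** reads_at_full_depth.
    """
    if reads_at_full_depth <= 0:
        return 0.0
    if fraction >= 1.0:
        return 1.0
    # -expm1(n * log1p(-f)) equals 1 - (1 - f) ** n, computed stably for small f.
    return -math.expm1(reads_at_full_depth * math.log1p(-fraction))


FULL_DEPTH = 6_000_000_000  # 6 billion reads, as in the text

if __name__ == "__main__":
    for target in (100_000_000, 50_000_000, 10_000_000):
        fraction = subsampling_fraction(FULL_DEPTH, target)
        for reads in (10, 100, 1000):  # hypothetical read counts at full depth
            p = prob_detected(reads, fraction)
            print(f"{target:>11,d} reads, gene with {reads:>4d} reads at 6B: "
                  f"P(detected) = {p:.3f}")
```

Under this model, a gene supported by only a handful of reads at 6 billion reads is unlikely to retain any read at 10M, which is consistent with the truncation at the low-expression end described above.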
To quantify the impact of reduced coverage, we calculated the fraction of genes that is not detected by RT-qPCR (pink), by RNA-seq (blue) or by both (grey) (Figure 2 top). At 6 billion reads, RNA-seq and RT-qPCR appear to be on par in terms of the number of genes detected: each technology fails to detect approximately 3% of genes. At 100M reads, somewhat more genes are not detected by RNA-seq; this effect becomes more pronounced as the coverage is further reduced to 50M or even 10M reads.
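The bookkeeping behind such a comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the three categories are treated as mutually exclusive (missed by RT-qPCR only, missed by RNA-seq only, missed by both), and the function name and toy input are hypothetical rather than taken from the study.

```python
from collections import Counter


def missed_fractions(genes):
    """Fraction of genes missed by RT-qPCR only, RNA-seq only, or both.

    `genes` is an iterable of (qpcr_detected, rnaseq_detected) booleans,
    one pair per gene. Categories are assumed to be mutually exclusive.
    """
    tally = Counter()
    total = 0
    for qpcr_detected, rnaseq_detected in genes:
        total += 1
        if not qpcr_detected and not rnaseq_detected:
            tally["both"] += 1
        elif not qpcr_detected:
            tally["rt_qpcr_only"] += 1
        elif not rnaseq_detected:
            tally["rna_seq_only"] += 1
    if total == 0:
        raise ValueError("no genes supplied")
    return {key: tally[key] / total
            for key in ("rt_qpcr_only", "rna_seq_only", "both")}


if __name__ == "__main__":
    # Toy input: four genes, one in each possible detection state.
    toy = [(True, True), (False, True), (True, False), (False, False)]
    print(missed_fractions(toy))
```

Running the same tally on detection calls made at each subsampled depth would give one set of fractions per depth, which is the kind of summary shown in the top panel of Figure 2.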
More relevant than the fraction of genes being detected is the fraction of genes whose expression level can be quantified. For these analyses, a somewhat arbitrary cutoff for the limit of quantification (LOQ) was set at 8 reads for RNA-seq and 8 molecules for RT-qPCR, the latter corresponding to a Cq value 3 cycles lower than the single-molecule Cq value (35 in our setting). At 6 billion reads, RNA-seq outperforms RT-qPCR in terms of quantitative power (Figure 2 bottom). At 100M reads, the opposite is observed. Lower coverage further reduces the quantitative power of RNA-seq.
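The link between 8 molecules and 3 cycles follows from the doubling of template per PCR cycle: 2³ = 8. A small sketch of the LOQ arithmetic is given below. It assumes 100% amplification efficiency by default and treats values exactly at the threshold as quantifiable; both are assumptions for illustration, and the function names are hypothetical.

```python
import math

SINGLE_MOLECULE_CQ = 35.0  # single-molecule Cq in the setting described above
LOQ_MOLECULES = 8          # RT-qPCR limit of quantification, in molecules
LOQ_READS = 8              # RNA-seq limit of quantification, in reads


def loq_cq(single_molecule_cq=SINGLE_MOLECULE_CQ,
           molecules=LOQ_MOLECULES,
           efficiency=1.0):
    """Cq value corresponding to `molecules` input molecules.

    Assumption: each cycle multiplies the template by (1 + efficiency),
    so N molecules cross the threshold log_{1+efficiency}(N) cycles earlier
    than a single molecule.
    """
    cycles_earlier = math.log2(molecules) / math.log2(1.0 + efficiency)
    return single_molecule_cq - cycles_earlier


def quantifiable_by_rnaseq(reads):
    """True if the read count reaches the RNA-seq LOQ."""
    return reads >= LOQ_READS


def quantifiable_by_qpcr(cq):
    """True if the gene was detected (cq is not None) and Cq is at or below the LOQ."""
    return cq is not None and cq <= loq_cq()


if __name__ == "__main__":
    print(loq_cq())                      # Cq threshold for 8 molecules
    print(quantifiable_by_qpcr(30.5))    # hypothetical Cq value
    print(quantifiable_by_rnaseq(5))     # hypothetical read count
```

With the default settings, the RT-qPCR threshold works out to a Cq of 32, matching the "3 cycles lower than 35" stated in the text.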
These results clearly indicate that the detection and quantification sensitivity of RNA-seq depends strongly on read depth. At extremely high coverage (6 billion reads in this example), RNA-seq performs better than RT-qPCR. At an affordable coverage of 100M reads, RNA-seq begins to suffer from reduced quantification and detection sensitivity compared to RT-qPCR. In view of the variable experimental needs, and the fact that not everyone may be interested in low-abundance genes, no definitive guidelines can be given on the required sequence coverage. Nevertheless, our analyses and benchmark data may provide reference numbers to better plan future RNA-seq studies in terms of intended sensitivity.