What is a gene? It seems a simple question, but geneticists give no clear answer. Although the concept of a gene has been shown to be much more complex than ‘a DNA sequence transcribed into RNA and translated into a single protein product’, scientists often ignore this complexity. For example, while the majority of human genes show differential splicing, expression analysis is typically performed at the gene level rather than the transcript level (PubMed, for instance, has more than 800,000 ‘gene expression’ papers but only around 2,000 ‘transcript expression’ papers). I do believe that most of the authors of these papers are well aware of alternative splicing, but that transcripts are harder to study because less is known about the functions of specific transcripts than about gene functions, and because transcript analysis used to be more challenging from a technological perspective.
RNA-seq now makes decent transcript analysis possible: not only discovering and describing novel transcripts, but also routinely performing expression analysis at the transcript level. Transcript expression analysis requires a greater read depth and more complex analyses, but gives more accurate insight into what is really going on in the transcriptome. However, some skeptics wonder whether the newly discovered transcripts, especially those expressed at low levels, are real and biologically relevant, and not just a glitch of our cellular machinery or an artifact of the sequencing methodology and data analysis.
To shed some light on this question, Biogazelle participated in the SEQC study by validating a series of newly identified transcripts by means of quantitative PCR. The Sequencing Quality Control (SEQC, aka MAQC-III) study aims to assess the technical performance of next-generation sequencing platforms by generating benchmark datasets with reference samples and evaluating the advantages and limitations of various bioinformatics strategies in RNA and DNA analyses. For qPCR validation, we selected 160 splice junctions that varied widely in sequence coverage and in the level of support from the algorithms used to identify the novel junctions (CStar, Magic, Subread). Of note, qPCR-based transcript analysis is a lot more challenging than conventional gene-centric expression analysis. Previously, we published that PCR specificity should come from the primer sequences, and not from the detection probe.
The top figure shows a model of an alternatively spliced gene with exons represented by blue boxes, introns by continuous blue lines and the splicing event leading to the novel junction as a blue dotted line. Confirmation of the first transcript by PCR can be done easily because the combination of the junction-flanking exons is unique, allowing the design of a primer set that uniquely amplifies transcript variant 1. Confirmation of the second transcript is more challenging because the combination of flanking exons occurs in other transcripts as well. Qualitative confirmation can be achieved by combining a primer set that does not uniquely amplify the second transcript (2a) with amplicon length analysis to confirm which transcripts are present. For quantitative (qPCR) analysis, a primer set unique to the novel junction in transcript 2 needs to be designed. To avoid co-amplification of other transcripts, we require that the primer overlaps the junction with at least 5 bases at its 3’ end and 7 bases at its 5’ end.
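The junction-overlap requirement can be expressed as a small check. The sketch below is illustrative only and is not taken from primerXL: the function name, the 0-based transcript coordinates and the assumption of a forward primer (5’ part in the upstream exon, 3’ part in the downstream exon) are all hypothetical choices made for the example.

```python
def spans_junction(primer_start: int, primer_len: int, junction_pos: int,
                   min_5p: int = 7, min_3p: int = 5) -> bool:
    """Check whether a forward primer straddles a splice junction.

    Assumed conventions (not from the source): coordinates are 0-based on the
    spliced transcript, and junction_pos is the index of the first base of the
    downstream exon.
    """
    primer_end = primer_start + primer_len   # exclusive end coordinate
    bases_5p = junction_pos - primer_start   # primer bases in the upstream exon
    bases_3p = primer_end - junction_pos     # primer bases in the downstream exon
    # Both sides must carry enough bases: 7 at the 5' end, 5 at the 3' end.
    return bases_5p >= min_5p and bases_3p >= min_3p


# A 20-mer starting 10 bases upstream of a junction at position 100
# has 10 bases on each side and passes the check.
print(spans_junction(primer_start=90, primer_len=20, junction_pos=100))

# The same primer shifted so that only 3 bases cross the junction fails.
print(spans_junction(primer_start=83, primer_len=20, junction_pos=100))
```

A reverse primer would need the mirrored check on the opposite strand; that case is left out here to keep the sketch short.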
Using our in-house PCR assay design pipeline primerXL, qPCR assays were designed for 160 junctions. Following wet-lab testing, we obtained a very high confirmation rate using qPCR, supplemented by amplicon size analysis when needed. All of the novel junctions predicted by multiple algorithms could be confirmed by qPCR. Even for junctions predicted by just a single algorithm, an 83% confirmation rate was obtained. In the meantime, several of these junctions have appeared in the Ensembl database, further supporting the view that these are genuine alternative transcripts and not a consequence of erroneous data processing.
Table 1: RT-qPCR confirmation rate of novel splice variants identified by RNA sequencing in the SEQC study.
junction prediction | number of junctions | number confirmed | confirmation rate |
multiple algorithms | 136 | 136 | 100% |
single algorithm | 24 | 20 | 83% |
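As a quick sanity check on the table, the rates can be recomputed from the counts. This is a minimal sketch: the dictionary and function names are made up for the example, and only the counts come from Table 1.

```python
# Counts copied from Table 1: (junctions tested, junctions confirmed).
TABLE_1 = {
    "multiple algorithms": (136, 136),
    "single algorithm": (24, 20),
}


def confirmation_rate(tested: int, confirmed: int) -> float:
    """Return the confirmation rate as a percentage."""
    return 100.0 * confirmed / tested


for label, (tested, confirmed) in TABLE_1.items():
    print(f"{label}: {confirmation_rate(tested, confirmed):.0f}%")

# Pooled rate across both categories (derived here, not quoted from the study).
total_tested = sum(tested for tested, _ in TABLE_1.values())
total_confirmed = sum(confirmed for _, confirmed in TABLE_1.values())
print(f"overall: {total_confirmed}/{total_tested} = "
      f"{confirmation_rate(total_tested, total_confirmed):.1f}%")
```

Pooling both categories gives 156 confirmed junctions out of the 160 tested, or 97.5%. That figure is calculated here from the table rather than reported by the study.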
With the technology to identify novel junctions (i.e. RNA-seq) and to analyze the expression of these specific transcripts in larger sample sets (using RT-qPCR), we are now ready to further investigate the biological meaning of these specific transcripts. Exciting times ahead of us!