Content area
Full Text
L E T T E R S
Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation
Michael I Love1,2, John B Hogenesch3 & Rafael A Irizarry1,2
2016 Nature America, Inc., part of Springer Nature. All rights reserved.
We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms.
We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data,we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.
Obtaining transcript abundance information from RNA sequencing (RNA-seq) experiments relies on complex methods implemented in software such as Cufflinks1 and RSEM (RNA-seq by expectation- maximization)2. However, RNA-seq data can suffer from sample-specific biases as a result of RNA extraction and library preparation steps35. Methods for estimating gene and transcript abundance attempt to mitigate the effect of technical biases by estimating sample-specific bias parameters. For gene-level expression, common normalization methods use the GC content and length of the gene68, or identify batch effects by detecting structure in expression measurements that are common across genes and not associated with the experimental design911. At the transcript level, sample-specific biases that current methods correct for include the fragment length distribution induced by size selection, positional bias along the transcript due to RNA degradation and mRNA selection techniques, and sequence-based bias in read start positions arising from the differential binding efficiency of random hexamer primers2,1216 (Fig. 1a and Supplementary Table 1). Even so, it is common to observe extreme variability in the coverage of RNA-seq fragments along transcripts that is purely technical and sample-specific17 and not explained by current bias models, which confounds current methods designed to identify and quantify transcripts18 (Fig. 1b).
To investigate the cause of systematic errors in...