BRIEF COMMUNICATIONS
Differential abundance analysis for microbial marker-gene surveys
© 2013 Nature America, Inc. All rights reserved.
Joseph N Paulson1,2, O Colin Stine3, Héctor Corrada Bravo1,2,4 & Mihai Pop1,2,4
We introduce a methodology to assess differential abundance in sparse high-throughput microbial marker-gene survey data. Our approach, implemented in the metagenomeSeq Bioconductor package, relies on a novel normalization technique and a statistical model that accounts for undersampling, a common feature of large-scale marker-gene studies. Using simulated data and several published microbiota data sets, we show that metagenomeSeq outperforms the tools currently used in this field.
Marker-gene surveys have recently been applied to clinical settings in order to understand the structure and function of healthy microbial communities and the association of microbiota with diseases such as Crohn's disease1, bacterial vaginosis2, diabetes3, eczema4, obesity5 and periodontal disease6. Potentially pathogenic or probiotic bacteria can be identified by detecting significant differences in their distribution across healthy and disease populations, thereby making the analysis of differential abundance critical. Similar issues are encountered in the attempt to correlate microbiome composition with environmental factors. Although methods for comparing whole communities are commonly used in this context7,8, there is a need for tools that discern taxon-specific associations in marker-gene surveys. We present a method that provides this level of resolution while removing biases that exist in current approaches.
The 16S ribosomal RNA gene is a commonly used marker for profiling diversity in microbial samples. Hypervariable regions within the gene are amplified and sequenced, and sequence reads are clustered into operational taxonomic units (OTUs)9. Representative sequences from each cluster are then classified taxonomically by alignment against a database of previously characterized 16S ribosomal DNA (rDNA) reference sequences10.
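As an illustration of the clustering step described above, the following is a minimal sketch of how reads might be grouped into OTUs and tallied into a per-sample count table. It is not the pipeline used in this work: the greedy centroid strategy, the position-wise identity on equal-length toy reads, the 97% identity threshold, and all function and variable names (`identity`, `greedy_otu_clustering`, `otu_table`) are assumptions chosen for illustration only. Production tools align reads of varying length and use more elaborate heuristics.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    Toy stand-in for alignment-based identity (assumption: no indels).
    """
    if len(a) != len(b):
        raise ValueError("toy identity requires equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(reads, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold
    identity; a read that matches no centroid founds a new OTU."""
    centroids = []    # one representative sequence per OTU
    assignments = []  # OTU index for each read, in input order
    for read in reads:
        for otu_id, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                assignments.append(otu_id)
                break
        else:
            centroids.append(read)
            assignments.append(len(centroids) - 1)
    return centroids, assignments


def otu_table(sample_labels, assignments, n_otus):
    """Count reads per (sample, OTU) pair; returns {sample: [count per OTU]}."""
    counts = Counter(zip(sample_labels, assignments))
    return {
        sample: [counts[(sample, k)] for k in range(n_otus)]
        for sample in sorted(set(sample_labels))
    }


# Hypothetical toy data: 100-nt reads from two samples.
base = "ACGT" * 25
variant = base[:10] + "T" + base[11:50] + "A" + base[51:]  # 2 mismatches
other = "TGCA" * 25                                        # unrelated read

reads = [base, variant, other, base]
samples = ["healthy", "healthy", "disease", "disease"]

centroids, assignments = greedy_otu_clustering(reads)
table = otu_table(samples, assignments, len(centroids))
```

In this toy input, `variant` differs from `base` at two of 100 positions and therefore joins the same OTU, whereas `other` founds a second OTU. The resulting `table` is the kind of sparse sample-by-OTU count matrix on which differential abundance is later assessed.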
Although data preprocessing and differential abundance analysis have been extensively studied in microarray and high-throughput sequencing (serial analysis of gene expression, or SAGE, and
1Graduate Program in Applied Mathematics & Statistics, and Scientific Computation, University of Maryland, College Park, Maryland, USA. 2Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA. 3Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, USA. 4Department of Computer Science, University of Maryland, College Park, Maryland, USA. Correspondence should be addressed to M.P. ([email protected]) or H.C.B. ([email protected]).
RECEIVED 4 MARCH; ACCEPTED 16 AUGUST; PUBLISHED ONLINE...