INTRODUCTION
Transposon insertion sequencing (TIS) (variously referred to as IN-Seq, Tn-Seq, HITS, and TraDIS) (reviewed in references 1 and 2) is a powerful forward genetics tool for the identification of genetic loci contributing to bacterial growth in diverse environments (3–6). Since its introduction in 2009, the method has been applied to a wide variety of bacterial species which have been exposed to an even broader array of growth conditions, ranging from chemically defined media to the complex and poorly characterized milieu of host tissues during bacterial infection (7, 8).
TIS methodologies rely on generating “libraries” that contain large numbers of transposon (Tn) insertion mutants, followed by high-throughput sequencing to identify insertion sites, which enumerates the relative abundance of individual Tn mutants within the library. Three common applications of TIS are (i) essential locus analysis, in which genes with disproportionately low frequencies of Tn insertion are identified to make inferences of genetic essentiality (9, 10); (ii) genetic interaction studies, in which differential frequencies of Tn insertion libraries generated in different genetic backgrounds are used to infer suppressor and synthetic lethal relationships (2, 11, 12); and (iii) sequential selection studies (the primary focus of this work), in which changes in the relative abundances of mutants within a library before and after imposition of a selective pressure are used to infer the importance of each locus for growth under the selective condition (1, 2, 4, 8). Typically, the Tn libraries include multiple unique mutants for each locus. Although each mutant can be analyzed independently, often, data for all insertion sites within a locus are combined to mitigate site-specific effects.
To quantify a genetic locus’ contribution to growth in a sequential selection experiment, TIS studies usually calculate the fold change (between input and output libraries) in the relative abundances of insertion mutants mapping to that locus. Fold change between input and output libraries is calculated as (reads per gene in output library)/(reads per gene in input library). Loci with a log2(fold change) of <0 are considered depleted, and those with log2(fold change) of >0 are considered enriched. Several statistical approaches for analyses of TIS data have been proposed (2, 5, 13, 14), with different methods for normalization, modeling read count data, and combining data from multiple insertion sites within a gene.
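The fold change calculation described above can be sketched as follows. This is a minimal illustration only: the gene names and read counts are hypothetical, and the counts are assumed to be already normalized for sequencing depth so that they reflect relative abundances.

```python
import math

def log2_fold_change(output_reads, input_reads):
    """Return log2(output/input) per gene; None where either count is zero."""
    l2fc = {}
    for gene, out_count in output_reads.items():
        in_count = input_reads.get(gene, 0)
        if in_count > 0 and out_count > 0:
            l2fc[gene] = math.log2(out_count / in_count)
        else:
            # Left uncalculated rather than guessed (assumption of this sketch).
            l2fc[gene] = None
    return l2fc

# Hypothetical depth-normalized read counts per gene (illustrative only).
input_reads = {"geneA": 400, "geneB": 100, "geneC": 250, "geneD": 0}
output_reads = {"geneA": 100, "geneB": 400, "geneC": 250, "geneD": 30}

l2fc = log2_fold_change(output_reads, input_reads)
depleted = [g for g, v in l2fc.items() if v is not None and v < 0]
enriched = [g for g, v in l2fc.items() if v is not None and v > 0]
```

Here geneA falls in the depleted set and geneB in the enriched set, while geneD is left uncalculated because it has no input reads.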
In contrast to the variety of approaches for quantifying a locus’ contribution to growth in a single sequential selection experiment, there are few methodologies for comparing multiple TIS data sets. Jensen et al. (15) described a technique to compare various TIS data sets by first normalizing the fold change values against an experimentally derived population expansion factor prior to performing comparisons across screens. This approach allows a comparison of TIS data sets that can be adequately normalized; however, measurement of the population expansion factor can be challenging, particularly in in vivo experiments. DeJesus et al. (16) used a hierarchical Bayesian approach to incorporate the variation of the fold change value of each gene to identify genes displaying statistically significant differences between two TIS screens. They showed that such an approach permits the study of genetic interactions by comparing the results of screens conducted in parallel in two genetically different strains constructed from a single parental strain. While these two approaches provide valuable new tools, they have limited ability to compare the results of multiple TIS data sets, as both methods are restricted to pairwise comparisons.
Here, we present comparative TIS (CompTIS), a novel framework for conducting comparisons of multiple TIS data sets that relies on the dimensional reduction approach of principal-component analysis (PCA). As an unsupervised technique, PCA makes no prior assumptions about the structure of the data sets, providing an unbiased and broadly applicable approach to discovery. Dimensional reduction approaches such as PCA transform multivariate data sets into smaller sets of summary parameters while maintaining the underlying structure of the data sets, facilitating a direct interpretation of the relationships between data sets. Although extensively used in transcriptome sequencing (RNA-seq) and microbiome (i.e., 16S) analyses, PCA and other dimensional reduction approaches that extract the sources of variation between various multivariate data sets have not been thoroughly explored for comparisons of multiple TIS data sets. Given that TIS and RNA-seq data sets have highly similar structures (both are composed of matrices of genes and associated fold changes), we developed a PCA-based dimensional reduction approach for the comparison of TIS data sets.
CompTIS begins by implementing “screen-level” PCA and clustering to depict the variation between different screens and the relatedness of TIS data sets. This first step enables the grouping of screens with similar results without prior knowledge of experimental conditions, allows the identification of outlier screens, and facilitates the selection of comparable data sets for a subsequent “gene-level” implementation of PCA. Gene-level PCA examines variance across genes indicative of mutant growth phenotypes that are either consistent or divergent across TIS studies in order to identify genes that are important for growth under specific conditions or combinations of conditions. Here, we applied CompTIS to in vitro and several in vivo TIS data sets derived from studies of different pathogenic vibrio species and strains. This approach is not restricted to pairwise comparisons of TIS data sets and is not dependent on a specific upstream analysis method. CompTIS provides a general framework for unsupervised data discovery and meta-analyses of TIS studies.
RESULTS AND DISCUSSION
Background considerations.
Deriving biological insight from TIS data sets is complicated by their high dimensionality. Suppose we have k screens (in this work, our data sets contain up to almost a dozen screens, although larger ones are available [8]), each of which measures the fold change of N genes (typically in the thousands). For visualization purposes, we can represent each screen as one of k points in N-dimensional space, where the position along the nth axis is the log2(fold change) (L2FC) value of the nth gene. PCA identifies the line in the N-dimensional space along which there is the greatest variance among the k screens. Each screen is assigned a first principal component score (PC1), which is the position along this axis of greatest variance, and represents a weighted sum of the L2FC values for each gene. To compute the second principal component score, lines perpendicular to the line of greatest variance are identified (perpendicular so that variation in one principal component is independent from the others), and again, the line along which there is the greatest variance among the k screens is selected; the second principal component score for each screen is its position along this axis. The process is repeated, each time selecting the axis of maximum variance, subject to the constraint that it be perpendicular to all previous axes of maximum variance. Ideally, the variation in each screen can be accurately reconstructed by the first several principal components since they exhibit the greatest variance, and higher principal components can be dropped with little loss of accuracy.
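The geometric description above corresponds to a singular value decomposition of the screens-by-genes matrix. The following is a minimal sketch on synthetic data, not the analysis code used in this work: the matrix dimensions, noise level, and variable names are assumptions, and the sketch applies standard (centered, unweighted) PCA rather than a weighted variant.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_screens = 200, 5

# Synthetic L2FC values: screens 0-3 share one pattern, screen 4 follows another.
pattern_a = rng.normal(size=n_genes)
pattern_b = rng.normal(size=n_genes)
l2fc = np.column_stack(
    [pattern_a + 0.1 * rng.normal(size=n_genes) for _ in range(4)]
    + [pattern_b + 0.1 * rng.normal(size=n_genes)]
)  # rows = genes, columns = screens

# Screen-level view: each screen is one point in N-dimensional gene space.
X = l2fc.T                      # shape (k screens, N genes)
X = X - X.mean(axis=0)          # center each gene across screens

# Rows of Vt are the successive perpendicular axes of greatest variance.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * S                  # scores[i, j] = position of screen i along PC(j+1)
explained = S**2 / np.sum(S**2) # fraction of total variance per component
```

In this synthetic example, the PC1 scores place screen 4 on the opposite side of the origin from screens 0 to 3, which is the kind of separation a screen-level plot is meant to reveal.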
We term the approach described above, assigning each of k screens as a point in N-dimensional gene space, a screen-level approach. Alternatively, we could assign each of N genes a point in k-dimensional screen space, which we term a gene-level approach. For the gene-level approach, we can also apply PCA, identifying the direction (in k-dimensional screen space) along which there is the greatest variation across genes and proceeding analogously. Whereas the screen-level approach facilitates the identification of patterns among screens or identification of screens of interest (outliers, for instance), the gene-level approach highlights patterns among genes and enables the selection of genes of interest.
Screen-level PCA and clustering of TIS screens identify variation among replica screens and distinguish screens performed under different conditions.
We examined whether TIS data were amenable to PCA-based dimensional reduction and hierarchical agglomerative clustering by analyzing published data sets from sequential selection experiments. The data were derived from five screens performed with a high complexity
We performed screen-level PCA to analyze the L2FC data for each variable (i.e., all of
FIG 1
Screen-level comparative TIS analysis of
The relatedness of the screens was also assessed via hierarchical agglomerative clustering of L2FC values, which provided additional support for the PCA-based groupings. We used a bootstrapping approach to determine the statistical support for the separation of the in vivo and in vitro data sets (Fig. 1B and C). Both approaches demonstrate that the variance between the in vitro observation and the four in vivo observations exceeds the variance among the four in vivo observations. Thus, PCA and clustering analyses of gene fold change values from multiple TIS data sets can reveal the relatedness of multiple sequential selection screens. These analyses could also potentially identify batch artifacts in the replicates, such as technical variation introduced during library preparation or sequencing, as these would appear as outliers. Both PCA- and clustering-based analyses of screens have merit. Clustering provides a bootstrap value to evaluate the robustness of each cluster, while PCA provides a more intuitive visualization of the relatedness of screens and/or replicates, particularly when analyzing a large number of screens.
Gene-level PCA allows for integration of data across biological replicates of a screen.
Screen-level PCA enabled the visualization of the relatedness of TIS data sets and screens; however, it does not provide information about relationships among genes. To identify sets of genes whose mutants exhibit similar patterns across different screens, we implemented gene-level PCA of L2FC values from the
Gene-level PCA provides principal-component scores for each gene, which are weighted sums of the L2FC measurements across the biological replicates analyzed. Each principal component has an associated set of weighting coefficients, which determine the contribution of each sample to the overall score per gene. The first principal component identified by gene-level PCA (PC1) accounted for 78% of the overall variance, while principal components 2 to 4 appeared to each account for a similar small amount of the remaining variance (Fig. 2A). Thus, our data were approximately one dimensional and hence well captured by a single quantity, PC1, for each gene. For PC1, the coefficients for each replicate had the same sign and were of similar magnitudes (Fig. 2B), so that the L2FC measurements from each sample contributed similarly to each gene’s PC1 score; that is, PC1 is a weighted average. The roughly equal weights of each screen are consistent with our expectation that gene-level L2FC measurements will be relatively consistent across biological replicates. Thus, PCA facilitates the comparison of replicates of screens by using weights informed by the data.
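To illustrate why PC1 behaves like a weighted average when replicates agree, the following sketch runs unweighted gene-level PCA on synthetic replicate data. The number of genes, the effect sizes, and the noise level are assumptions chosen for illustration and are not taken from the screens analyzed here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_reps = 500, 4

# Synthetic replicates: one shared per-gene effect plus independent noise.
true_effect = rng.normal(scale=2.0, size=n_genes)
l2fc = true_effect[:, None] + rng.normal(scale=0.5, size=(n_genes, n_reps))

# z-score each replicate (column), then treat each gene as a point in
# k-dimensional screen space.
Z = (l2fc - l2fc.mean(axis=0)) / l2fc.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

explained = S**2 / np.sum(S**2)   # fraction of variance per component
pc1_coeff = Vt[0]                 # one weighting coefficient per replicate
if pc1_coeff.sum() < 0:           # the sign of a component is arbitrary
    pc1_coeff = -pc1_coeff
pc1_scores = Z @ pc1_coeff        # weighted sum of replicate values per gene
```

When the replicates are consistent, as in this synthetic case, the PC1 coefficients come out with the same sign and similar magnitudes, so each gene's PC1 score tracks the mean of its standardized L2FC values across replicates.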
FIG 2
Gene-level comparative TIS analysis of
Most of
Screen-level PCA reveals the relatedness of in vivo screens from different vibrio strains.
To determine whether CompTIS could be applied to more distantly related data sets, we repeated the screen-level and gene-level analyses described above after incorporating data from 7 additional in vivo screens (13, 19). Three of these screens utilized a transposon library constructed in
We wondered whether the screen-level PCA would be able to discern two anticipated results. First, that biological replicates of the same library exhibit more similarity in mutant growth phenotypes than those of distinct bacterial strains; and second, that the two
To assess the relatedness of these 11 data sets, we performed screen-level PCA to analyze the L2FC measurements for each variable (that is, conserved gene) across all observations (11 screens). In screen-level PCA, the first and second principal components accounted for 72% and 11%, respectively, of the overall variance. PC1 and PC2 values separated the data into 3 groups based on both species (
FIG 3
PCA-based analyses of in vivo TIS data from 3 pathogenic vibrio strains. (A) Screen-level PCA of
Gene-level PCA identifies both strain-independent and strain-dependent mutant growth phenotypes.
Although
Heatmaps were generated to visualize the lowest 1% of PC1 scores (Fig. 4A). These genes generally exhibited negative L2FC measurements across all 11 data sets and thus display strain-independent attenuation in vivo. Functional analyses revealed that the majority of these genes are involved in de novo purine and pyrimidine nucleotide synthesis as well as complex 1 of the electron transport chain (Fig. 4B). These observations suggest that access to nucleotides in the small intestine is limited for both pathogens, even though they modify the host environment in distinct ways; e.g.,
FIG 4
Gene-level PC1 and PC2 identify genes required for colonization by all strains and by specific strains, respectively. (A) Heatmap of log2(fold change) values of genes with the 1% lowest of gene-level PC1 scores across 11 in vivo vibrio screens. (B) Categories highly represented among the genes with the lowest 1% of PC1 scores (25 genes total). (C) Heatmap of a subset of genes with discordant L2FC values across strains, selected from genes with the lowest 1% or highest 1% of gene-level PC2 scores.
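Selecting the genes with the lowest or highest 1% of principal-component scores, as used for the heatmaps, can be sketched as below. The function name, gene identifiers, and scores are hypothetical and serve only to show the selection step.

```python
import numpy as np

def select_extreme_genes(scores, gene_ids, fraction=0.01, lowest=True):
    """Return gene IDs with the lowest (or highest) `fraction` of scores."""
    n_keep = max(1, int(round(fraction * len(scores))))
    order = np.argsort(scores)          # ascending order of scores
    if not lowest:
        order = order[::-1]             # descending order for the top tail
    return [gene_ids[i] for i in order[:n_keep]]

# Hypothetical gene-level scores (illustrative only).
rng = np.random.default_rng(4)
pc_scores = rng.normal(size=300)
gene_ids = [f"gene{i:03d}" for i in range(300)]

lowest_genes = select_extreme_genes(pc_scores, gene_ids)
highest_genes = select_extreme_genes(pc_scores, gene_ids, lowest=False)
```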
Summary and conclusions.
We developed CompTIS, which utilizes screen-level and gene-level PCA and clustering, to accomplish meta-analysis of TIS data. Screen-level PCA distilled genome-wide mutant growth phenotypes to facilitate comparisons across screens. This unsupervised learning method was capable of establishing the relatedness of screens, distinguishing replicate screens from those conducted in different experimental contexts, and identifying outlier screens. Furthermore, clustering analysis with bootstrapping corroborated the PCA analysis and enabled the identification of statistically significant clusters. Using such an approach, we detected differences in the genetic requirements for intestinal colonization in two closely related strains of
The second part of our approach relied on using gene-level PCA to identify variance across genes indicative of mutant growth phenotypes that are either consistent or divergent across multiple screens. Importantly, gene-level PCA does not depend on a priori hypotheses regarding consistency or divergence of mutant growth phenotypes across screens for the identification of significant gene sets. Instead, the utility of gene-level PCA lies in its capacity to guide the formation of hypotheses regarding the genes that modulate growth, both in biological replicates and in separate strains and environments.
In summary, our findings suggest that a PCA- and clustering-based analytic approach provides a straightforward method for comparing the results of different TIS screens, thereby facilitating the discovery of novel associations between screens and guiding hypothesis development for additional experimentation.
MATERIALS AND METHODS
Weighting of Con-ARTIST log2(fold change) measurements.
We used previously published TIS screens for our analyses (13, 17, 19). To minimize the influence of noise due to variability in log2(fold change) (L2FC) measurements observed across genes with few unique insertion mutants, the L2FC measurement for each gene was weighted based on the variability observed in genes with similar numbers of unique insertion mutants. This procedure ensured that low-variability (i.e., high-confidence) observations were given proportionally higher weights than those with higher variability (see Fig. S1 in the supplemental material).
For each screen, we calculated the standard deviation of each gene’s L2FC value by comparing the L2FC values calculated for each gene across the 100 independently simulated input libraries that are generated during the Con-ARTIST analysis (13). Note that for our data sets, for each gene, the average fold change is calculated by averaging the ratio, (reads per gene in output library)/(reads per gene in simulated input library), across all simulated input libraries. The input libraries are generated via multinomial-based resampling in order to model stochastic drift, i.e., a bottleneck, in the input library, hence limiting the effect of genetic drift on downstream analysis and reducing the number of false-positive findings (13). We fit a power law function (y = ax^b) to the standard deviation of each gene’s L2FC value (y) as a function of the number of unique insertion mutants represented in each gene (x). Fitting was performed using the fit function in Matlab (Curve Fitting Toolbox) with the power1 model and the name-value pair Robust, Bisquare. We found that a function with b of ∼−2 fit the data well. For each screen, each gene’s weight was calculated by first using the fitted coefficients to determine the estimated standard deviation in L2FC values based on the number of unique insertion mutants present for the gene and then taking the inverse of the estimated standard deviation (i.e., for gene q, its weight is w_q = 1/(ax^b), where x is the number of unique insertion mutants present for the gene).
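The weighting procedure can be sketched as follows. Note that this sketch substitutes an ordinary least-squares fit in log-log space for the robust (bisquare) fit performed in Matlab, and that the data are synthetic: the insertion counts, the scale factor, and the noise level are assumptions, and the exponent used to simulate the data is simply set to the approximate value reported above.

```python
import numpy as np

def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b in log-log space (not a robust fit)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(intercept), slope   # (a, b)

# Synthetic data: the standard deviation of L2FC shrinks as the number of
# unique insertion mutants per gene grows.
TRUE_A, TRUE_B = 1.5, -2.0            # used only to simulate the data
rng = np.random.default_rng(2)
n_insertions = rng.integers(1, 60, size=300).astype(float)
sd_l2fc = TRUE_A * n_insertions**TRUE_B * np.exp(rng.normal(scale=0.05, size=300))

a, b = fit_power_law(n_insertions, sd_l2fc)

# Weight for each gene: inverse of its estimated standard deviation,
# w_q = 1 / (a * x**b), so well-sampled genes receive larger weights.
weights = 1.0 / (a * n_insertions**b)
```

A robust fit would down-weight outlying genes during fitting; with real data, that difference may matter more than it does in this clean synthetic example.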
Principal-component analysis.
Prior to performing PCA, we removed genes from the analysis that contained one or more uncalculated L2FC values (e.g., arising when there were no reads mapping to the gene in a particular screen). Next, the L2FC values in each screen were standardized (i.e., z-score normalized) using the zscore function in Matlab. In this final normalized L2FC matrix, which was used for PCA analyses, rows corresponded to genes and columns corresponded to screens.
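The two preprocessing steps above (dropping genes with any uncalculated value and standardizing each screen) can be sketched as follows. The matrix values and gene identifiers are hypothetical, and the sample standard deviation (n − 1 denominator) is used on the assumption that it matches the default behavior of the Matlab zscore function.

```python
import numpy as np

# Hypothetical L2FC matrix: rows = genes, columns = screens.
l2fc = np.array([
    [-2.0, -1.5, -1.8],
    [0.1, np.nan, 0.3],   # uncalculated in the second screen
    [1.0, 0.8, 1.2],
    [0.0, 0.2, -0.1],
])
gene_ids = np.array(["g1", "g2", "g3", "g4"])

# Remove genes with one or more uncalculated L2FC values.
keep = ~np.isnan(l2fc).any(axis=1)
filtered = l2fc[keep]
kept_ids = gene_ids[keep]

# Standardize each screen (column) to zero mean and unit sample variance.
normalized = (filtered - filtered.mean(axis=0)) / filtered.std(axis=0, ddof=1)
```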
(i) Screen-level PCA. Weighted PCA was performed in Matlab using the pca function with the default algorithm (singular value decomposition [svd]), “Centered” set to off, “VariableWeights” corresponding to a column vector of the sum of the calculated weights of each gene across the screens being analyzed, and “Weights” corresponding to a row vector of the sum of the calculated weights of all the genes in each screen. Screen-level PCA was performed on the transpose of the normalized L2FC matrix.
(ii) Gene-level PCA. Weighted PCA was performed in Matlab using the pca function with the default algorithm (singular value decomposition [svd]), “Centered” set to off, “VariableWeights” corresponding to a row vector of the sum of the calculated weights of all the genes in each screen, and “Weights” corresponding to a column vector of the sum of the calculated weights of each gene across the screens being analyzed. Gene-level PCA was performed directly on the normalized L2FC matrix.
Clustering and bootstrapping analysis.
We used the normalized L2FC matrix to perform hierarchical agglomerative clustering with bootstrapping using the pvclust package (version 2.0-0) (24, 25) in R (version 3.3.2) (26) and the following parameters: distance function, Euclidean; clustering method, Ward's (ward.D2); and n = 1,000 bootstrap replications. pvclust provides two P values, the standard bootstrap probability and the approximately unbiased (AU) value, which is calculated using multiscale bootstrap resampling and is less biased than the bootstrap probability.
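The idea behind the standard bootstrap probability can be sketched in Python with SciPy, which is a stand-in here for the pvclust package in R. This is a simplified illustration under several assumptions: the data are synthetic, genes are resampled with replacement, support is counted only for a two-cluster cut of the tree, and the AU value from multiscale bootstrap resampling is not computed. The function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def two_cluster_labels(matrix):
    """Ward clustering of screens (columns), cut into two clusters."""
    tree = linkage(matrix.T, method="ward", metric="euclidean")
    return fcluster(tree, t=2, criterion="maxclust")

def bootstrap_support(matrix, group, n_boot=100, seed=0):
    """Fraction of gene-resampled matrices in which `group` forms its own cluster."""
    rng = np.random.default_rng(seed)
    n_genes, n_screens = matrix.shape
    target = np.zeros(n_screens, dtype=bool)
    target[list(group)] = True
    hits = 0
    for _ in range(n_boot):
        rows = rng.integers(0, n_genes, size=n_genes)  # resample genes
        labels = two_cluster_labels(matrix[rows])
        member = labels == labels[group[0]]
        hits += int(np.array_equal(member, target))
    return hits / n_boot

# Synthetic normalized L2FC matrix: four similar screens and one distinct screen.
rng = np.random.default_rng(3)
pattern_vivo = rng.normal(size=300)
pattern_vitro = rng.normal(size=300)
matrix = np.column_stack(
    [pattern_vivo + 0.3 * rng.normal(size=300) for _ in range(4)]
    + [pattern_vitro + 0.3 * rng.normal(size=300)]
)

support = bootstrap_support(matrix, group=[0, 1, 2, 3])
```

A support value near 1 indicates that the grouping of the first four screens is stable under resampling of genes.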
Identification of conserved vibrio genes.
The
Cluster of orthologous groups analysis.
COG analysis of the
Data availability.
Matlab scripts for running the screen-level and gene-level PCA analyses can be accessed at https://bitbucket.org/gabriel_billings/comptis.
b Division of Infectious Diseases, Brigham & Women’s Hospital, Boston, Massachusetts, USA
c Howard Hughes Medical Institute, Boston, Massachusetts, USA
University of Michigan–Ann Arbor
Copyright © 2019 Hubbard et al. This work is published under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
ABSTRACT
Transposon insertion sequencing (TIS) is a widely used technique for conducting genome-scale forward genetic screens in bacteria. However, few methods enable comparison of TIS data across multiple replicates of a screen or across independent screens, including screens performed in different organisms. Here, we introduce a post hoc analytic framework, comparative TIS (CompTIS), which utilizes unsupervised learning to enable meta-analysis of multiple TIS data sets. CompTIS first implements screen-level principal-component analysis (PCA) and clustering to identify variation between the TIS screens. This initial screen-level analysis facilitates the selection of related screens for additional analyses, reveals the relatedness of complex environments based on growth phenotypes measured by TIS, and provides a useful quality control step. Subsequently, PCA is performed on genes to identify loci whose corresponding mutants lead to concordant/discordant phenotypes across all or in a subset of screens. We used CompTIS to analyze published intestinal colonization TIS data sets from two vibrio species. Gene-level analyses identified both pan-vibrio genes required for intestinal colonization and conserved genes that displayed species-specific requirements. CompTIS is applicable to virtually any combination of TIS screens and can be implemented without regard to either the number of screens or the methods used for upstream data analysis.
IMPORTANCE Forward genetic screens are powerful tools for functional genomics. The comparison of similar forward genetic screens performed in different organisms enables the identification of genes with similar or different phenotypes across organisms. Transposon insertion sequencing is a widely used method for conducting genome-scale forward genetic screens in bacteria, yet few bioinformatic approaches have been developed to compare the results of screen replicates and different screens conducted across species or strains. Here, we used principal-component analysis (PCA) and hierarchical clustering, two unsupervised learning approaches, to analyze the relatedness of multiple in vivo screens of pathogenic vibrios. This analytic framework reveals both shared pan-vibrio requirements for intestinal colonization and strain-specific dependencies. Our findings suggest that PCA-based analytics will be a straightforward widely applicable approach for comparing diverse transposon insertion sequencing screens.