- BIC
- Bayesian Information Criterion
- DAPC
- discriminant analysis of principal components
- df
- degrees of freedom
- GWAS
- genome-wide association studies
- h2
- broad-sense heritability
- LDA
- linear discriminant analysis
- P3D
- population parameters previously determined
- PCA
- principal component analysis
- PEV
- prediction error variance
- REML
- restricted maximum likelihood
- SolCAP
- Solanaceae Coordinated Agricultural Project
- QTL
- quantitative trait locus (or loci)
- SNP
- single nucleotide polymorphism
Abbreviations
Genome-wide association studies have become commonplace in diploid plant species as an approach to discovering causal variants. Compared with linkage mapping in biparental crosses, the ability to analyze more diverse germplasm by GWAS promotes the identification of variants with a consistent effect across the discovery population (Myles et al., 2009). Another reason for the intense interest in GWAS is the ability to utilize existing phenotypic and/or genotypic data, which promotes larger population sizes and therefore higher statistical power. Provided sufficient marker density is available, the use of diverse mapping populations with shorter-range linkage disequilibrium also enables finer resolution of QTL positions.
The statistical methods used for GWAS have evolved considerably over the past 15 yr (Balding, 2006; Li et al., 2014). One of the earliest approaches was the transmission disequilibrium test (TDT), which relied on parent-offspring relationships to identify significant associations (Spielman et al., 1993; Allison, 1997). To move beyond familial data, methods were developed to identify subpopulations from the markers and then use these groups as covariates in the analysis. The STRUCTURE (Pritchard et al., 2000) and EIGENSTRAT (Price et al., 2006) programs, which produce matrices typically denoted Q and P, respectively, are examples of this approach. Yu et al. (2006) demonstrated the value of including a random polygenic effect in the model, with covariance proportional to a marker-estimated kinship (K) matrix. As a result of computational innovations (Kang et al., 2008, 2010; Zhang et al., 2010) and the availability of several software packages (Bradbury et al., 2007; Endelman, 2011; Zhou and Stephens, 2012; Lipka et al., 2012), the Q + K method is now the most widely used single-marker test for GWAS in diploid species.
A few research groups have investigated the use of Q + K models for GWAS in autopolyploid species, particularly autotetraploid potato (Solanum tuberosum, 2n = 4x = 48). In some cases, diploid models and software have been used because (i) the markers were dominant (Malosetti et al., 2007), or (ii) the markers were codominant but the allele dosage for heterozygotes could not be experimentally determined (Li et al., 2011). Even when tetraploid allele dosage is available, it may be disregarded (i.e., the marker data are “diploidized”) to facilitate the use of diploid software (Simko et al., 2006). The first use of tetraploid marker data and association analysis models in potato was in studies of candidate genes (Pajerowska-Mukhtar et al., 2009; Stich and Gebhardt, 2011), and these methods were later extended to true genome-wide analyses (Uitdewilligen et al., 2013). Thus far, however, no one has released polyploid GWAS software targeted to the plant breeding and genetics community.
We report here on the development and validation of R package GWASpoly, which was designed for GWAS with biallelic single nucleotide polymorphisms (SNPs) in autopolyploids using the Q (or P) + K method. One design objective for the software was to incorporate different models of gene action: for ploidy level N there are potentially N degrees of freedom (df) for the single marker test. Another goal was to investigate different types of kinship (K) models for autopolyploids. A number of approaches to modeling kinship have been used in diploids (Bradbury et al., 2007; Zhao et al., 2007; Stich et al., 2008; Kang et al., 2008; Endelman, 2011), but no single method has emerged as clearly superior.
In addition to demonstrating the performance of the software with simulated data, we also present GWAS results for a tetraploid potato diversity panel, which was genotyped and phenotyped as part of the USDA-NIFA SolCAP (Hirsch et al., 2013). The availability of a reasonably priced Infinium array for potato, which originally had 8303 SNPs (Hamilton et al., 2011; Felcher et al., 2012) but has now been extended to more than 12,000, and the ability to reliably call tetraploid dosage for such markers (Voorrips et al., 2011; Hirsch et al., 2013) have created an urgent need for tetraploid GWAS software.
Materials and Methods
Q + K Model for Autotetraploids
The Q + K linear mixed model for GWAS can be written as (Yu et al., 2006; Kang et al., 2008):
For biallelic SNPs in autotetraploids, there are five genotype classes which can be parameterized by the dosage of the minor allele: {0, 1, 2, 3, 4}. The most general type of genetic model allows the fixed effect for each genotype class to be arbitrary. Because it is only the difference between the levels of the fixed effect that matter for the F-test, there are 4 df for this model (one less than the number of genotype classes = ploidy level). A number of parameterizations for the general model are possible (Gallais, 2003; Pajerowska-Mukhtar et al., 2009), but for the purpose of the F-test, they are statistically equivalent.
In addition to the general model, we present results for four different single-parameter genetic models, which are depicted in Fig. 1: additive, simplex dominant, duplex dominant, and diploidized additive. In the additive model, the SNP effect is proportional to the dosage of the minor allele. In the simplex dominant model, all three heterozygotes are equivalent to one of the homozygotes; as there are two homozygous classes, there are two nonequivalent simplex dominant parameterizations for each marker. There are also two nonequivalent duplex dominant models for each marker, in which the duplex state (AABB) has a common effect with either the simplex (AAAB) and nulliplex (AAAA) states, or with the triplex (ABBB) and quadriplex (BBBB) states. In the diploidized additive model, all three heterozygous classes are equivalent and exactly halfway between the two homozygotes. We note that, for the simplex dominant and diploidized additive models, the mapping from genotype state to SNP effect is the same regardless of whether the genotypic data are diploidized (called as {0, 1, 2}, failing to differentiate between the heterozygotes) or if tetraploid dosage is called. The additive and duplex dominant models require tetraploid genotype data.
[IMAGE OMITTED. SEE PDF]
Three tetraploid kinship (or relationship) models were compared. The first is the canonical relationship matrix used in genome-wide prediction studies (VanRaden, 2008), which we call the realized relationship model:
A key diagnostic for GWAS is a quantile-quantile plot of the observed vs. expected –log p values, which should follow a uniform distribution under the null hypothesis. The inflation of p-values above the y = x line in such a plot is an indicator of the failure of the model to control for population structure. Inflation was quantified by the linear regression coefficient of the observed vs. expected –log10 p-values, denoted λ, which has a value of 1 under the null hypothesis (Riedelsheimer et al., 2012).
The average inflation across different GWAS models and traits was compared by analysis of variance, according to
Three different methods are available in GWASpoly for establishing a p-value detection threshold for statistical significance. The first is the Bonferroni correction, which uses a threshold of α/m to ensure the genome-wide type I error with m markers is no greater than α. The second approach is the random permutation test, in which phenotypes are randomly permuted to explicitly construct the genome-wide null distribution of p-values (Churchill and Doerge, 1994). The third option uses the qvalue package (Storey and Tibshirani, 2003) to control the genome-wide false discovery rate (rather than Type I error = probability of false positive). For the simulations, due to their computationally intensive nature, we used the Bonferroni correction with α = 0.05. For the analysis of the real potato data, we used the permutation test with 1000 permutations and genome-wide α = 0.05.
Simulated Populations
Simulated populations and phenotypes were used to validate the software. Random mating autotetraploid populations were simulated using the software PedigreeSim (Voorrips and Maliepaard, 2012), according to the scheme illustrated in Supplemental Fig. S1. The base population consisted of five individuals, from which 10 mating pairs were randomly selected, and 10 progeny per pair were randomly generated to create a population of 100 individuals in Generation 1. In Generations 2 through 999, 100 mating pairs were randomly selected, each contributing 1 offspring, to keep the population size constant at 100. For the last (1000th) generation, N mating pairs were randomly selected, each contributing one offspring to create a population of size N. Results are shown for N = 200, 400, and 600. The simulated genome contained three chromosomes, each 100 cM in length, with 100 loci per cM. Recombination was simulated according to Haldane's mapping function, using the default meiosis parameters governing the formation of quadrivalents. Marker density was varied by subsampling loci (m = 3, 10, 50 per centiMorgan).
To estimate power in each simulated population, one marker was randomly designated as the causal QTL and the remaining markers were converted to biallelic SNPs by randomly assigning the 20 founder alleles to biallelic states (A/B), thereby creating markers with an average minor allele frequency of 0.5. Two different schemes were used to simulate genotypic values. In the first, the causal QTL was also converted to a biallelic locus as above, and allelic effects were sampled from the standard normal distribution. This scheme was used to generate Tables 1 and 2. In the second scheme, which was used for Fig. 2, each of the 20 founder alleles was assigned a different effect, drawn from the standard normal distribution. The phenotypic value for each genotype was the sum of its genotypic value and a random deviate, with variance chosen such that the ratio between the genetic and phenotypic variances of the population was h2 = 0.3. Because there were no subpopulations in the simulated population, we used a K-only GWAS model with the realized relationship matrix. A QTL was considered detected if a SNP within 5 cM of the unobserved QTL had –log p-value above the significance threshold. Conversely, significant markers greater than 5 cM from the QTL were considered false positives. We report the average power and false positive rate based on 1000 replications, with standard errors computed from the binomial distribution.
Table 1 Comparison of the full mixed model (variance components estimated for each marker) vs. the P3D approximation (variance components estimated once) on the statistical power and false positive rate (FPR) in a simulated autotetraploid population with 400 individuals, 600 markers per 100 cM chromosome, and h2 = 0.3. The standard errors for power and FPR were less than 0.015 and 0.008, respectively.
Gene action | Full model | P3D model |
Additive | 0.93 (FPR 0.03) | 0.90 (FPR 0.00) |
Simplex dominant | 0.37 (FPR 0.06) | 0.36 (FPR 0.03) |
Duplex dominant | 0.89 (FPR 0.04) | 0.84 (FPR 0.00) |
Table 2 Effect of true vs. assumed (model) gene action on the statistical power and false positive rate (FPR) in a simulated autotetraploid population with 400 individuals, 600 markers per 100 cM chromosome, and h2 = 0.3. The standard errors for power and FPR were less than 0.008.
Model | True gene action | ||
Additive | Simplex dominant | Duplex dominant | |
Additive | 0.94 (FPR 0.01) | 0.39 (FPR 0.02) | 0.78 (FPR 0.02) |
Simplex dominant | 0.48 (FPR 0.03) | 0.67 (FPR 0.03) | 0.11 (FPR 0.04) |
Duplex dominant | 0.68 (FPR 0.03) | 0.21 (FPR 0.07) | 0.86 (FPR 0.03) |
Diploidized additive | 0.75 (FPR 0.02) | 0.30 (FPR 0.02) | 0.06 (FPR 0.03) |
General | 0.42 (FPR 0.06) | 0.11 (FPR 0.04) | 0.13 (FPR 0.06) |
[IMAGE OMITTED. SEE PDF]
Potato Diversity Panel
The genotypic and phenotypic data were collected as part of the SolCAP. The SolCAP potato diversity panel consists of both diploid and tetraploid wild species, genetic stocks, and cultivated potato lines with release dates ranging from 1857 to 2011 (Hirsch et al., 2013). The panel was genotyped with an Infinium SNP array of 8303 markers (Hamilton et al., 2011; Felcher et al., 2012), and tetraploid marker dosage was determined by Hirsch et al. (2013), principally by visual inspection of the cluster boundaries. Our analysis of population structure was conducted using all 221 tetraploid lines in the panel (Supplemental Table S1), while GWAS results are based on the 187 tetraploid lines with both marker and phenotypic data.
Broad-sense heritability and GWAS results are presented for thirteen quantitative traits, which were measured in up to four environments (New York-2010, Wisconsin-2010, New York-2011, Wisconsin-2011). A randomized, complete block design with two replicates was used in each environment, although not all traits were measured in every environment (the number of environments per trait is shown in Table 4). In addition to the four traits analyzed by Hirsch et al. (2013), which were chip color (1–5 scale), tuber shape (1–5 scale), tuber sucrose and glucose (milligrams gram–1 fresh wt.), we present GWAS results for total yield (kilograms), tuber size and eye depth (1–9 visual scale), vine maturity 95 and 120 d after planting (1–9 visual scale), tuber length (millimeters), tuber width (millimeters), tuber fructose and malic acid content (milligrams gram–1 fresh wt.). Phenotypic data were analyzed with the following linear model
Three different population structure matrices (Q) were compared. The first corresponds to the four subpopulations identified with the program STRUCTURE (Pritchard et al., 2000), as reported by Hirsch et al. (2013). The second matrix was constructed from a principal component analysis (PCA), using centered and scaled marker scores (Price et al., 2006). Since a scree plot of the cumulative percent variation vs. model complexity (Supplemental Fig. S2) showed a gradual increase and no obvious choice for a low-dimensional model, we used four principal components to be consistent with the four covariates used with the other Q models. The third matrix was based on the discriminant analysis of principal components (DAPC) method in R package adegenet (Jombart et al., 2010). Since DAPC is less widely used than PCA or STRUCTURE, we describe it in more detail. In the first step, k-means clustering was used to identify groups. The value k = 4 minimized the Bayesian Information Criterion (BIC) and was thus used for GWAS (group membership probabilities in Supplemental Table S2). However, for the purpose of discussing population structure we selected k = 6, which was still within the shallow minimum of the BIC curve (Supplemental Fig. S3). In the second step of the DAPC method, linear discriminants were computed based on a reduced-rank representation of the marker matrix (Jombart et al., 2010). Unlike PCA, which maximizes the total variation in the dataset, linear discriminants maximize the ratio of the between-group to within-group sum-of-squares. A cross-validation study revealed that the classification error by linear discriminant analysis (LDA) was minimized over a range of model complexities (Supplemental Fig. S4); we selected 60 principle components for LDA at the upper end of the range.
For each trait, four GWA analyses were conducted, based on the additive, simplex dominant, duplex dominant, and the general SNP models. When multiple significant markers were detected within a 10 Mb region, only the most significant (i.e., lowest p-value) was reported, along with the corresponding SNP model.
Statistical power was estimated for the SolCAP panel genotypes using a similar method as for the simulated populations. An additive QTL with h2 = 0.3 was simulated at each marker, which was considered detected if any marker up to 2.5 Mb from the QTL exceeded the detection threshold of α = 0.05/3242 (i.e., the Bonferroni correction for a genome-wide scan). Extending the detection interval up to 5 Mb from the QTL did not change the median power for the genome (Supplemental Table S3). The average power for each QTL was based on 1000 simulations.
Results and Discussion
Validation with Simulated Data
The GWAS software was validated using simulated phenotypes and genotypes from a random mating autotetraploid population (details in Methods). Our first objective was to determine the quality of the P3D approximation for the mixed model (Zhang et al., 2010; Kang et al., 2010), which is widely used in diploid GWAS to reduce the computing time. The P3D approximation involves estimating the variance components only once by REML, and then using those values for each single-marker hypothesis test. Table 1 compares the statistical power and false positive rate of the full mixed model vs. the P3D model for three different types of simulated QTL: additive, simplex dominant, and duplex dominant (see Methods for more information on these models). Using the same p-value detection threshold for both methods, we observed slightly lower statistical power (0.01–0.05) when using the P3D model but also fewer false positives. If the –log10 p threshold for the P3D model were lowered to achieve the same false positive rate for the two methods, the difference in statistical power would be even smaller. For this relatively small dataset of 400 individuals and 1800 markers (600 for each of three linkage groups), the P3D approximation reduced the computing time by a factor of 20. Thus, given its favorable performance, the P3D approach was used for the remainder of the study.
One of the unique features of the software is its ability to conduct the single marker test for association using different models of gene action. Our hypothesis was that the probability of detecting QTL would be higher if the marker model matched the gene action at unobserved QTL. The results shown in Table 2 confirm this hypothesis: for an additive QTL, analysis with an additive model resulted in a statistical power of 0.94, while the next most powerful model detected the QTL with probability 0.75 (standard errors < 0.01). For a simplex dominant QTL, use of the simplex model in the analysis increased power by 0.28 over the next best model (additive). Table 2 also illustrates the consequences of neglecting dosage information for the heterozygous genotypes, that is, diploidizing the data. If the underlying QTL is simplex dominant, there is no loss of power with diploidized marker data, as the simplex dominant model implies all heterozygous genotypes are equivalent. However, when the QTL was additive or duplex dominant, the best diploid model had significantly less power than the best tetraploid one (losses of 0.19 and 0.67, respectively). Table 2 also shows the potential disadvantage of relying solely on the general tetraploid model, which makes no assumptions about gene action, and thus encompasses the other models. This flexibility comes with a penalty of substantially lower statistical power (more than 0.5 less than the best model) because 4 df are needed for the single marker test. This conclusion still holds when the general model is compared against a combination of multiple single-df models with a higher detection threshold to maintain the same false positive rate (data not shown).
Our third objective was to investigate the effects of marker density and population size on statistical power in autotetraploid GWAS. In diploids it is well established that both factors contribute to higher power (Klein, 2007; Spencer et al., 2009), and this trend was also observed in simulated autotetraploid populations (Fig. 2). The left-most bars in Panels A and B of Fig. 2 correspond to a common scenario of 300 markers per 100 cM chromosome and 200 individuals, which is approximately the size of the real potato dataset analyzed below. Figure 2A shows the effect of increasing population size, while Fig. 2B illustrates higher maker density. For the same proportional increase (e.g., twofold), population size had a bigger effect on power than marker density. The two different series in Fig. 2 (solid vs. open) correspond to different types of QTL models. In both cases the markers are biallelic, but the solid bars correspond to biallelic QTL while the open bars are multiallelic QTL. The loss in power for the latter scenario can be viewed as analogous to the loss in power for the off-diagonal elements in Table 2. In both cases there is a mismatch between the markers in the GWAS model and gene action at the unobserved QTL. This mismatch can potentially be overcome through the use of multimarker haplotypes in GWAS (Lorenz et al., 2010).
GWAS of a Tetraploid Potato Diversity Panel
The SolCAP potato diversity panel included 221 tetraploid lines and 3441 tetraploid SNP markers with minor allele frequency greater than 0.05. Based on version 4.03 of the potato reference genome (Potato Genome Sequencing Consortium, 2011; Sharma et al., 2013), the median distance between markers was 67 kb, with a minimum of 3 bp and maximum of 8.2 Mb.
The diversity panel was comprised of potatoes from eight different market categories, listed in Table 3. Previously, the program STRUCTURE had been used to identify subpopulations in this dataset (Hirsch et al., 2013). A commonly used alternative to STRUCTURE for GWAS is PCA. Figure 3B shows the projection of the population onto the first two principal components, which only account for 8% of the total variation in the marker data (scree plot in Supplemental Fig. S2).
Table 3 Comparison between k-means clustering groups (Roman numerals) and market categories.
Market category | I | II | III | IV | V | VI | Total |
Fry processing | 26 | 0 | 1 | 2 | 3 | 2 | 34 |
Table Russet | 10 | 0 | 0 | 0 | 3 | 0 | 13 |
Wild species | 0 | 3 | 0 | 0 | 0 | 0 | 3 |
Genetic stock | 0 | 0 | 0 | 4 | 1 | 0 | 5 |
Pigmented | 0 | 1 | 28 | 3 | 0 | 0 | 32 |
Yellow | 1 | 0 | 3 | 15 | 3 | 5 | 27 |
Round White table | 2 | 2 | 2 | 8 | 15 | 9 | 38 |
Chip processing | 0 | 1 | 0 | 5 | 29 | 34 | 69 |
Total | 39 | 7 | 34 | 37 | 54 | 50 | 221 |
[IMAGE OMITTED. SEE PDF]
To achieve better separation between subpopulations, we used the DAPC technique (Jombart et al., 2010). In the first step, clusters based on the marker data were compared against the market categories. Table 3 and Fig. 3A show the results for k = 6 clusters. As expected, the DAPC technique produced greater separation among groups than PCA, with Groups I–III clearly separated, and Groups IV–VI apparently more closely related. Group I primarily contains the fry processing and table russets, which are closely related and continue to be intermated by breeders. Group II was a small group containing the majority of the wild species in the panel, and Group III contained most of the pigmented (red and purple) types. Group IV contained the majority of the yellow potatoes, along with some round white potatoes for both tablestock and chip processing. Groups V and VI were predominantly round white potatoes, used for both tablestock and chip processing. Hirsch et al. (2013) also observed minor divergence within the round white types based on hierarchical clustering.
One of the hallmarks of using ordinary linear regression (aka, the naïve model) as a test of association in structured populations is the inflation of the –log(p) values relative to the expected value under the null hypothesis (Supplemental Fig. S5; Freedman et al., 2004; Clayton et al., 2005). The use of subpopulation group membership as a covariate in the analysis helps to reduce this inflation. Each boxplot in Fig. 4 shows the distribution of inflation factors across the 13 traits analyzed in this study. All three of the Q models tested—DAPC, PCA, and STRUCTURE—were able to reduce inflation relative to the naïve model, with DAPC and PCA slightly better than STRUCTURE. The DAPC approach (QDAPC) was selected for subsequent analyses.
[IMAGE OMITTED. SEE PDF]
As first shown by Yu et al. (2006), a random polygenic effect with covariance proportional to a kinship matrix K can also reduce inflation. The results in Fig. 4 show that all three kinship matrices we tested were effective, with perhaps a slight advantage to the realized relationship model (KRR = MMT for genotype matrix M), which had significantly less inflation than any of the Q models. Subsequent analyses were conducted using the QDAPC + KRR model.
Several categories of traits were measured on the diversity panel, including agronomic (e.g., total yield, vine maturity), morphological (e.g., tuber shape and eye depth), and biochemical (e.g., tuber sucrose and glucose) properties. Table 4 presents the broad-sense heritability on an entry-mean basis for each trait, which ranged from 0.60 for tuber malic acid content to 0.94 for tuber shape.
Table 4 Broad-sense heritability (h2) and significant quantitative trait loci (QTL) in the potato diversity panel.
Trait | No. environments | h2 | Significant QTL | Model† |
Total yield | 4 | 0.73 | c2_10614 (chr 4 at 71827521, –log10p = 4.8) | DD |
Chip color | 4 | 0.91 | none | |
Eye depth | 4 | 0.74 | c2_11685 (chr 5 at 2288291, –log10p = 4.7) | AD |
c1_8019 (chr10 at 48863165, –log10p = 7.2) | AD | |||
Tuber shape | 4 | 0.94 | c1_8019 (chr10 at 48863165, –log10p = 9.5) | AD |
Tuber size | 2 | 0.81 | none | |
Tuber length | 2 | 0.91 | c1_8019 (chr10 at 48863165, –log10p = 6.5) | AD |
Tuber width | 2 | 0.87 | none | |
Sucrose | 2 | 0.67 | none | |
Glucose | 2 | 0.78 | none | |
Fructose | 2 | 0.85 | none | |
Malic acid | 2 | 0.60 | none | |
Vine maturity at 95 d | 3 | 0.69 | c2_34548 (chr1 at 84727196, –log10p = 4.5) | DD |
c2_13133 (chr9 at 8245062, –log10p = 5.2) | SD | |||
c1_9183 (chr11 at 42627957, –log10p = 4.7) | AD | |||
Vine maturity at 120 d | 2 | 0.80 | c2_25219 (chr7 at 47348171, –log10p = 5.0) | GEN |
Significant QTL were detected for 6 the 13 traits, although many of the QTL were only marginally significant (Table 4; results for all markers in Supplemental Table S4). Significant QTL were not detected for the three tuber sugar traits (sucrose, glucose, and fructose) even though they had heritability comparable to the other traits. This was unexpected as metabolic traits typically have fewer causal loci with larger (and thus more easily detectable) effects compared with a complex trait such as yield (Riedelsheimer et al., 2012). The most significant QTL were for tuber shape and tuber eye depth, both at 48.9 Mb on chromosome 10 (Supplemental Fig. S6). Several biparental linkage mapping studies have mapped major QTL for these traits to the same region (Van Eck et al., 1994; Śliwka et al., 2008; Li et al., 2005; Prashar et al., 2014), although the molecular identities of the QTL have not yet been published. QTL studies in potato frequently detect a major locus affecting plant maturity on chromosome 5 (Bradshaw et al., 2008), which was identified as the StCDF1 gene by Kloosterman et al. (2013). This locus was not detected in our analysis of the SolCAP plant maturity data, although minor QTL were identified on chromosomes 1, 7, 9, and 11.
To better understand the scarcity of major QTL in the GWAS results for the SolCAP panel, a power simulation was performed using simulated QTL and phenotypes, but with the actual marker data. For a monogenic trait with h2 = 0.3, the genome-wide median for the probability of QTL detection was only 0.01 (results for all loci in Supplemental Table S3). Although low power was expected considering the small size of the population (N = 187), this result was even lower than anticipated. To determine if marker density also played a role, the power was plotted against the distance between the QTL and its closest marker (Fig. 5). The red trendline in Fig. 5, which is the 95th percentile, shows that power was lower in regions of lower marker density. We conclude that, in addition to increasing the population size, higher marker density could also improve future GWAS studies in potato.
[IMAGE OMITTED. SEE PDF]
The GWASpoly software is being distributed under the GNU Public License and can be downloaded from (verified 29 Jan. 2016).
Author Contributions
Designed the research: JBE. Contributed phenotypic data: WSD, DSD. Developed the software and analyzed the data: URR, JBE. Wrote the manuscript: URR, JBE.
Supplemental Information Available
Supplemental information is included with this article.
Supplemental Figures S1–S6.
Supplemental Table S1. Marker data for the potato diversity panel.
Supplemental Table S2. Phenotype data for the potato diversity panel.
Supplemental Table S3. Statistical power for the potato diversity panel.
Supplemental Table S4. GWAS results for the potato diversity panel.
Acknowledgments
Financial support was provided to U.R.R. by USDA-NIFA-SCRI Grant No. 2011-51181-30629 (Improved Breeding and Variety Evaluation Methods to Reduce Acrylamide Content and Increase Quality in Processed Potato Products) and to J.B.E. by USDA-NIFA-Hatch Accession No. 1002731 (Genome-wide Association Analysis and Breeding in Potato). Collection of the phenotype and marker data was supported by USDA-NIFA-AFRI Grant No. 2009-85606-05673 (Translating Solanaceae Sequence Diversity and Trait Variation into Applied Outcomes through Integrative Research, Education, and Extension). We thank Paul Bethke and Shelley Jansky for contributing phenotypic data.
Allison, D.B. 1997. Transmission‐disequilibrium tests for quantitative traits. Am. J. Hum. Genet. 60:676–690.
Balding, D.J. 2006. A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 7:781–791. doi:
Bradbury, P.J., Zhang, Z., Kroon, D.E., Casstevens, T.M., Ramdoss, Y., Buckler, E.S.. 2007. TASSEL: Software for association mapping of complex traits in diverse samples. Bioinform. 23:2633–2635. doi:
Bradshaw, J.E., Hackett, C.A., Pande, B., Waugh, R., Bryan, G.J.. 2008. QTL mapping of yield, agronomic and quality traits in tetraploid potato (Solanum tuberosum subsp. tuberosum). Theor. Appl. Genet. 116:193–211. doi:
Churchill, G.A., Doerge, R.W.. 1994. Empirical threshold values for quantitative trait mapping. Genetics 138:963–971.
Clark, S.A., Hickey, J.M., Daetwyler, H.D., van der Werf, J.H.J.. 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44:4. doi:
Clayton, D.G., Walker, N.M., Smyth, D.J., Pask, R., Cooper, J.D., Maier, L.M. et al. 2005. Population structure, differential bias and genomic control in a large‐scale, case–control association study. Nat. Genet. 37:1243–1246. doi:
Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Gen. 4:250–255. doi:
Endelman, J.B., Jannink, J.‐L.. 2012. Shrinkage estimation of the realized relationship matrix. G3: Genes, Genomes, Genet. 2:1405–1413. doi:
Felcher, K.J., Coombs, J.J., Massa, A.N., Hansey, C.N., Hamilton, J.P., Veilleux, R.E., Buell, C.R., Douches, D.S.. 2012. Integration of two diploid potato linkage maps with the potato genome sequence. PLoS ONE 7: [eLocator: e36347]. doi:
Freedman, M.L., Reich, D., Penney, K.L., McDonald, G.J., Mignault, A.A., Patterson, N. et al. 2004. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36:388–393. doi:
Gallais, A. 2003. Quantitative genetics and breeding methods in autopolyploid plants. INRA, Paris.
Gianola, D., van Kaam, J.B.C.H.M.. 2008. Reproducing Kernel Hilbert Spaces Regression methods for genomic assisted prediction of quantitative traits. Genetics 178:2289–2303. doi:
Hamilton, J.P., Hansey, C.N., Whitty, B.R., Stoffel, K., Massa, A.N., van Deynze, A., de Jong, W.S., Douches, D.S., Buell, C.R.. 2011. Single nucleotide polymorphism discovery in elite North American potato germplasm. BMC Genomics 12:302. doi:
Hirsch, C.N., Hirsch, C.D., Felcher, K., Coombs, J., Zarka, D., van Deynze, A. et al. 2013. Retrospective view of North American potato (Solanum tuberosum L.) breeding in the 20th and 21st centuries. G3: Genes, Genomes, Genet. 3:1003–1013. doi:
Jombart, T., Devillard, S., Balloux, F.. 2010. Discriminant analysis of principal components: A new method for the analysis of genetically structured populations. BMC Genet. 11:94. doi:
Kang, H.M., Sul, J.H., Service, S.K., Zaitlen, N.A., Kong, S., Freimer, N.B., Sabatti, C., Eskin, E.. 2010. Variance component model to account for sample structure in genome‐wide association studies. Nat. Genet. 42:348–354. doi:
Kang, H.M., Zaitlen, N.A., Wade, C.M., Kirby, A., Heckerman, D., Daly, M.J., Eskin, E.. 2008. Efficient control of population structure in model organism association mapping. Genetics 178:1709–1723. doi:
Klein, R.J. 2007. Power analysis for genome‐wide association studies. BMC Genet. 8:58. doi:
Kloosterman, B., Abelenda, J.A., Gomez, M.D.M.C., Oortwijn, M., de Boer, J.M., Kowitwanich, K. et al. 2013. Naturally occurring allele diversity allows potato cultivation in northern latitudes. Nature 495:246–250. doi:
Li, M., Liu, X., Bradbury, P., Yu, J., Zhang, Y.‐M., Todhunter, R.J., Buckler, E.S., Zhang, Z.. 2014. Enrichment of statistical power for genome‐wide association studies. BMC Biol. 12:73. doi:
Li, X.Q., de Jong, H., de Jong, D.M., de Jong, W.S.. 2005. Inheritance and genetic mapping of tuber eye depth in cultivated diploid potatoes. Theor. Appl. Genet. 110:1068–1073. doi:
Li, X., Wei, Y., Moore, K.J., Michaud, R., Viands, D.R., Hansen, J.L., Acharya, A., Brummer, E.C.. 2011. Association mapping of biomass yield and stem composition in a tetraploid alfalfa breeding population. Plant Gen. 4:24–35. doi:
Lipka, A.E., Tian, F., Wang, Q., Peiffer, J., Li, M., Bradbury, P.J., Gore, M.A., Buckler, E.S., Zhang, Z.. 2012. GAPIT: Genome association and prediction integrated tool. Bioinform. 28:2397–2399. doi:
Lorenz, A.J., Hamblin, M.T., Jannink, J.‐L.. 2010. Performance of single nucleotide polymorphisms versus haplotypes for genome‐wide association analysis in barley. PLoS ONE 5(11): [eLocator: E14079]. doi:
Malosetti, M., van der Linden, C.G., Vosman, B., van Eeuwijk, F.. 2007. A mixed‐model approach to association mapping using pedigree information with an illustration of resistance to Phytophthora infestans in potato. Genetics 175:879–889. doi:
McCulloch, C.E., Searle, S.R.. 2001. Generalized, linear, and mixed models. John Wiley & Sons, New York.
Myles, S., Peiffer, J., Brown, P.J., Ersoz, E.S., Zhang, Z., Costich, D.E., Buckler, E.S.. 2009. Association mapping: Critical considerations shift from genotyping to experimental design. Plant Cell 21:2194–2202. doi:
Oliehoek, P.A., Windig, J.J., van Arendonk, J.A., Bijma, P.. 2006. Estimating relatedness between individuals in general populations with a focus on their use in conservation programs. Genetics 173:483–496. doi:
Pajerowska‐Mukhtar, K., Stich, B., Achenbach, U., Ballvora, A., Lubeck, J., Strahwald, J., Tacke, E., Hofferbert, H.R., Ilarionova, E., Bellin, D., Walkemeier, B., Basekow, R., Kersten, B., Gebhardt, C.. 2009. Single nucleotide polymorphisms in the allene oxide synthase 2 gene are associated with field resistance to late blight in populations of tetraploid potato cultivars. Genetics 181:1115–1127. doi:
Piepho, H.P. 2009. Ridge regression and extensions for genomewide selection in maize. Crop Sci. 49:1165–1176. doi:
Potato Genome Sequencing Consortium. 2011. Genome sequence and analysis of the tuber crop potato. Nature 475:189–195. doi:
Prashar, A., Hornyik, C., Young, V., McLean, K., Sharma, S.K., Dale, M. F B., Bryan, G.J.. 2014. Construction of a dense SNP map of a highly heterozygous diploid potato population and QTL analysis of tuber shape and eye depth. Theor. Appl. Genet. 127:2159–2171. doi:
Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D.. 2006. Principal components analysis corrects for stratification in genome‐wide association studies. Nat. Genet. 38:904–909. doi:
Pritchard, J.K., Stephens, P., Donnelly, P.. 2000. Inference of population structure using multilocus genotype data. Genetics 155:945–959.
R Development Core Team. 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Riedelsheimer, C., Lisec, J., Czedik‐Eysenberg, A., Sulpice, R., Flis, A., Grieder, C., Altmann, T., Stitt, M., Willmitzer, L., Melchinger, A.E.. 2012. Genome‐wide association mapping of leaf metabolic profiles for dissecting complex traits in maize. Proc. Natl. Acad. Sci. USA 109:8872–8877. doi:
Sharma, S.K., Bolser, D., de Boer, J., Sønderkaer, M., Amoros, W., Carboni, M.F. et al. 2013. Construction of reference chromosome‐scale pseudomolecules for potato: Integrating the potato genome with genetic and physical maps. G3: Genes, Genomes, Genet. 3:2031–2047. doi:
Simko, I., Haynes, K.G., Jones, R.W.. 2006. Assessment of linkage disequilibrium in potato genome with single nucleotide polymorphism markers. Genetics 173:2237–2245. doi:
Śliwka, J., Wasilewicz‐Flis, I., Jakuczun, H., Gebhardt, C.. 2008. Tagging quantitative trait loci for dormancy, tuber shape, regularity of tuber shape, eye depth and flesh colour in diploid potato originated from six Solanum species. Plant Breed. 127:49–55. doi:
Spencer, C.C., Su, Z., Donnelly, P., Marchini, J.. 2009. Designing genome‐wide association studies: Sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 5(5): [eLocator: E1000477]. doi:
Spielman, R.S., McGinnis, R.E., Ewens, W.J.. 1993. Transmission test for linkage disequilibrium: The insulin gene region and insulin‐dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52:506–516.
Stich, B., Gebhardt, C.. 2011. Detection of epistatic interactions in association mapping populations: An example from tetraploid potato. Heredity 107:537–547. doi:
Stich, B., Mohring, J., Piepho, H.‐P., Heckenberger, M., Buckler, E.S., Melchinger, A.E.. 2008. Comparison of mixed‐model approaches for association mapping. Genetics 178:1745–1754. doi:
Storey, J.D., Tibshirani, R.. 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA 100:9440–9445. doi:
Uitdewilligen, J.G.A.M.L., Wolters, A.‐M.A., D'hoop, B.B., Borm, T.J.A., Visser, R.G.F., van Eck, H.J.. 2013. A next‐generation sequencing method for genotyping‐by‐sequencing of highly heterozygous autotetraploid potato. PLoS ONE 8(5): [eLocator: e62355]. doi:
van Eck, H.J., Jacobs, J.M., Stam, P., Ton, J., Stiekema, W.J., Jacobsen, E.. 1994. Multiple alleles for tuber shape in diploid potato detected by qualitative and quantitative genetic analysis using RFLPs. Genetics 137:303–309.
vanRaden, P.M. 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91:4414–4423. doi:
Voorrips, R.E., Gort, G., Vosman, B.. 2011. Genotype calling in tetraploid species from bi‐allelic marker data using mixture models. BMC Bioinform. 12:172. doi:
Voorrips, R.E., Maliepaard, C.A.. 2012. The simulation of meiosis in diploid and tetraploid organisms using various genetic models. BMC Bioinform. 13:248. doi:
Yu, J., Pressoir, G., Briggs, W.H., Vroh Bi, I., Yamasaki, M., Doebley, J.F., McMullen, M.D., Gaut, B.S., Nielsen, D., Holland, J.B., Kresovich, S., Buckler, E.S.. 2006. A unified mixed‐model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38:203–208. doi:
Zhang, Z., Ersoz, E., Lai, C., Todhunter, R.J., Tiwari, H.K., Gore, M.A., Bradbury, P.J., Yu, J., Arnett, D.K., Ordovas, J.M., Buckler, E.S.. 2010. Mixed linear model approach adapted for genome‐wide association studies. Nat. Genet. 42:355–360. doi:
Zhao, K., Aranzana, M.J., Kim, S., Lister, C., Shindo, C., Tang, C., Toomajian, C., Zheng, H., Dean, C., Marjoram, P., Nordborg, M.. 2007. An Arabidopsis example of association mapping in structured samples. PLoS Genet. 3(1): [eLocator: e4]. doi:
Zhou, X., Stephens, M.. 2012. Genome‐wide efficient mixed‐model analysis for association studies. Nat. Genet. 44:821–824. doi:
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2016. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Genome‐wide association studies (GWAS) are widely used in diploid species to study complex traits in diversity and breeding populations, but GWAS software tailored to autopolyploids is lacking. The objectives of this research were to (i) develop an R package for autopolyploids based on the Q + K mixed model, (ii) validate the software with simulated data, and (iii) analyze a diversity panel of tetraploid potatoes. A unique feature of the R package, called GWASpoly, is its ability to model different types of polyploid gene action, including additive, simplex dominant, and duplex dominant. Using a simulated tetraploid population, we confirmed our hypothesis that statistical power is higher when the assumed gene action in the GWAS model matches the gene action at unobserved quantitative trait loci (QTL). Thirteen traits were analyzed in the Solanaceae Coordinated Agricultural Project (SolCAP) potato diversity panel and, consistent with previous studies, significant QTL for tuber shape and eye depth co‐localized on chromosome 10. For the other traits, only marginally significant QTL were detected, most likely due to insufficient statistical power: for simulated traits with a heritability (h2) of 0.3, the median genome‐wide power was only 0.01. Our results indicate that both marker density and population size were limiting factors for GWAS with the SolCAP panel.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details
1 Dep. of Horticulture, Univ. of Wisconsin, Madison, WI,
2 School of Integrative Plant Science, Cornell Univ., Ithaca, NY,
3 Dep. of Plant, Soil and Microbial Sciences, Michigan State Univ., East Lansing, MI,