-
Abbreviations
- BPH
- better-parent heterosis
- FWT
- fruit weight
- g-BLUP
- genomic best linear unbiased predictor
- GBS
- genotyping-by-sequencing
- GCA
- general combining ability
- GEBV
- genomic estimated breeding value
- GS
- genomic selection
- GV
- growth vigor
- LASSO
- least absolute shrinkage and selection operator
- LD
- linkage disequilibrium
- LMM
- linear mixed-effects model
- LS
- leaf size
- MPH
- midparent heterosis
- PCA
- principal component analysis
- QTL
- quantitative trait loci
- REML
- restricted maximum likelihood estimation
- SCA
- specific combining ability
- SNP
- single nucleotide polymorphism
- TBV
- true breeding value
- Pumpkin, with rich nutritional compounds, is a staple food in many developing countries.
- Genomic prediction is a practical tool for hybrid performance evaluation in plant breeding.
- Genomic prediction can reduce cost and accelerate a breeding program.
- Genomic prediction applied to select potential hybrid combinations and superior parental lines.
Pumpkin (2n = 40), which belongs to Cucurbitaceae family, consists of at least five domesticated species: C. maxima, C. moschata, C. pepo L., C. argyrosperma C. Huber, and C. ficifolia Bouché (Schaffer and Paris, 2003). The annual production of all Cucurbita crops in the world has exceeded 26.4 Tg in 2016 (
Hybrid breeding is a means of using heterosis to improve the yield and quality of a crop. The MPH (Falconer and Mackay, 1996) is defined as the difference between the hybrid performance and the average of its parental lines. Alternatively, BPH (Fonseca and Patterson, 1968) is defined as the hybrid performance superior to the higher or better parental line. The hybrid breeding is preferred to inbred-line breeding for some reasons such as hybrids can bring greater yield stability, heterozygosity of hybrids allows the combination of dominant major genes, and hybrids offer a built-in plant variety protection system (Longin et al., 2012).
A full- or half-diallel design has been conventionally used to estimate SCA for hybrids and GCA for their parental lines in a field trial (Darrudi et al., 2018). From the genetics perspective, the GCA is an indication of additive effects, and the SCA is relevant to nonadditive effects such as dominance and epitasis. In a hybrid breeding program, the main interest of breeders lies in selection of potential hybrid combinations with high SCA and superior parental lines with high GCA (Virmani, 1994). However, the number of crossing combinations can increase dramatically as the number of parental lines increases. This could raise an important issue that the traditional evaluation method for all possible hybrid combinations can be costly or impractical in the field experiment (Fristche-Neto et al., 2018). Recently, GS has been increasingly popular in plant breeding, which can reduce cost and accelerate a breeding program (Jannink et al., 2010; Desta and Ortiz, 2014). In practice, GS has been applied to predict hybrid performance for some important crops such as maize (Zea mays L.) (Technow et al., 2012, 2014), rice (Oryza sativa L.) (Xu et al., 2014; Wang et al., 2016) and wheat (Triticum aestivum L.) (Zhao et al., 2013).
The main concept of GS is to capture all effects of quantitative trait loci (QTL) using dense DNA markers over an entire genome (Meuwissen et al., 2001). The most common DNA markers used in GS are SNPs. Typically, a training population with known genotype and phenotype data is used to build a GS model. Then, the resulting GS model is employed to predict GEBVs for the individuals of a test population with known genotype data only. It is anticipated that incorporating both additive and dominance marker effects into a GS model for hybrid performance prediction could provide better prediction accuracy than the model with additive effects only.
Most of the genomic prediction studies focus on quantitative traits measured in continuous scale. To save the cost of phenotyping, ordinal scores are sometimes recorded for a quantitative trait such as the degree of disease resistance, growth vigor, fruit set ability, etc. The prediction accuracy could be reduced if the methods for analyzing continuous traits are directly adopted to ordinal traits, which are measured by positive integer numbers. Iwata et al. (2009) developed a Bayesian method based on the threshold model for association study on ordinal traits. Similarly, Wang et al. (2013) extended the GS study for ordinal traits and called their proposed methods BayesT methods. Kizilkaya et al. (2014) further quantified the reduction in prediction accuracy of a GS study for ordinal traits vs. continuous traits through simulation studies. In this study, we apply three BayesT methods to analyze two ordinal traits under the additive–dominance effects model.
MATERIALS AND METHODS Pumpkin Data SetThe genomic DNA of 320 parental lines was extracted from fresh leaves using a Qiagen Plant DNeasy kit. The genotyping-by-sequencing (GBS) libraries were prepared by the protocol proposed in Elshire et al. (2011) and were sequenced on Illumina system HiSeq 2500 machine. The GBS sequenced reads from 320 parental lines of pumpkin were processed using TASSEL-GBS pipeline for SNP calling (Glaubitz et al., 2014) and aligned to the genome of C. maxima downloaded from Cucurbit Genomics Database (
The phenotypic data of hybrids comprise one continuous trait of fruit weight (FWT) and two ordinal traits of growth vigor (GV) and leaf size (LS). The number of phenotypic values of the traits for two intracrossing groups of C. maxima and C. moschata is given in Table 1. Growth vigor is related to the vitality and exuberance of a plant, which was measured at the beginning of reproductive stage and recorded by five ordinal scores. Leaf size was classified into five ordinal scores according to the area of fully expanded leaf.
The phenotypic data were summarized from a collection of historical data provided by Known-You Seed Co., Ltd., which were scattered over the period of 1988 to 2016. All the trials were conducted at the experimental field of the seed company located in southern Taiwan, which can be treated as a single location. Every hybrid had six to ten observations at each time point, and the average of the observations is used as the phenotypic value for the hybrid at the time point. The phenotypic values of one hybrid could be observed across different years, so we remove the year effects from the phenotypic values. It is assumed that the year effects are random effects following the normal distribution . Then, the variance component is estimated by pooling all possible sample variances of hybrids as follows:[Image Omitted. See PDF]where ki is the number of phenotypic values over the years for hybrid i and is its sample variance. The estimates of year effects are then generated from . The are obtained as 0.076 and 0.243 for trait FWT of the intracrossing groups C. maxima and C. moschata, respectively. The adjusted mean for each hybrid is the average of the phenotypic values, which have been adjusted to the corresponding year effect estimates. The adjusted means of hybrids are used for building GS models.
Table 1 The number of phenotypic values for the traits in the two intracrossing groups.
FWT, fruit weight (continuous trait); GV, growth vigor (ordinal trait); LS, leaf size (ordinal trait).
Statistical ModelsThe following additive-effects model, abbreviated as the A model hereinafter, is typically used in a GS study: [Image Omitted. See PDF]where y(ij) is a continuous phenotypic value of hybrid Pi × Pj (both Pi and Pj are parental lines); μ is the constant term; coded as -1, 0, or 1 if M1M1, M1M2, or M2M2 occurs at locus l for the hybrid; the additive effect at locus l; p the number of markers and e(ij) a random error following the normal distribution . However, considering the heterozygosity effects for a hybrid, we extend to the following additive plus dominance effects model, abbreviated as the A+D model hereinafter: [Image Omitted. See PDF]where coded as 1 if M1M2 occurs at locus l for the hybrid; otherwise , and denotes the dominance effect at locus l.
If y(ij) is an ordinal phenotypic value, we proceed with the threshold model. Suppose that y(ij) is recorded as one of m ordinal scores, and the threshold model is given by the following:[Image Omitted. See PDF]where γ1,γ2,…,γm−1 are threshold parameters used to define the ordinal categories, and w(ij) denotes a latent variable. This means that the original phenotypic value is continuous but now recorded as a positive integer. In other words, w(ij) can be thought of as a scaled continuous phenotypic value of Model 2 (Eq. [2]). Let
, then this relationship can be described by w(ij) = η(ij) + ε(ij) where ε(ij) is distributed as the standard normal distribution N(0,1). Also, let γ be the threshold parameters vector. The conditional probability of y(ij) given η(ij) and γ is thus obtained as follows: [Image Omitted. See PDF]where Φ(·) is the cumulative probability function of N(0,1). This conditional probability is called as the probit link that is a function of the unknown marker effects and threshold parameters.
Estimation Methods for Marker EffectsIt is challenging to estimate all the parameters in the GS Model 1 or 2, because that the number of the additive effects in Model 1 or both the additive and dominance effects in Model 2 is usually much greater than the number of phenotypic values. We thus adopt linear mixed-effects model (LMM) and three Bayesian estimation methods—Bayes A, Bayes B, and Bayes C—to tackle this problem.
Let βΑ and βD be the additive and dominance effects vectors, respectively. The LMM estimation treats βΑ and βD as two sets of random effects distributed with and , respectively. The BLUPs for βΑ and βD are then obtained through the following Henderson's equations (Henderson, 1984):[Image Omitted. See PDF]where Ip is the identity matrix of order p (here p is the number of markers), and 1n the unit vector of length n (here n is the number of the hybrids). The regularization parameters are given by . The variance components are estimated using restricted maximum likelihood estimation (REML). This method can be performed with the built-in function mmer() in R package sommer (Covarrubias-Pazaran, 2016).
Bayes A implements three hierarchical priors. It is assumed that βA ∼ N(0,GA) and βD ∼ N(0,GD), where GA and GD are diagonal matrices with diagonal elements , separately. Here stand for the variance parameters of additive and dominance effects at locus l, respectively. Second, the variance parameters are assumed to be distributed as scaled-inverse chi-square distributions. Third, the scale parameters in the scaled-inverse chi-square distributions are further assumed to follow gamma distributions. Alternatively, Bayes B is modified from Bayes A but considers a mixture priors that a proportion π of marker effects is nonzero, and the remaining 1 − π is null effect. The nonzero effects have the same priors structure as Bayes A. Bayes C also implements the mixture priors same as Bayes B, except the variances of the nonzero effects are assumed to be homogenous. We refer to Perez and de los Campos (2014) for more details regarding these Bayesian methods and use their R package BGLR to perform data analysis.
The Bayesian estimation methods Bayes TA, Bayes TB, and Bayes TC are based on the probit link function of Eq. [3], in which their hierarchical priors are according to Bayes A, Bayes B, and Bayes C, respectively. The unknown marker effects together with the threshold parameters are estimated via Gibbs sampling using the R package BGLR.
Simulation StudyA simulation study is conducted to compare the two GS models and evaluate the four estimation methods. We use the genotype data of 142 accessions within C. maxima as a template. The 142 parental lines can produce =10,011 possible hybrids. We first perform linkage disequilibrium (LD) decay analysis on the original 4521 SNPs. The detail regarding the LD decay analysis can be found in Supplemental information. From the LD decay analysis result, we determine 1174 LD blocks over the genome. We thus arbitrarily fix the number of QTL as the number of LD blocks in this simulation study. The detail regarding the simulation study is given below.
Step 1: Specify the true breeding values (TBVs) for the 10,011 hybrids. We first randomly select one SNP from each LD block then specify marker effects for these 1174 chosen SNPs. The true additive effects are generated from the gamma distribution with α1 = 1 and α2 = 2, where α1 and α2 are the shape and scale parameters, respectively. We consider three different cases for the specification of dominance effects. Case 1 simply considers the null dominance effects, that is, letting all the dominance effects be zero. For Cases 2 and 3, we first separately generate the true values of the ratio of dominance effect to additive effect from another two gamma distributions with (α1, α2) = (1, 0.5) and (2, 0.5). The true dominance effects are then specified as the true additive effects multiplied by the true values of the ratio. The expected values of the ratio are equal to 0.5 and 1 (α1×α2) for Cases 2 and 3, respectively. So Case 3 would have stronger dominance effects than Case 2. The averages of the true MPH (the summation of the true dominance effects over all the 1174 SNPs) are given by 0, 388, and 738 for Cases 1, 2, and 3, respectively. In summary, Cases 1, 2, and 3 represent the situations of null, mediate, and strong dominance effects, respectively. The TBVs for the 10,011 hybrids are then obtained as the summation of the true additive and dominance effects over all the 1174 SNPs.
Step 2: Generate the simulated phenotypic values for the 10,011 hybrids. The sample variance for the TBVs specified in Step 1 is calculated as the estimate for genetic variance, denoted by . The simulated phenotypic values are their TBVs plus the random residuals sampled from a normal distribution , where h2 represents the heritability that is fixed at 0.3, 0.5, and 0.7. The observed values of heritability are calculated as 0.299, 0.500, and 0.699 from the 10,011 simulated data.
Step 3: Determine the training population and build the GS models. For each simulated data set, we randomly choose 150 hybrids from the 10,011 hybrids as the training population to build the two GS models by the four estimation methods. Note that the marker matrix consists of the original 4521 SNPs in the GS model building.
Step 4: Predict GEBVs for the individuals in the test population and calculate Pearson's correlation between the GEBVs and simulated phenotypic values. The remaining 9861 hybrids not selected in the training population in Step 3 serve as the test population. The GEBVs are exactly the fitted values from the two GS models.
There are nine simulation scenarios (three different cases for true dominance effects by three different values of h2). The procedure of Steps 2 to 4 is repeated 100 times for each simulation scenario. The average of the resulting 100 Pearson's correlation coefficients will be reported to measure prediction accuracy for the simulation scenario.
Pumpkin Data AnalysisWe analyze the pumpkin data for each of C. maxima and C. moschata intracrossing groups with the following steps:
Step 1: Evaluate the GS models and estimation methods through data cross-validation. We employ the leave-one-out procedure, in which the two GS models are built using n − 1 hybrids and validated with the remaining hybrid. This process is repeated n times such that all the hybrids have been the validated individual. For the continuous trait, Pearson's correlation is still used as a measure of prediction accuracy. However, Pearson's correlation is unable to exactly measure the prediction accuracy for ordinal traits because the phenotypic values are discrete levels. Here we propose an index to evaluate the cross-validation. From the probit link function of Eq. [3], we simply replace the latent value η(ij) with its GEBV to yield the estimated prediction probability of the test hybrid that falls into each ordinal category. The test hybrid is thus designated to be the category with the highest prediction probability. The correct prediction probability is defined as the probability of correct prediction for the n hybrids.
Step 2: Derive SCA for each hybrid and GCAs for its parental lines to quantify the degree of MPH and BPH of the hybrid combination. Let SCA(ij) be the SCA for hybrid PiʹPj. Also, let GCA(i) and GCA(j) be the GCAs for its parental lines i and j, respectively. From Werner et al. (2018), the A+D model of Eq. [2] can be rewritten as follows: [Image Omitted. See PDF]where
(the dominance effects part). Here is coded as −0.5, 0, or 0.5 if M1M1, M1M2, or M2M2 occurs at locus l for parental line i and similarly is for parental line j, so that (the additive effects part).
Moreover, let GEBV(ij), GEBV(i), and GEBV(j) be the GEBVs for the hybrid and its two parental lines. Then, from Model 4, there exist the following relationships:[Image Omitted. See PDF]where E[·] denotes the expectation operator. Furthermore, from the definition of MPH and BPH for a hybrid combination, we have the following:[Image Omitted. See PDF]where |GCA(i) − GCA(j)| denotes the absolute value of GCA(i) − GCA(j). So the degree of MPH for a hybrid can be directly quantified by its GEBV-based SCA, and the degree of BPH must be measured by not only its GEBV-based SCA but also its parents’ GCAs. Under the positive heterosis assumption, the value of MPH or BPH is larger, and the heterosis of the hybrid is stronger.
Step 3: Estimate the variance components and heritability for a continuous trait. Based on Model 4, we treat all the GCA effects as a set of random effects and all the SCA effects as another set of random effects. Then, we express the genomic BLUP (g-BLUP) model as the following matrix form: [Image Omitted. See PDF]where y denotes the phenotypic values vector; 1n is the unit vector of length n (here n is the number of the hybrids); μ is the constant term; uGCA and uSCA are vectors of GCA and SCA effects, respectively; ZGCA and ZSCA are incidence matrices of GCA and SCA effects, respectively; and e is the vector of residuals following the normal distribution . It is assumed that , where A and D denote the additive and dominance genomic relationship matrices, respectively. The A and D matrices are calculated using the methods presented in Endelman and Jannink (2012) and Su et al. (2012), respectively. We use the sommer package (Covarrubias-Pazaran, 2016) to obtain the REML estimates for the variance components . Subsequently, the heritability is obtained as the ratio of genetic variance to the total phenotypic variance .
RESULTS Simulation StudyThe averages of the 100 simulated Pearson's correlation coefficients estimated from the four estimation methods for the nine simulation scenarios based on the two GS models are displayed in Fig. 1. The A model is found to have slightly better prediction accuracy than the A+D model under Case 1 of the null dominance effects situation. This could be due to the redundant dominance effect parameters in the A+D model. However, the A+D model indeed appears to have better prediction accuracy than the A model for both Cases 2 and 3 of the mediate and strong dominance effects situations. The difference of the Pearson's correlation averages between the two GS models increases as the true dominance effects get stronger. Interestingly, the Pearson's correlation average at a fixed heritability and GS model combination decreases as the true dominance effects get stronger. For example, at LMM, h2 = 0.7 and the A+D model, the average of the 100 simulated Pearson's correlation coefficients, is 0.765 for Case 1, 0.689 for Case 2. and 0.637 for Case 3. This could be due to that the bias in estimation of marker effects increases as the true dominance effects get stronger. In consequence, the A+D model could have robust prediction ability for hybrid performance, even the trait under study is mainly controlled by additive effects. Moreover, the LMM and three Bayesian estimations are found to have quite similar prediction accuracy under the same scenario and the GS model. As expected, prediction accuracy increases with the increase in heritability.
Pumpkin Data Analysis Cross-ValidationWe first evaluate the two GS models and the four estimation methods through the leave-one-out cross-validation to determine the most appropriate method for each trait. The resulting Pearson's correlation for the continuous trait of FWT is displayed in Table 2. From the table, the A+D model can have slightly better prediction accuracy than the A model in the C. maxima intracrossing group. In contrast to the simulation results of Fig. 1, this might imply that the variation of trait FWT in the C. maxima intracrossing group is due to both the additive and dominance effects. On the other hand, the A model slightly outperforms the A+D model in the C. moschata intracrossing group, implying that trait FWT in this group could be almost completely controlled by the additive effects, and the dominance effects can be negligible. The Pearson's correlation for the C. maxima group is much greater than the C. moschata group, indicating that the heritability for FWT of the C. maxima intracrossing group should be significantly larger than the C. moschata intracrossing group.
Fig. 1. The averages of the 100 simulated Pearson's correlation coefficients (prediction accuracy) under the four estimation methods (LMM, Bayes A, Bayes B, and Bayes C) and nine simulation scenarios (three different heritability [0.3, 0.5, and 0.7] crossing three cases [Case 1, null dominance effects; Case 2, mediate dominance effects; and Case 3, strong dominance effects]) using the two genomic selection models (A and A+D model). Note that A model means additive effects model and A+D model is additive plus dominance effects model.
According to the ranking in Table 2, we choose Bayes C and LMM using the A+D model for trait FWT in the intracrossing groups of C. maxima and C. moshcata, respectively. The resulting correct prediction probability of the cross-validation is summarized in Table 3 for ordinal traits. From the ranking in the table, we simply choose Bayes TB for trait GV and Bayes TA for trait LS using the A+D model in both intracrossing groups.
Table 2 Pearson's correlation and ranking (in parentheses) for the cross-validation of trait fruit weight.
Group | Model† | Estimation method | |||
LMM‡ | Bayes A | Bayes B | Bayes C | ||
C. maxima | A | 0.761 (3) | 0.767 (1) | 0.761 (4) | 0.762 (2) |
A+D | 0.761 (4) | 0.775 (2) | 0.774 (3) | 0.777 (1) | |
C. moschata | A | 0.482 (1) | 0.472 (4) | 0.476 (2) | 0.475 (3) |
A+D | 0.481 (1) | 0.436 (4) | 0.447 (2) | 0.437 (3) |
A, additive effects model; A+D, additive plus dominance effects model.
LMM, linear mixed-effects model.
Combining Ability and HeterosisThere are possible hybrid combinations within the intracrossing groups of C. maxima and C. moschata, respectively. The GEBV-based SCA for each hybrid and GCAs for its parental lines are calculated using the most appropriate estimation methods determined from the cross-validation results. For illustration purposes, we report only the top 25 potential hybrids with large GEBV(ij), together with their associated SCA, GCA, MPH, and BPH in Supplemental Table S1, and the top 10 superior parental lines with large GEBV-based GCAs in Supplemental Table S7 for trait FWT within the C. maxima intracrossing group. The corresponding results for trait FWT within the C. moschata are summarized in Supplemental Tables S2 and S8.
Table 3 Correct prediction probability and ranking (in parentheses) for the cross-validation of ordinal traits growth vigor (GV) and leaf size (LS).
Group | Trait | Model† | Estimation method | ||
Bayes TA | Bayes TB | Bayes TC | |||
C. maxima | GV | A | 0.892 (2) | 0.900 (1) | 0.875 (3) |
A+D | 0.883 (2) | 0.892 (1) | 0.883 (2) | ||
LS | A | 0.875 (1) | 0.875 (1) | 0.867 (2) | |
A+D | 0.883 (1) | 0.875 (2) | 0.875 (2) | ||
C. moschata | GV | A | 0.685 (1) | 0.685 (1) | 0.685 (1) |
A+D | 0.658 (3) | 0.685 (1) | 0.676 (2) | ||
LS | A | 0.711 (1) | 0.711 (1) | 0.711 (1) | |
A+D | 0.711 (1) | 0.711 (1) | 0.711 (1) |
A, additive effects model; A+D, additive plus dominance effects model.
From Supplemental Table S1, there is an important finding that GEBV(ij) are all greater than both GEBV(i) and GEBV(j), meaning that the hybrids have superior performance over the better parental line (all the corresponding BPH(ij) > 0). Besides, the top two hybrids with high phenotypic values in the training data (P026×P236 = 3.73; P026×P234 = 3.60) appear at the first and second rankings among the top 25 potential hybrids. The top five superior parental lines P236, P235, P026, P234, and P028 in Supplemental Table S7 are found to involve the top five potential hybrids P026×P236, P026×P234, P026×P235, P026×P027, and P026×P028 in Supplemental Table S1. Particularly, P026 involves 10 of the top 25 potential hybrids, so it can be a top potential parental line. From Supplemental Table S2, all values for MPH(ij) equal values for SCA(ij), which all equal zero, implying that there is no heterosis effect for trait FWT within the C. moschata intracrossing group. Also, all the values for BPH(ij) are less than zero in Supplemental Table S2, showing that every hybrid has worse performance than at least one of its parental lines. From Supplemental Table S8, P308 could be the best parental line with large FWT, and the hybrids involving P308 are expected to have better performance (Supplemental Table S2). The corresponding results for ordinal traits GV and LS within both intracrossing groups are available in Supplemental Tables S3 through S6 and Supplemental Tables S9 through S12.
Variance Components and HeritabilityWe further estimate the variance components and heritability for trait FWT within each intracrossing group using the g-BLUP model of Eq. [5]. The results are summarized in Table 4. From the table, there exist some non-negligible dominance effects in the C. maxima intracrossing group but none in the C. moschata intracrossing group. The estimates of the variance components of the GCA and SCA effects reflect the results in Table 2 that the two GS models do not have consistent prediction accuracy comparisons between the two intracrossing groups. Also, these variance component estimates explain why the BPH(ij) within the C. maxima group (Supplemental Table S1) are all greater than zero, and the MPH(ij) within the C. moschata group (Supplemental Table S2) are all equal to zero. Finally, the heritability estimate for the C. maxima group is indeed greater than the C. moschata group, so the former has much higher prediction accuracy (Table 2).
DISCUSSIONIn this study, we have demonstrated that the additive–dominance effects model is advantageous over the additive effects model for genomic prediction of hybrid performance. Not only does it have credible prediction accuracy, but it also enables breeders to quantify the degree of MPH or BPH for individual hybrids according to GEBV-based SCAs and GCAs. In practice, we have developed a systemic procedure to carry out genomic prediction for hybrid performance. The proposed analysis method could provide useful information for breeders to select promising hybrid combinations and parental lines from a hybrid breeding program.
Table 4 Estimates of variance components and heritability for trait fruit weight (FWT) in the two intracrossing groups.
Group | † | ‡ | § | h2¶ |
C. maxima | 0.033 | 0.202 | 0.073 | 0.764 |
C. moschata | 0.170 | 0 | 0.694 | 0.197 |
, variance component due to general combining ability (GCA).
, variance component due to specific combining ability (SCA).
, variance component due to nongenetic effects.
h2, heritability, which is calculated by the ratio of genetic variance to the total phenotypic variance .
To explore how the relatedness between training and test populations might influence on the prediction accuracy, we further consider the following studies. Let Tall be the test population and T0, T1, and T2 be three exclusive validation subgroups of Tall. The hybrids in T0, T1, and T2 have 0, 1, and 2, respectively, of their parental lines appearing in hybrid combinations of the training population. In the simulation study, we consider Case 2 of the mediate dominance effects situation, the average prediction accuracy for the validation subgroups are displayed in Table 5. In the pumpkin data analysis, we randomly choose one quarter of observations of trait FWT as a training population within C. maxima and C. moschata intracrossing groups, the remaining observations not selected in the training population serve as the test population. The procedure was repeated 30 times and the average accuracy for the validation subgroups are also displayed in Table 5. As expected, the prediction accuracy of the three validation subgroups for all scenarios under study is ranked as T0 < T1 < T2, which reflects the relatedness and indicates the more precise prediction accuracy among the different validation subgroups. The simulation studies (Technow et al., 2012) and empirical data analysis in maize (Technow et al., 2014) and wheat (Zhao et al., 2013; Kadam et al., 2016) also lead to the consistent results.
There are several factors that might affect the accuracy of genomic prediction such as heritability, marker density, estimation methods for marker effects, training population size, and genomic relationship between training and test populations (Desta and Ortiz, 2014; Zhao et al., 2015; Wang et al., 2018). In this study, we have shown that the prediction accuracy could increase with the increase in heritability and in the degree of relatedness between training and test populations. In addition, we also evaluate some shrinkage estimation methods such as ridge regression, least absolute shrinkage and selection operator (LASSO), and elastic net methods through the same simulation study. The averages of the 100 simulated Pearson's correlation coefficients for these shrinkage estimation methods together with the LMM and three Bayesian methods under Case 2 of the mediate dominance effects situation are displayed in Fig. 2. It is shown that ridge regression appears to have similar prediction accuracy with the LMM and three Bayesian methods. However, LASSO and elastic net methods seem not to perform well. This could be due to the fact that the norm penalty specified for both LASSO and elastic net methods often leads to a sparse solution with just few nonzero marker effects in the model (Wimmer et al., 2013) and their performances are nonsufficient in genomic prediction.
Table 5 The average prediction accuracy for T0, T1, T2, and Tall validation groups in the simulation study and pumpkin data analysis based on the A+D model and Bayes A for the simulation study, Bayes C for C. maxima, and linear mixed-effects model for C. moschata. Note that the A+D model is the additive plus dominance effects model.
Validation group† | ||||||
To | T1 | T2 | Tall | |||
Simulation study | h2 | 0.3 | 0.147 | 0.292 | 0.394 | 0.377 |
0.5 | 0.274 | 0.451 | 0.565 | 0.547 | ||
0.7 | 0.408 | 0.611 | 0.717 | 0.701 | ||
Pumpkin data analysis | – | C. maxima | 0.636 | 0.731 | 0.773 | 0.728 |
C. moschata | 0.052 | 0.350 | 0.392 | 0.331 |
T0, no parent of hybrids appears in the combinations of the training population; T1, only one parent of hybrids appears in the combinations of the training population; T2, both parents of hybrids appear in the combinations of the training population; Tall, all hybrids of the test population.
We now investigate how the marker density may affect the prediction accuracy. As mentioned earlier, the numbers of SNP markers used in the analysis for C. maxima and C. moschata are given by 4521 and 6348 SNPs, respectively. We fit trait FWT by different marker densities, which are determined according to various numbers of SNPs evenly spaced over each chromosome. We also collected phenotypic values of 51 hybrids of the intercrossing group between C. maxima and C. moschata. There were 16,275 SNP markers collected for such more diverse intercrossing group. Similarly, we fit trait FWT by different marker densities for this intercrossing group using Bayes A. The prediction accuracy for the two intra- and the intercrossing groups are displayed in Fig. 3. From the figure, the prediction accuracy generally appears to improve as the marker density increases from the beginning, which could enhance the chance of marker–QTL associations to capture more accurate marker effects (Desta and Ortiz, 2014; Norman et al., 2018). However, the prediction accuracy gradually reaches the plateau as marker density exceeds 1000, 2000, and 4000 SNPs for the C. maxima intracrossing group, the C. moschata intracrossing group, and their intercrossing group, respectively. This indicates that more diverse populations might need more markers to achieve better performance. However, the prediction accuracy tends to slightly decrease as the number of SNPs gets greater than 8000 for the intercrossing group. This could be due to the model overfitting caused from redundant markers (Hickey et al., 2014).
Fig. 2. The averages of the 100 simulated Pearson's correlation coefficients under Case 2 of the mediate dominance effects situation using the A+D model. Note that A+D model is additive plus dominance effects model.
Although the use of ordinal score as a measure can save the cost for phenotyping, this could lead to a loss of information in data analysis. It has been verified that the increase in the number of ordinal levels can enhance prediction accuracy of a GS model (Kizilkaya et al., 2014). Thus, one should increase the number of levels as many as possible when using the ordinal score as a measure. As discussed earlier, Pearson's correlation could be an unsuitable index for assessing the accuracy of genomic prediction for ordinal traits. There is still room for improvement to develop methods for evaluating genomic prediction for ordinal traits. The ranking method proposed by Blondel et al. (2015) could be a promising one.
Fig. 3. The prediction accuracy under different marker densities for the two intracrossing groups of Cucurbita maxima using Bayes C and C. moschata using LMM and the intercrossing group between C. maxima and C. moschata using Bayes A.
There is often a large number of possible hybrid combinations needed to be evaluated in a hybrid breeding program. Namely, the training population size can be much smaller than the test population size. Thus, it is desired to investigate how the size of training population may affect the prediction accuracy. Under Case 2 of the mediate dominance effects situation in the simulation study, we display the prediction accuracy using Bayes C estimation for the training population size ranging from 10 to 1000 in Fig. 4. From the figure, the prediction accuracy gradually reaches the plateau as the training population size exceeds 300. As suggested by Akdemir et al. (2015), determination of an optimal training population could be an economical and efficient way to a successful GS with the limited resources. This will be an issue for future study.
Conflict of InterestThe authors declare that there is no conflict of interest.
Fig. 4. Simulated prediction accuracy under different sizes of training population using the A+D model and Bayes C estimation. Note that A+D model is additive plus dominance effects model.
Supplemental material is available online for this article.
AcknowledgmentsThis research was supported by the Ministry of Science and Technology Taiwan (grant number: MOST 106-2118-M-002-002-MY2).
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2019. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Genomic prediction has become an increasingly popular tool for hybrid performance evaluation in plant breeding mainly because that it can reduce cost and accelerate a breeding program. In this study, we propose a systematic procedure to predict hybrid performance using a genomic selection (GS) model that takes both additive and dominance marker effects into account. We first demonstrate the advantage of the additive–dominance effects model over the only additive effects model through a simulation study. Based on the additive–dominance model, we predict genomic estimated breeding values (GEBVs) for individual hybrid combinations and their parental lines. The GEBV‐based specific combining ability (SCA) for each hybrid and general combining ability (GCA) for its parental lines are then derived to quantify the degree of midparent heterosis (MPH) or better‐parent heterosis (BPH) of the hybrid. Finally, we estimate the variance components resulting from additive and dominance gene action effects and heritability using a genomic best linear unbiased predictor (g‐BLUP) model. These estimates are used to justify the results of the genomic prediction study. A pumpkin (Cucurbita spp.) data set is given to illustrate the provided procedure. The data set consists of 320 parental lines with 61,179 collected single nucleotide polymorphism (SNP) markers; 119, 120, and 120 phenotypic values of hybrids on three quantitative traits within C.maxima Duchesne; and 89, 111, and 90 phenotypic values of hybrids on the same three quantitative traits within C. moshata Dechesne.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details
1 Dep. of Agronomy, National Taiwan Univ., Taipei, Taiwan
2 Known‐You Seed Co., Ltd., Taiwan