Genomic prediction models for grain yield of spring bread wheat in diverse agro-ecological zones
OPEN
R
C. Saint Pierre,*, J. Burgueo,*, J. Crossa,*, G. Fuentes Dvila, P. Figueroa Lpez,E. Sols Moya, J. Ireta Moreno, V. M. Hernndez Muela, V. M. Zamora Villa, P.Vikram,K. Mathews, C. Sansaloni, D. Sehgal, D. Jarquin, P.Wenzl & Sukhwinder Singh
Genomic and pedigree predictions for grain yield and agronomic traits were carried out using high by several environmental co-variables. Seven statistical models were tested using two random cross-validations schemes. Two other prediction problems were studied, namely predicting the lines performance at one site with another (pairwise-site) and at untested sites (leave-one-site-out). Grain across sites. The best predictability was observed when genotypic and pedigree data were included in the models and their interaction with sites and the environmental co-variables. The leave-one-site-out increased average prediction accuracy over pairwise-site for all predictions had high heritability and gave the highest accuracy for prediction models. Genomic and pedigree models coupled with environmental co-variables gave high prediction accuracy due to high genetic correlation between sites. This study provides an example of model prediction considering climate data along-with genomic and pedigree information. Such comprehensive models can be used to achieve rapid enhancement of wheat yield enhancement in current and future climate change scenario.
Global wheat production is currently close to 700 million tons1, and the demand for wheat in developing countries is projected to increase 60% by 20502. Wheat grain yield is a complex trait that depends on multiple genes interacting with each other and the environment3,4. Although the eects of major genes regulating plant phenology and morphology and their inuence on grain yield have been previously described5, quantitative trait loci (QTLs) for grain yield have had limited practical applications in breeding programs due to the small genetic variance accounted for by individual QTLs, the variation across environments4, and the inuence of the genetic backgrounds.
Recent advances in sequencing technologies have enabled the generation of high throughput, fast, and relatively inexpensive genotypic information; thereby facilitating the implementation of genomic prediction and genomic selection in plant and animal breeding6. Incorporation of genomic information through prediction models provides an alternative approach to indirect selection in breeding for crop varieties. Given that plant breeding programs started to incorporate genomic information, parametric linear regression and non-parametric models have emerged as preferred methods7,8. However, the genetic instruction from genes translates into the full set of phenotypic traits and ultimately into grain yield components is aected by numerous interactions among pathways and the environment. Genotype by environment interactions (G E) can reduce trait heritability and the ability to statistically predict superior genotypes under contrasting environments9,10. For this reason, collecting phenotypic data from dierent environments continues to be a powerful predictor of important biological outcomes such as grain yield11. Although dierent genomic technologies are being utilized to breed suitable varieties, genomic selection provides the option of considering multiple variables simultaneously for predicting genetic yield potential10.
*
SCIENTIFIC REPORTS
1
www.nature.com/scientificreports/
Sites Celaya Delicias Tepatitln Cd. Obregn Zaragoza
Grain yield/Plant height
Celaya 3.43/147 0.899 0.937 0.843 0.859 Delicias 0.514 2.39/70 0.849 0.837 0.930 Tepatitln 0.735 0.389 2.43/121 0.857 0.827 Cd. Obregn 0.849 0.444 0.697 3.29/93 0.797 Zaragoza 0.832 0.446 0.794 0.829 0.48/123 Days to maturity/Days to anthesis
Celaya 46.5/37.5 0.851 0.875 0.875 0.834 Delicias 0.876 25.6/32.6 0.894 0.949 0.928 Tepatitln 0.874 0.838 11.7/69.8 0.952 0.898 Cd. Obregn 0.881 0.846 0.846 68.7/74.9 0.921 Zaragoza 0.750 0.729 0.738 0.994 12.4/48.8
Table 1. Genetic variances and correlations between sites for grain yield, plant height, days to anthesis and maturity. For grain yield and plant height the diagonal shows the variance components (./.), the lower diagonal the correlation between sites for grain yield, and the upper diagonal the correlation between sites for plant height. For days to maturity-days to anthesis the diagonal shows the variance components (./.), the lower diagonal the correlation between sites for days to maturity and the upper diagonal the correlation between sites for days to anthesis.
Pedigree information accounts for the proportion of predictive ability related to dierences in families and increases prediction accuracy when used together with marker information (that accounts for Mendelian sampling) in genomic selection models12. Burgueo et al.9 demonstrated the superiority of pedigree plus genomic models over pedigree or genomic-based predictions alone when incorporating G E in the genomic regression model. Jarquin et al.13 proposed a model that can use not only genomic information but also pedigree and environmental information for the prediction of unobserved genotypes. Data from multi-environment trials can also be used for predicting climate change scenarios and selecting suitable sites for evaluating promising germplasm. Including environmental covariables in genomic selection prediction models is expected to result in less biased estimation of eects, higher prediction accuracy, better precision and power, and increased heritability to explain grain yield variation14. This information facilitate selection of promising germplasm for use in crop breeding aimed at both population improvement and cultivar release.
Cross-validation schemes are used in genomic prediction studies to estimate the accuracy with which predictions can be made for dierent traits and environments9,1522. There are two basic cross-validation schemes used in genome-enabled prediction: (1) predicting the performance of certain proportion of lines that have not been evaluated in any of the observed environments (CV1), and (2) predicting the performance of a proportion of lines that have been evaluated in some environments, but not in others, also called sparse testing (CV2). Another prediction problem that does not involve random cross-validation is predicting one environment using another environment (pairwise environment). The fourth prediction problem consists of predicting one environment (i.e., site-year combination) that was not included in the usual set of testing environments in the evaluation system (leave-one-environment-out); the only available information on this untested environment could be certain characteristics that would have been previously collected such as soil type, altitude, longitude, maximum and minimum temperature, precipitation during other cropping cycles, etc. It is expected that predicting the performance of untested lines can be conducted with sufficient accuracy when there is knowledge about their relationships (pedigree relationship or genomic relationship). Similarly, the performance of lines in unobserved environments could be predicted if there is information about the environmental conditions17. The accuracy of predicting performance in unobserved environments would however be related to our ability to select the most appropriate environmental variables for inclusion in the prediction model. To date, this would be the rst study assessing the prediction problems when leaving-one-environment-out with real environmental data.
In light of the facts mentioned above, the following objectives of the present study were framed: 1) to investigate the stability performance of wheat lines across a set of 5 Mexican environments; 2) to evaluate genomic prediction with high density genotype-by-sequencing (DArTseq) markers for agronomic traits and grain yield using dierent combinations for the eects of lines (L), sites (E), genomic data (G), pedigree data (A), and environmental covariables (W) and their interactions; and 3) to test a new problem that arises when predicting the performance of wheat lines in environments that have not been previously used (untested environments) where the only available information from them is their climate data.
Results
Genetic variance of site and genotypic correlation between sites. High genetic correlations were observed among sites for days to anthesis, days to maturity, plant height, and for grain yield at most of the pairwise sites (Table1). Broad sense heritability for plant height, days to anthesis, and maturity in all the environments, was relatively higher than that of grain yield, except in Celaya (data not presented).
Phenotypic variability of the traits measured across sites. Sites represented dierent wheat growing conditions in Mexico, from 39 meters masl (Cd. Obregn) to 1,930 masl (Tepatitln) and dierences in latitude of 8 degrees (Fig.1, Supplementary Table 1). Average minimum, mean and maximum temperatures during
SCIENTIFIC REPORTS
2
www.nature.com/scientificreports/
Figure 1. Geographic distribution of the ve sites where the eld trials were conducted . Map was constructed using ESRIs ArcGIS Desktop ArcMap 10.2.2 soware (26).
Figure 2. Environmental co-variables used to build the W matrix, including i) Average minimum temperature at vegetative stage, ii) Average minimum temperature during crop cycle, iii) Average maximum temperature during grain lling, iv) Average maximum temperature during crop cycle, and v) Average temperature during crop cycle. Sites refer to Celaya; Delicias; Tepatitln; Ciudad Obregn and Zaragoza.
dierent critical phases of the crop cycle are presented in Fig.2. Celaya was the warmest site during the rst stage of the crop and one of the coldest during grain lling. Conversely, Delicias was the coldest site during the vegetative stage and the warmest site during the grain lling period. Mean grain yield varied signicantly across environments, ranging from 3.7 t ha1 at Zaragoza to a maximum of 9 t ha1 at Tepatitln. For grain yield, a combined analysis of genetic correlations and genetic variances revealed positive genetic correlations between sites, as shown in the bi-plot (Fig.3). Figure3 revealed clustering of sites into one group. Similarly, positive genetic correlations were illustrated by the vectors from the origin of graph to the sites (letters), when ranking genotypes on grain yield. The separation of the sites (see letters in the graph) from the origin (center of the graph) is an indicator of the higher heritability values for those sites and thus, a measure of the sites eective discriminating
SCIENTIFIC REPORTS
3
www.nature.com/scientificreports/
Figure 3. Bi-plot for grain yield. The red codes represent each of the sites. The points represent each of the lines. The lines closest to the end of the site vector are the best performing lines for that specic site.
Figure 4. Bar plot of the prediction accuracy for all pairwise-site arranged by their genetic correlations for grain yield. The heritability for each site within the pair is represented by a circle. For each pair-wise the prediction is direct and reciprocal. Abbreviations: CE: Celaya; DE: Delicias; TE: Tepatitln; OB: Cd. Obregn; ZA: Zaragoza.
power. Celaya and Cd. Obregn showed the maximum separation from the origin and, were therefore, the most eective sites for identifying genetic dierences between genotypes. It is important to note that the temperature regimes in Celaya and Cd. Obregn were similar during anthesis, as shown in Fig.2.
Genotypes do not show any clear pattern in the bi-plot for grain yield (Fig.3). Most of the genotypes are located in a cloud at a value of zero and right in the rst dimension and a le tail of genotypes. Similar results were found for the other traits analyzed (bi-plot not presented).
For grain yield the sites were intermediately to highly genetically correlated (0.40.85) (Table1). The correlations between the pair of sites are related to the prediction accuracy for each pairwise-site correlation as depicted in Fig.4. For example for the pair of sites with the highest correlation (0.85), Celaya predicts Cd. Obregn well but
SCIENTIFIC REPORTS
4
www.nature.com/scientificreports/
Genomic model
L+E+W E+W+G E+W+G+GE E+W+A E+W+A+AE E+W+G+A E+W+G+A+GE+AE CV1
Celaya
0.065 0.130 0.131 0.190 0.189 0.189 0.199(0.035) (0.011) (0.023) (0.010) (0.017) (0.009) (0.019)
0.046 0.260 0.291 0.299 0.258 0.319 0.311(0.044) (0.008) (0.012) (0.007) (0.015) (0.007) (0.014)
0.060 0.444 0.481 0.523 0.571 0.538 0.581(0.040) (0.012) (0.010) (0.009) (0.010) (0.009) (0.010)
0.020 0.198 0.228 0.221 0.209 0.238 0.252(0.035) (0.009) (0.015) (0.009) (0.018) (0.007) (0.016)
Site
CV2
Celaya 0.397 0.461 0.433 0.491 0.478 0.484 0.466
(0.004) (0.004) (0.009) (0.005) (0.008) (0.004) (0.008)
Delicias 0.221 0.218 0.214 0.244 0.248 0.241 0.249
(0.003) (0.004) (0.018) (0.005) (0.015) (0.004) (0.018)
Tepatitln 0.324 0.352 0.384 0.350 0.318 0.364 0.367
(0.003) (0.005) (0.010) (0.005) (0.013) (0.004) (0.012)
Cd. Obregn 0.423 0.517 0.536 0.563 0.604 0.558 0.613
(0.007) (0.007) (0.009) (0.006) (0.006) (0.006) (0.006)
Zaragoza 0.270 0.279 0.303 0.271 0.267 0.286 0.282
(0.001) (0.004) (0.007) (0.004) (0.008) (0.003) (0.008)
Table 2. Average correlation (and standard deviation in parenthesis) between observed and predicted values for two Random cross-validation schemes for trait grain yield (YLD) in ve sites in Mexico; L=line, E=site, G=genotypic information, A=pedigree, W=environmental variables, and interactions. Cross-Validation CV1: predictions when a proportion of lines are not included in any of the 5 sites; Cross-Validation CV2: predictions when a proportion of lines are removed from some of the sites and le in others.
Cd. Obregn predicts Celaya with slightly less accuracy. Furthermore, Cd. Obregn and Zaragoza had a genetic correlation of 0.829 and Zaragoza predicts Cd. Obregn well, but not vice versa.
Genomic prediction analysis for grain yield and phenology. Among the seven tested prediction models for grain yield, the models E+W+G+A and E+W+G+A+GE+AE performed better than other models in cross validation schemes CV1 and CV2, respectively (Table2). The highest correlation value in CV1 and CV2 was obtained in Cd. Obregon, followed by Celaya. Though not absolutely, these two models (E+W+G+A and E+W+G+A+GE+AE) performed better than other models for highly heritable traits, i.e., days to anthesis and days to maturity and plant height (Supplementary Table 2). In terms of sites, comparatively better predictions were observed when Celaya was used as a training set to predict Cd. Obregn (Table3). Results clearly revealed that aer including the G (genomic data) or A (pedigree information) matrix in the model, prediction ability increased.
Pairwise-site prediction accuracy for grain yield is shown in Table3 with a noticeable increase at most the sites in the prediction accuracy of models E+W+A and E+W+G+A over the other two models. Model E+W+ A was the best model when Celaya, Delicias and Tepatitln were used as training sets, while model E+W+A+ G was better when the training sets were Cd. Obregn and Zaragoza. Celaya and Cd. Obregn were always the best predicted sites. Compared to grain yield (Table3), higher pairwise site predictions were observed for days to anthesis (Supplementary Table 3), days to maturity (Supplementary Table 4), and plant height (Supplementary Table 5). Accuracy of the prediction models values was higher than 0.54 for plant height and days to anthesis, whereas correlations ranged from 0.548 to 0.777 and from 0.613 to 0.749, respectively.
Grain yield predictions in untested environments (leave-one-out, Table4) were performed using site information, environmental variables, pedigree, genotypic data, and pedigree by site and genomic by site interactions (E+W+A, E+W+A+AE, E+W+G+A, and E+W+G+A+GE+AE). Interestingly, leave-one-out accuracy overcomes pairwise-site accuracy indicating that four sites predict better one site than the pairwise-site comparison. Traits with higher heritability, as days to anthesis, maturity and plant height, were the ones best predicted by the leave-one-out (Supplementary Table 6). Among the seven tested models better results were obtained when predicting Celaya and Cd. Obregn for models E+W+G+A, E+W+A+AE, and E+W+G+A+GE+AE.
Average accuracy of including information from four sites (leave-one-site-out) increased from 0.66 to 0.76, 0.70 to 0.78, 0.41 to 0.57, and 0.27 to 0.36 for plant height, days to anthesis, days to maturity, and grain yield, respectively, relative to pair-wise comparisons (comparison of average values, Supplementary Tables 36, and Tables3 and 4). Modelling the interactions in E+W+G+GE, and E+W+A+AE did not increase the prediction
0.062 0.369 0.364 0.429 0.427 0.432 0.419(0.043) (0.012) (0.013) (0.009) (0.010) (0.010) (0.011)
Delicias
Tepatitln
Cd. Obregn
Zaragoza
SCIENTIFIC REPORTS
5
www.nature.com/scientificreports/
Genomic
Model Testing site
L+E+W
E+W+G
E+W+A
Celaya 0.286 0.377 0.449 0.310 Delicias 0.203 0.203 0.180 0.174 Tepatitln 0.324 0.267 0.290 0.276 Cd. Obregn 0.515 0.285 0.404 0.440
Zaragoza 0.206 0.177 0.228 0.256
Table 3. Pair-wise correlation between the observed and predicted values for grain yield for four models; L=line, E=site, G=genotypic information, A=pedigree, W=environmental variables. Values from one site (training site) were used to predict a second site (testing site).
Genomic model
L+E+W E+W+G E+W+G+GE E+W+A E+W+A+AE E+W+G+A E+W+G+A+GE+AE Celaya 0.403 0.459 0.440 0.488 0.466 0.480 0.435 Delicias 0.225 0.212 0.224 0.235 0.240 0.231 0.228 Tepatitln 0.327 0.341 0.350 0.352 0.354 0.358 0.358Cd. Obregn 0.432 0.487 0.438 0.513 0.497 0.511 0.516 Zaragoza 0.272 0.277 0.283 0.271 0.270 0.280 0.286
Table 4. Correlation between the observed and predictive values of the leave-one-site-out prediction problem (prediction of one site when all the other sites are used in the model); L=line, E=site, G=genotypic information, A=pedigree, W=environmental variables and the interactions.
accuracy, whereas the main eect model E+W+G+E and the complete interaction model E+W+G+A+GE+ AE increased the prediction of Celaya and Cd. Obregn for days to anthesis and maturity.
Discussion
The identication of wheat genotypes with stable performance in diverse environments is a challenge for breeders, especially in countries where wheat can be grown in diverse agro-ecological zones with high soil diversity and various patterns of precipitation and temperature. In this study, performance of diverse wheat lines was screened at multiple sites, encouraging local breeders to evaluate diverse germplasm in their environments and with their best management practices. By growing the lines in dierent environments, we expected to include in predictive model the environmental factors inuencing the yield ranking of cultivars from site to site. The trait and site analysis are the important pre-requisites for determining the performance of genotypes across environments. In this investigation, all tested sites were positively correlated, i.e. in the same area of the bi-plot (Fig.3). Also, most of genotypes were grouped in the center of the bi-plot, indicating for their similar response across the sites.
As expected, screening ability was highest for sites with no major prevailing abiotic and biotic stresses23.
Celaya and Cd. Obregn had the highest capacity for discriminating performance by genotype, and thus, ideal for the selection of superior lines. Cd. Obregn, a temperate high-radiation irrigated environment, and one of the CIMMYTs principal test sites, has been identied as one of the most suitable environments for screening under optimal conditions and for simulating dierent environmental stresses (e.g. drought, heat). It was interesting to note that sites Celaya and Cd. Obregn which showed the maximum separation from origin, resembled temperature regimes during grain lling (Fig.2). This contributes to a comparable heritability pattern in these two sites for traits days to maturity, days to owering, and plant height.
Training site
Celaya Delicias Tepatitln Cd. Obregn Zaragoza
Celaya 0.159 0.267 0.376 0.148 Delicias 0.157 0.126 0.194 0.103 Tepatitln 0.267 0.126 0.261 0.208 Cd. Obregn 0.377 0.193 0.262 0.247
Zaragoza 0.148 0.105 0.211 0.245
Celaya 0.273 0.346 0.422 0.288 Delicias 0.183 0.164 0.166 0.136 Tepatitln 0.305 0.230 0.280 0.252 Cd. Obregn 0.480 0.267 0.373 0.398
Zaragoza 0.201 0.153 0.211 0.248
Celaya 0.293 0.404 0.450 0.295 Delicias 0.209 0.255 0.176 0.190 Tepatitln 0.325 0.281 0.277 0.255 Cd. Obregn 0.529 0.294 0.422 0.428
Zaragoza 0.206 0.179 0.209 0.249
E+W+G+A
Sites
SCIENTIFIC REPORTS
6
www.nature.com/scientificreports/
Genomic predictions have been performed in wheat for agronomically relevant traits24 with aim to accelerate genetic gains. High quality predictions with high accuracy for genomic selection programs can be expected at the sites with the highest heritability (Celaya and Cd. Obregn). This is particularly important, considering that investments in high quality phenotyping are needed to fully utilize its potential to complement genome sequencing as a route to rapid advances in breeding. However, the increasing temperatures witnessed over the past decade have been identied as one of the limiting factors that signicantly reduce wheat production in this area of Mexico (Celaya and Cd. Obregn). Lobell et al.25 reported 712% yield losses for northwest Mexico for each degree Celsius rise in temperature. An integrated approach combining the latest genomics resources with physiological research26 would be needed to understand complex quantitative traits like grain yield under the environmental constraints resulting from climate change. Environment descriptors are easily available nowadays, increasing the opportunities for using multiple sources of information and variables of dierent nature to improve the model. However, it is reasonable to use biologically relevant covariables, related to specic plant functions. In a nutshell, genomic selection for grain yield would be more eective for sites that are showing high heritability/ repeatability and are less aected by biotic/abiotic stresses. Environmental variables can play an important role in determining success of the prediction models. In this study we report the rst attempt to predict performance of genotypes in unobserved environments by modeling, thereby incorporating the environment eect in prediction.
Results showed that the prediction models that simultaneously included site (E), genomic and pedigree (G, A), and environmental data (W) consistently gave higher predictions for both CV1 and CV2, pairwise-site, and leave-one-site-out. This study indicates that accounting for environment data increases the predictive ability of the model using random cross-validation. This conclusion concurs with the ndings of Jarquin et al.13 in wheat trials and of Crossa et al.27, where increases in prediction accuracies were achieved by including dense molecular markers and G E in a set of Mexican and Iranian landraces. Our study therefore provides a proof of concept that incorporating environmental variables in prediction models enhances their power ultimately making them more suitable and practical for climate resilient wheat improvement. A systematic robust analysis involving wheat mega-environments (other than Mexico) will ensure a wide spread application of this comprehensive research approach.
Predicting the performance of lines that have never been evaluated in the eld (CV1) was more challenging than predicting the performance of lines that were evaluated in dierent environments (CV2). In this study, prediction accuracy from CV2 was higher than those obtained in CV1, indicating the contribution of the information from correlated environments when predicting yield performance (Table2). In addition to these prediction problems, this study evaluated the predictions for dierent traits in untested environments concluding that environments where no genotypes were previously evaluated can still be predicted with good accuracy. However, environmental covariables from the untested environments are required and positive correlation between environments is still an important factor for achieving good prediction accuracy of unobserved environments. In a recent article, Jarquin et al.28, optimized training sets for genomic prediction of soybean accessions using independent validation trials such as leave-one-site-out with no environmental covariables; the authors show high prediction accuracy for % protein and grain yield.
Overall, results suggested that eorts on genomic selection for grain yield must include interdisciplinary teams and collaborative projects, with cross-validation protocols helping to test the potential accuracy of predictions. Simultaneously, the selection of appropriate sites for screening germplasm need to be decided appropriately when applying genomic selection in germplasm enhancement programs for fast track-efficient-precision breeding.
Materials and Methods
Plant material. A set of 803 spring wheat lines (Triticum aestivum L.) was selected from various sources, including CIMMYT International Nurseries (elite germplasm) and the Generation Challenge Program spring wheat reference set, a panel including diverse accessions with potential for favorable allele mining.
Climatic data. The study was conducted under optimal conditions at ve dierent environments (i.e. ve sites, Fig.1) in Mexico during 201112. Map in Fig.1 was constructed using ESRIs ArcGIS Desktop ArcMap 10.2.2 soware2729. The list of sites, coordinates for each site (latitude, longitude, and altitude), wheat cycle data (sowing and harvesting date), and meteorological data from the nearest meteorological station (including average, maximum and minimum temperature) are shown in Supplementary Table 1. Sites covered a wide range of environmental conditions in Mexico: altitude ranged from 39 to 1930 masl and latitude ranged from 2028 degrees N. The average temperature during the season was 19.7C; minimum temperature was 11.1C and maximum temperature was 28.7C.
Planting dates varied from Nov 2011 to Jan 2012. All trials were gown under fully irrigated conditions with adequate pest control. Manual and/or chemical weed control was also applied as required. Seeds were sown in two row plots of length 1.0m and width 0.8m, with 0.2m between rows. Seeding rate was approximately 150grams m2.
A partially replicated experimental design (p -rep) in augmented blocks was used, where 81% of the accessions were repeated once, 15% were repeated twice, 4% were repeated three or more times, and 6% of the plots were used with checks.
Phenotypic trait evaluation. Measurements were taken according to the protocols detailed in Pask et al.30. Days to anthesis was recorded as the number of days from planting until >50% of the spikes in each plot had completely emerged above the ag leaves and owering had begun in the middle of the head. Days to maturity was similarly recorded as the number of days from planting until 50% of the peduncles in each plot had turned yellow. Plant height was the distance from the soil surface to the tip of the spike (excluding awns), taken as the average of three values for each plot in the eld. Grain yield was the total weight of seed in each plot, divided by the plot area and expressed as t ha1.
SCIENTIFIC REPORTS
7
www.nature.com/scientificreports/
Genotypic characterization. Genomic DNA was extracted from fresh leaves using a modified cetyltrimethyl-ammonium bromide method31. DNA quality and concentration were determined by electrophoresis in 1% agarose gel. A high-throughput genotyping method using DArT-SeqTM technology32,33 was employed to generate genomic proles of the population presented in this study. A complexity reduction method including two enzymes (PstI and HpaII) was used to create a genome representation of the set of samples3234. PstI-RE site
specic adapter was tagged with 96 dierent barcodes enabling multiplexing a 96-well microtiter plate with equi-molar amounts of amplication products in order to run within a single lane on Illumina HiSeq2500 instrument (Illumina Inc., San Diego, CA). The successfully amplied fragments were sequenced up to 77 bases, generating approximately 500,000 unique reads per sample. Thereaer the FASTQ les (full reads of 77 bp) were quality ltered using a Phred quality score of 30, which represent a 90% of base call accuracy for at least 50% of the bases. More stringent ltering was also performed on barcode sequences using a Phred quality score of 10, which represents 99.9% of base call accuracy for at least 75% of the bases. A proprietary analytical pipeline developed by DArT P/L was used to generate allele calls for SNP and presence/absence variation (PAV) markers. Then, a set of ltering parameter was applied to select high quality markers for this specic study. One of the most important parameters is the average reproducibility of markers in technical replicates for a subset of samples which was set at 99.5%. Another critical quality parameter is call rate. This is the percentage of targets that could be scored as 0 or 1, the threshold was set at 50%.
Data analysis
Analysis of phenotypic data and GE interaction. Individual analysis of sites was performed using a mixed linear model in order to obtain the best linear unbiased prediction (BLUP) and trait heritability. Group eects were determined by the entries classied as checks versus accessions. Components of variance were also estimated. Days to heading, days to maturity, plant height, and grain yield were analyzed using a mixed linear model in ve environments.
The linear mixed model for the combined analyses is:
= + + +
Z Z
Y X (1)
1 2
where Y is the vector of response variable, X is the incidence matrix of xed eects (sites), is the vector of eects of environments, Z1 is the incidence matrix of random eects of block nested in sites, is the vector of eects of blocks nested in sites, Z2 is the incidence matrix of random eects of genotype nested in sites, is the vector of eects of genotype by site interaction and is the experimental error
^ 0 I
N( , ) (2)
d
2
N( , ) (3)
^ 0 I
N( , ) (4)
= +
G
( ) (5)
where is the loading matrix of s (number of sites) rows by number of factors (2) columns, is a diagonal matrix containing site specic variances and G is the matrix of relationships between genotypes obtained from the marker matrix. This model is known as the factor analytical model. It is able to model the environmental component of the G E interaction in a suitable way to interpret it and borrowing information between correlated sites. Inclusion of the G matrix, produces more reliable results with lower standard error of the BLUPs. Three checks were included, their eects and the eects of the accessions were estimated separately as well as their interactions with the sites.
Predictive Statistical Models. The models considered dierent combinations for the eects of lines (L), sites (E), genotypic data (G), pedigree (A), and environmental variables (W). Further details of the models outlined below can be found in Jarquin et al.13. We initially described the baseline model and then seven reaction norm models using pedigree and genomic relation matrices as well as environmental covariates.
Baseline model. The phenotypic response variable (yijk) is described as the sum of an overall mean () plus random deviations due to the environment Ei (i=1, I) and the line Lj (j=1, I), plus an error term ijk(k=1, rij). The linear mixed eects models is
= + + +
y E L (6)
ijk i j ijk
2 2 and
ID 2 and N(.,.) denotes a normal density and IID stands for independent and identically distributed.
+E+W). Environmental co-variables (EC) are introduced in the baseline model. We add in equation [6] a random regression on the ECs (W) that describes the environmental conditions faced by each line in each environment, that is:
= =
E N(0, )
ijk
1 , where Wijq is the value of the qth EC evaluated in the ij environment line combination, q is the main eect of the corresponding EC, and Q is the total number of EC. We
^ 0
2
where
E L
N(0, ), N(0, )
i
ID
ID
E j
L
w W
ij q
Q
ijq q
SCIENTIFIC REPORTS
8
www.nature.com/scientificreports/
regarded the eects of the ECs as IID draws from normal densities, that is:
q
= + + + +
y E L w (7)
ijk i j ij ijk
+W+G). When markers are available we replace in equation [7] the random eect of the line (Lj) with a regression on marker covariates of the form: = =
p
g x b ,
j m
jm m
2 , (m=1, ,p).
The vector g = Xb containing the genomic values of all the lines follows a multivariate normal density with null mean and covariance-matrix
b N(0, )
m
b
2, where G is a genomic relationship matrix whose entries are given by G=(XX)/(p)(Van Raden, 2008). Thus, we have the standard GBLUP model plus the random environmental eect (Ei) and the eects of the EC (wij):
Cov g G
( ) g
= + + + +
y E g w (8)
ijk i j ij ijk
Note that the eects of the level of the random eects g=(g1, .., gJ) are correlated according to the o-diagonal values of G. There is thus the potential to borrow information across lines allowing, for example, prediction of the performance of lines that have not yet been evaluated in any eld trial.
+W+G+GE). Adding to model 2 the interaction between genomic (markers) and environments we developed model 3. Jarqun et al.13 showed that, under standard assumptions, the covariance structure induced by the reaction-norm model is the Hadamard (cell-by-cell) product of two (co)variance structures one describing the relationships between lines based on genetic information, e.g., G, and one describing environmental eects (Ei). We extended the model in equation [8] by adding a new random eect representing interactions between the genomic and the environmental eects, gE such that
Z Z Z
gE 0 G Z
N( , [ ]o[ ] )
g g E
E gE
2 where o stands for the
Hadamard product and gE2 is the genomic environment interaction parameter. Then the model becomes:
= + + + +
y E g gE (9)
ijk i j ij ijk
+W+A). A modication of model 2 is to incorporate pedigree information using the additive relationship matrix A (aj). The model becomes:
= + + + +
y E a w (10)
ijk i j ij ijk
The vector a = (a1, .., aJ) contains the additive random eect of the lines and it is assumed to have a normal distribution
where ostands for the Hadamard product and aE2 is the pedigree environment interaction parameter. Then the model becomes:
= + + + +
y E a aE (11)
ijk i j ij ijk
+W+G+A). This random linear model has only main eects environments, C, genomic and pedigree.
= + + + + +
y E w g a (12)
ijk i ij j j ijk
+W+G+A+GE+AE). This model has the four main eects (Ei, wij, gj, and aj) and the two possible interactions (gEij and aEij)
= + + + + + +
y E a g aE gE (13)
ijk i j j ij ij ijk
Following Burgueo et al.9, we initially considered two distinct prediction problems by cross-validation 1 (CV1) and cross-validation 2 (CV2). Cross-validation CV1 measures the ability of models to predict the performance of a subset of lines that have not yet been evaluated in any of the environments included in the multi-environment trials. CV2 measured the ability of models to predict the performance of lines using data collected in sparse environments. In
IID 2 . Therefore, the vector w=W follows a multivariate normal density with null mean and a covariance matrix proportional to WW. This covariance structure describes the similarity between environmental conditions.
Therefore, when the eects of the EC are added to equation [6] the model becomes
N(0, )
with
N( , )
w
w 0
2
.
1 where gj represents an approximation of the true genetic value of the jth line, xjm is the genotype of the jth line at the mth marker, and bm is the eect of the mth marker. We regarded marker eects as IID draws from normal distributions of the form
ID
=
with
g G
with N(0, )
g
2 .
2 where a2 is an additive variance parameter.
+W+A+AE). Similar to model 2 but incorporating the random interaction eects between the pedigree of the lines (aj) and the eect of the environments (Ei), with aE such that
2
Z Z Z
aE 0 A Z
N( , [ ]o[ ] )
g g E
E aE
a 0 A
N( , )
a
SCIENTIFIC REPORTS
9
www.nature.com/scientificreports/
CV1 we randomly assigned lines to folds, thus ensuring that all the records of a given line were assigned to the same fold. In CV2 we randomly assigned individual plot records to folds; with this setting individual records of a given line are potentially assigned to dierent folds. The size of the training-testing sets for the two random cross-validations was of 8020%. For CV1, 20% of the lines (around 160 wheat lines) were not observed in any of the 5 Mexican sites and for CV2, some of the 20% of the lines were observed in some sites but not in the others.
Another prediction problem studied was the direct prediction of one site using another site (pairwise-site) for all pair of sites. A new prediction problem was studied and denoted as leave-one-site-out; this was added to explain the ability of the model to predict the performance of wheat lines in environments that were not used in the training and where the only available information from them is the collected climatic data. The leave-one-site-out diered from the pairwise-site because four environments were used to predict another one.
References
1. Food and Agriculture Organization, FAOSTAT: Statistical Databases Agriculture Data, URL: http://faostat.fao.org/site/291/default. aspx (2010). (Accessed: 20 April 2015).
2. Alexandratos, N. & Bruinsma, J. World agriculture towards 2030/2050: the 2012 revision. ESA Working paper No. 12-03. Rome, FAO. URL: http://www.fao.org/docrep/016/ap106e/ap106e.pdf (2012). (Accessed: 20 April 2015).
3. Wu, X., Chang, X. & Jing, R. Genetic insight into yield-associated traits of wheat grown in multiple rain-fed environments. PLoS ONE 7, e31249 (2012).
4. Bonneau, J. et al. Multi-environment analysis and improved mapping of a yield-related QTL on chromosome 3B of wheat. Theor. Appl. Genet. 126, 747761 (2013).
5. Reynolds, M. et al. Raising yield potential in wheat. J. Exp. Bot. 60, 18991918 (2009).6. Meuwissen, T. H., Hayes, B. J. & Goddard, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 18191829 (2001).
7. Massman, J. M., Jung, H. J. G. & Bernardo, R. Genome wide selection versus marker-assisted recurrent selection to improve grain yield and stover-quality traits for cellulosic ethanol in maize. Crop Sci. 53, 5866 (2013).
8. de los Campos, G., Prez, P., Vazquez, A. I. & Crossa, J. Genome-Enabled Prediction Using the BLR (Bayesian Linear Regression) R-Package. Genome-Wide Association Studies and Genomic Prediction, Methods in Molecular Biology Series Vol. 1019 (eds C. Gondro, J. van der Werf & B. Hayes) Ch. 12, 299320 (Humana Press 2013).
9. Burgueo, J., de los Campos, G. D. L., Weigel, K. & Crossa, J. Genomic prediction of breeding values when modeling genotypeenvironment interaction using pedigree and dense molecular markers. Crop Sci. 52, 707719 (2012).
10. Bassi, F. M., Bentley, A. R., Charmet, G., Ortiz, R. & Crossa, J. Breeding schemes for the implementation of genomic selection in wheat (Triticum spp.). Plant Sci. 242, 2336 (2016).
11. Houle, D., Govindaraju, D. R. & Omholt, S. Phenomics: the next challenge. Nat. Rev. Genet. 11, 855866 (2010).12. Crossa, J. et al. Genomic prediction in CIMMYT maize and wheat breeding programs. Heredity 112, 4860 (2014).13. Jarquin, D. et al. A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theor. Appl. Genet. 127, 595607 (2014).
14. Brachi, B., Morris, G. P. & Borevitz, J. O. Genome-wide association studies in plants: the missing heritability is in the eld. Genome Biol. 12, 232 (2011).
15. de los, Campos et al. Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182, 375385 (2009).
16. de los Campos, G., Gianola, D., Rosa, G. J. M., Weigel, K. & Crossa, J. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet. Res. 92, 295308 (2010).
17. Crossa, J. et al. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 186, 713724 (2010).
18. Crossa, J. et al. Genomic selection and prediction in plant breeding. J. Crop Improv. 25, 239261 (2011)19. Gonzlez-Camacho, J. M. et al. Genome-enabled prediction of genetic values using radial basis function. Theor. Appl. Genet. 125, 759771 (2012).
20. Heslot, N., Yang, H. P., Sorrells, M. E. & Jannink, J. L. Genomic selection in plant breeding: A comparison of models. Crop Sci. 52, 146160 (2012).
21. Prez-Rodrguez, P. et al. Comparison between linear and non-parametric models for genome-enabled prediction in wheat. G3-Genes Genom. Genet. 2, 15951605 (2012).
22. Lpez-Cruz, M. et al. Increased prediction accuracy in wheat breeding trials using a marker environment interaction genomic selection model. G3-Genes Genom. Genet. 5, 569582 (2015).
23. Braun, H. J., Pfeier, W. H. & Pollmer, W. G. Environments for selecting widely adapted spring wheat. Crop Sci. 32, 14201427 (1992).
24. Poland, J. et al. Genomic selection in wheat breeding using genotyping-by-sequencing. Plant Genome 5, 103113 (2012).25. Lobell, D. B. et al. Prioritizing climate change adaptation needs for food security in 2030. Science 319, 607610 (2008).26. Farooq, M., Bramley, H., Palta, J. A. & Siddique, K. H. M. Heat stress in wheat during reproductive and grain-lling phases. Crit. Rev. Plant Sci. 30, 117 (2011).
27. Crossa, J., Jarquin, D., Franco, J., Prez-Rodrguez, P., Burgueo, J., Saint-Pierre, C., Vikram, P., Sansaloni, C., Petroli, C., Akdemir, D., Sneller, C., Reynolds, M., Tattaris, M., Payne, T., Guzman, C., Pea, R. J., Wenzl, P. & Singh, S. 21016. Genomic prediction of gene bank wheat landraces. G3: Genes Genomes Genetics, doi: 10.1534/g3.116.029637 (2016).
28. Jarquin, D., Specht, J. & Lorenz, A. Prospects of genomic prediction in the USDA Soybean Germplasm Collection: Historical data creates robust models for enhancing selection of accessions. G3-Genes Genom. Genet, http://dx.doi.org/10.1101/055038 (2016).
29. ESRI, ArcGIS Desktop Help 10.2.2. URL: http://resources.arcgis.com/en/help/(2015). (Accessed: 20 April 2015).30. Pask, A. J. D., Pietragalla, J., Mullan, D. & Reynolds, M. P. Physiological breeding II: a field guide to wheat phenotyping. eds International Maize and Wheat Improvement Center (CIMMYT), Mexico DF URL: http://repository.cimmyt.org/xmlui/bitstream/ handle/10883/1288/96144.pdf (2012). (Accessed: 20 April 2015)
31. Hoisington, D., Khairallah, M. & Gonzalez-de-Leon, D. Laboratory protocols, CIMMYT Applied Molecular Genetics Laboratory, 2nd ed. CIMMYT, Mexico, DF URL: http://repository.cimmyt.org/xmlui/bitstream/handle/10883/1333/91195.pdf (1994). (Accessed: 20 April 2015).
32. Li, H. et al. A high density GBS map of bread wheat and its application for dissecting complex disease resistance traits. BMC Genomics 16, 216 (2015).
33. Sansaloni, C. et al. Diversity Arrays Technology (DArT) and next-generation sequencing combined: genome-wide, high throughput, highly informative genotyping for molecular breeding of Eucalyptus. BMC Proc. 5, P54 (2011).
34. Vikram, P. et al. Unlocking the genetic diversity of Creole wheats. Nat. Sci. Rep. 6, 23092 (2016).
SCIENTIFIC REPORTS
10
www.nature.com/scientificreports/
Acknowledgements
Authors acknowledge the nancial support received from the Mexican Secretariat of Agriculture, Livestock, Rural Development, Fisheries and Food (SAGARPA) through the project Seeds of Discovery-Sustainable Modernization of Traditional Agriculture project (MasAgro). Authors also acknowledge Diversity Array Technology (DArT), Canberra, Australia and CIMMYT scientists for their contributions.
Author Contributions
S.S. and J.C. conceptualized the experiment; C.S.P., G.F.D., P.F.L., E.S.M., J.I.M., V.M.H.M. and V.Z.V. performed phenotyping; C.S., P.V. and D.S. generated and processed G.B.S. data set; J.B., J.C., K.M., C.S.P., D.J. and P.V. analyzed the data; C.S.P. prepared the dra M.S. and all authors contributed to nalize; all authors have read and approved the M.S.
Additional Information
Supplementary information accompanies this paper at http://www.nature.com/srep
Competing nancial interests: The authors declare no competing nancial interests.
How to cite this article: Saint Pierre, C. et al. Genomic prediction models for grain yield of spring bread wheat in diverse agro-ecological zones. Sci. Rep. 6, 27312; doi: 10.1038/srep27312 (2016).
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the articles Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
SCIENTIFIC REPORTS
11
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Copyright Nature Publishing Group Jun 2016
Abstract
Genomic and pedigree predictions for grain yield and agronomic traits were carried out using high density molecular data on a set of 803 spring wheat lines that were evaluated in 5 sites characterized by several environmental co-variables. Seven statistical models were tested using two random cross-validations schemes. Two other prediction problems were studied, namely predicting the lines' performance at one site with another (pairwise-site) and at untested sites (leave-one-site-out). Grain yield ranged from 3.7 to 9.0 t ha-1 across sites. The best predictability was observed when genotypic and pedigree data were included in the models and their interaction with sites and the environmental co-variables. The leave-one-site-out increased average prediction accuracy over pairwise-site for all the traits, specifically from 0.27 to 0.36 for grain yield. Days to anthesis, maturity, and plant height predictions had high heritability and gave the highest accuracy for prediction models. Genomic and pedigree models coupled with environmental co-variables gave high prediction accuracy due to high genetic correlation between sites. This study provides an example of model prediction considering climate data along-with genomic and pedigree information. Such comprehensive models can be used to achieve rapid enhancement of wheat yield enhancement in current and future climate change scenario.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer