Abbreviations
ANN: artificial neural network
BLUPs: Best Linear Unbiased Predictions
GBLUP: Genomic Best Linear Unbiased Prediction
GP: Genomic Prediction
MLP: Multilayer Perceptron
QTL: Quantitative Trait Loci
QTN: Quantitative Trait Nucleotide
RF: Random Forest
RKHS: Reproducing Kernel Hilbert Spaces
SNP: single nucleotide polymorphism
SVM: Support Vector Machine
SVR: Support Vector Regression
XGBoost: Extreme Gradient Boosting
Introduction
The phenotype of an individual is based on its genetic makeup, the environment and the interplay between them. In plant and animal breeding, the genomic prediction (GP) model, using a genome-wide set of markers, is an integral component of the genomic selection-based approach. 1 A GP model is constructed on a reference population for which both genotypes and corresponding phenotypes are known, mostly employing a cross-validation strategy, and applied to related populations with only genotypes known. The total genomic value, estimated from the GP model, is used as a pseudo-phenotype to select the best parents for the next generation(s). In general, phenotypes differ from each other in terms of their genetic complexity, ranging from simple/monogenic to complex/polygenic. These differences impact the potential performance of GP. Complex traits are predominantly governed by a combination of additive and non-additive (e.g. dominant/recessive, epistatic etc.) allele effects, which makes GP challenging for these traits. 2 The genetic architecture of complex traits is characterized by moderate to large numbers of Quantitative Trait Loci (QTLs) with small to medium effect sizes and no or few large effect QTLs. 3 Moreover, the ratio of additive to non-additive genetic variance may differ even for closely related traits. Besides the actual genetic variance level, its distribution over the genome is also a determinant of the trait architecture. 4 Next to genetic architecture, population structure plays a role as well ( Figure 1): prediction accuracies are influenced by inconsistent relatedness among samples due to ancestral allele frequency imbalance among sub-populations (population structure) or cryptic structures, e.g. familial relationships; linkage disequilibrium (LD) structure, due to inbreeding or selection pressure; varying relatedness between training and test populations, e.g. over the course of a breeding cycle; and sizes of reference and effective populations. 5
Figure 1.
Genomic prediction characteristics.
Factors affecting genomic prediction performance, often measured as correlation between true phenotype values and those predicted by a model.
Technological advances and statistical frameworks used bring new challenges ( Figure 1). Genotyping and/or phenotyping technologies can now generate millions of markers and thousands of phenotypic measurements, e.g. in time series, increasing the dimensionality of the prediction problem. For example, using a high-density SNP array (or imputing SNPs based on a low-density array) increases the likelihood of getting many markers in LD with the true QTL (high SNP-QTL LD). It can increase total explained variance, 6 but may induce multicollinearity among SNPs. Consequently, SNP selection prior to predictive modelling has been reported to provide superior performance compared to simply using a dense marker set. 7 In contrast, low-density genotyping can miss important SNPs in LD with, or weakly linked to, the QTLs, leading to inferior prediction performance. 8
Statistical genetics approaches have traditionally focused on formulating phenotype prediction as a parametric regression of one or more phenotypes on genomic markers, treating non-genetic effects as fixed or random in a linear equation. The resulting GP models are biologically interpretable but might yield poor performance for complex phenotypes, as linear regression fails to capture the more complex relations. 9 This approach also requires proper translation of prior knowledge on the genetics underlying phenotypes into parametric distributions. Although statistical distributions can help describe genetic architecture, devising a specific distribution for each phenotype is impractical. Therefore, many variations of linear regression were proposed by relaxing statistical assumptions; the main differences lie in their estimation framework and prior assumptions on the random effects (for an overview, see ‘Models’). Alternatively, machine learning (ML) offers a more general set of non-parametric methods that can model phenotypes as (non)linear combinations of genotypes. Moreover, these methods can jointly model the problem, e.g. strong learners can be stacked 10 or weak learners can be combined in an ensemble. Examples include Support Vector Machines (SVMs), (ensembles of) decision trees and artificial neural networks (ANNs). No statistical assumptions are required in advance; therefore, these methods should be able to pick up more complex genetic signals that are missed by linear models. The downside is the large amount of data required to learn these models.
The performance of ML methods in GP problems has previously been compared using simulated and real phenotypes. Some were found to perform better under non-additive allelic activity 11 , 12 ; however, a clear link between simulated and real phenotypes is often missing, or only a specific breeding population structure is considered. For example, Barbosa
In this paper, we investigate which GP characteristics (genetic architecture, population properties and genotype/phenotype measurement technology) a priori point to a better performance for either traditional statistical approaches or ML-based methods. We compare GP performance of two ensemble methods, Random Forests (RF) and Extreme Gradient Boosting (XGBoost), to that of linear mixed models, GBLUP, BayesA, BayesB and RKHS regression with averaged multi-Gaussian kernels. We focus on typical applications in plant breeding to explore various GP characteristics, including the ratio of the total number of markers to the number of samples (
Methods
Data
Simulations
In a first experiment, artificial genotypes were simulated, in combination with associated phenotype values. Genotype data was simulated for a diploid population with a minor allele frequency (MAF) of 0.4, using a binomial distribution, where each allele was the outcome of a binomial trial. The genotype dataset was coded as {0=AA, 1=Aa, 2=aa}. We fixed the MAF for all SNPs so as not to incorporate the impact of allele frequencies, because the MAF of a QTL can affect its heritability estimate and ultimately the prediction accuracy of the GP model. Moreover, in this way we obtained equal and reasonable statistical power for each SNP during allele effect estimation. To explore GP characteristics ( Figure 1), different levels of genetic complexity and dimensionality, defined as the ratio of the total number of SNPs to the sample size (
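The genotype and additive-phenotype simulation described above can be sketched as follows. This is an illustrative Python version (the paper used the simplePHENOTYPES R package); the 500/3k scenario, the 50 QTNs, and the equal effect sizes are example choices taken from Table 1, not a reproduction of the exact pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_genotypes(n, p, maf=0.4):
    # Each of the two alleles is an independent binomial trial with
    # success probability equal to the MAF; dosages are coded {0, 1, 2}.
    return rng.binomial(2, maf, size=(n, p))

def simulate_additive_phenotype(X, n_qtn, h2):
    # Draw n_qtn QTNs at random, give each an equal additive effect,
    # and scale Gaussian noise so the realised narrow-sense
    # heritability is approximately h2.
    n, p = X.shape
    qtn = rng.choice(p, size=n_qtn, replace=False)
    g = X[:, qtn].sum(axis=1).astype(float)
    var_e = g.var() * (1.0 - h2) / h2
    y = g + rng.normal(0.0, np.sqrt(var_e), size=n)
    return y, qtn

X = simulate_genotypes(500, 3000)   # the 500/3k scenario of Table 1
y, qtn = simulate_additive_phenotype(X, n_qtn=50, h2=0.4)
```

Epistatic phenotypes would add pairwise QTN interaction terms on top of the additive genetic value, with the noise scaled to the chosen broad-sense heritability instead.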
Phenotype datasets were generated using the simplePHENOTYPES v1.3.0 (RRID:SCR_022523) R package. 21 Additive polygenic phenotypes were simulated using additive modes of allele effects, as follows:
(1)
Here,
For non-additive phenotypes, broad-sense heritability was set at most to 0.8, so the distribution of residuals is
(2)
The epistatic heritability (
Table 1.
Simulation scenarios permutations.
#Markers/#Samples | Ratio (#Markers:#Samples) | #QTNs | #Markers/#QTNs | Ratio (#Markers:#QTNs) | Heritability (additive phenotypes) | Heritability, additive+epistatic (epistatic phenotypes)
---|---|---|---|---|---|---
500/3k | 0.17 | 5, 50, 100, 250, 500 | 500/5, 500/50, 500/100, 500/250, 500/500 | 100, 10, 5, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
500/2k | 0.25 | 5, 50, 100, 250, 500 | 500/5, 500/50, 500/100, 500/250, 500/500 | 100, 10, 5, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
500/1k | 0.50 | 5, 50, 100, 250, 500 | 500/5, 500/50, 500/100, 500/250, 500/500 | 100, 10, 5, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
500/500 | 1 | 5, 50, 100, 250, 500 | 500/5, 500/50, 500/100, 500/250, 500/500 | 100, 10, 5, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
1k/500 | 2 | 5, 50, 100, 500, 1k | 1k/5, 1k/50, 1k/100, 1k/500, 1k/1k | 200, 20, 10, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
5k/500 | 10 | 5, 50, 100, 2.5k, 5k | 5k/5, 5k/50, 5k/100, 5k/2.5k, 5k/5k | 1k, 100, 50, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
10k/500 | 20 | 5, 50, 100, 5k, 10k | 10k/5, 10k/50, 10k/100, 10k/5k, 10k/10k | 2k, 200, 100, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
20k/500 | 40 | 5, 50, 100, 10k, 20k | 20k/5, 20k/50, 20k/100, 20k/10k, 20k/20k | 4k, 400, 200, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
60k/500 | 120 | 5, 50, 100, 30k, 60k | 60k/5, 60k/50, 60k/100, 60k/30k, 60k/60k | 12k, 1.2k, 600, 2, 1 | 0.1, 0.4, 0.7 | 0.7+0.1, 0.4+0.4, 0.1+0.7 |
Note: 1k = 1000.
Real datasets
To compare trends observed in simulations with outcomes obtained for real traits, publicly available wheat genotype and phenotype data were taken from Norman, Taylor et al. 24 This includes 13 traits: biomass, glaucousness, grain protein, grain yield, greenness, growth habit, leaf loss, leaf width, Normalised Difference Vegetative Index (NDVI), physiological yellows, plant height, test weight (TW) and thousand kernel weight (TKW). This particular dataset was chosen as it contains a fairly large number of genotypes (
The data was generated from a small-plot field experiment for pre-screening of germplasm, in which some genotypes were sown in multiple plots; the data thus contain spatial heterogeneity, with correlation between closely located plots and an imbalance in the number of phenotypes per genotype. Soil elevation and salinity, spatial coordinates and virtual blocks (made available on request by the authors) were taken as covariates:
(3)
Here, X is the
Population structure analysis
To analyse the influence of population structure on the performance of different GP methods, we used a population of the
The effect of population structure was also assessed on real data: a genotype dataset of 300 out of the 1,307 RegMap accessions, phenotyped for the sodium accumulation trait with a strongly associated gene. 29 This should resemble one of our simulation scenarios, i.e. high heritability (e.g.
To correct for population structure, we used principal components corresponding to the top ten highest eigenvalues as fixed effects in the models for GBLUP, RKHS regression, BayesA and BayesB. 30 Principal component analysis (PCA) was performed on the allele dosage matrix using the prcomp() method in R, with centering and scaling. For random forest and XGBoost, we used these top principal components as additional features in the models.
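As a sketch of this correction step, the following is a Python analogue of R's prcomp with centring and scaling; the choice of the top ten components follows the text, while the toy dosage matrix is invented for illustration:

```python
import numpy as np

def top_principal_components(dosage, k=10):
    # Centre and scale each SNP column (mirroring R's
    # prcomp(center = TRUE, scale. = TRUE)), then take the sample
    # scores on the k components with the largest eigenvalues.
    Z = (dosage - dosage.mean(axis=0)) / dosage.std(axis=0, ddof=1)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * s[:k]          # PC scores, shape (n, k)

rng = np.random.default_rng(1)
X = rng.binomial(2, 0.3, size=(200, 1000)).astype(float)
pcs = top_principal_components(X)

# LMMs take the scores as fixed effects; for RF/XGBoost they are
# simply appended as extra feature columns:
X_ml = np.hstack([X, pcs])
```

Appending the scores as features lets the tree ensembles "see" the same structure information that enters the mixed models as fixed effects.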
Analysis of SNP-QTN linkage disequilibrium (LD)
To explore the impact of varying LD between SNP markers and actual QTNs on the performance of GP methods, we used two other datasets: one with real genotypes and simulated phenotypes, the other with real genotypes and real traits.
For the first dataset, we selected a natural population with minimal structure, balanced LD, genotyped at roughly equal genomic spacing and mostly inbred lines: the 360 accessions in the core set of the
For the second dataset, we used three soybean traits (HT: height, YLD: yield and R8: time to R8 developmental stage) phenotyped for the SoyNam population. 32 This dataset contains recombinant inbred lines (RILs) derived from 40 biparental populations, and the set of markers has been extensively selected for the above traits. Moreover, high dimensionality is not an issue, as the dataset contains 5,014 samples and 4,235 SNPs. We refer to this dataset as ‘LD-soy’ in the text. A complete list of the datasets used in this study is provided in Table 2; all have been archived in public repositories. 22 , 23
Table 2.
List of datasets. 23
ID | Description | #Samples | #SNPs
---|---|---|---
simdata | Simulated dataset used to explore GP characteristics of trait genetic complexity, population properties and dimensionality. | See Methods section 2.1.1 for details. | 
Wheat | Real wheat dataset from Norman, Taylor et al. 24 containing 13 traits of varying genetic complexity. | 10,375 | 17,181
STRUCT-simdata | Real structured RegMap panel genotype data of Arabidopsis thaliana, combined with simulated phenotypes. | 1,307 | 15,662
STRUCT-realdata | A subset of 300 RegMap accessions phenotyped for the sodium accumulation trait. | 300 | 169,881
LD-simdata | An unstructured set of accessions from the core set described in Methods, used to study the impact of SNP-QTN LD. | 344 | 48,343
LD-soy | Real soybean dataset with real phenotypes (R8: time to R8 developmental stage, HT: height and YLD: yield) for studying the impact of low SNP-QTN LD. 32 | 5,014 | 4,235
Models
A wide range of statistical models have been proposed for GP. Most widely applied are Linear Mixed Models (LMMs), which use whole-genome regression to tackle multicollinearity and high-dimensionality with shrinkage during parameter estimation, employing either a frequentist approach, e.g. restricted maximum likelihood (REML), or Bayesian theory. 33 Below, we briefly describe the GP methods used in our experiments. For (semi) parametric methods, we used BGLR v1.1.0 (RRID:SCR_022522) with default settings of hyperparameters 34 ; for Random Forests, the ranger R package v0.14.1 (RRID:SCR_022521) 35 ; and for XGBoost, h2o4gpu v0.3.3 (RRID:SCR_022520). 36
Parametric models
The genomic best linear unbiased prediction (GBLUP) method uses a Gaussian prior with equal variance for all markers and a covariance matrix between individuals, called the genomic relationship matrix (GRM), calculated using identity by state (IBS) distances between markers for each pair of samples. 37 SNP effects are modelled as random effects that follow a normal distribution with zero mean and common variance, and are estimated by solving the mixed model equation:
(4)
Here, g is an
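A minimal numeric sketch of GBLUP in its kernel-ridge form may help. The VanRaden GRM and the shrinkage ratio lam = sigma2_e/sigma2_g are standard formulations; the toy data, dimensions and lam value are assumptions for the example, not the paper's BGLR configuration:

```python
import numpy as np

rng = np.random.default_rng(2)

def grm_vanraden(X):
    # VanRaden's genomic relationship matrix from {0,1,2} dosages:
    # G = W W' / (2 * sum_j p_j (1 - p_j)), with W centred by 2 p_j.
    p = X.mean(axis=0) / 2.0
    W = X - 2.0 * p
    return W @ W.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(G, y, train, test, lam):
    # Kernel-ridge form of GBLUP: with lam = sigma2_e / sigma2_g,
    # g_hat(test) = G[test,train] (G[train,train] + lam I)^-1 (y - ybar).
    A = G[np.ix_(train, train)] + lam * np.eye(len(train))
    alpha = np.linalg.solve(A, y[train] - y[train].mean())
    return y[train].mean() + G[np.ix_(test, train)] @ alpha

# Toy run on simulated additive data (an illustrative setup):
X = rng.binomial(2, 0.3, size=(300, 1000)).astype(float)
beta = np.zeros(1000)
beta[rng.choice(1000, size=50, replace=False)] = 1.0
g_true = X @ beta
y = g_true + rng.normal(0.0, g_true.std(), size=300)   # h2 ~ 0.5

G = grm_vanraden(X)
train, test = np.arange(250), np.arange(250, 300)
pred = gblup_predict(G, y, train, test, lam=1.0)
r = np.corrcoef(pred, y[test])[0, 1]
```

In practice lam is not fixed but estimated from the variance components (e.g. by REML), which is what the mixed-model software does internally.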
Several Bayesian methods with slight variations in their prior distributions have been proposed to model different genetic architectures, 39 e.g. BayesA, using a scaled
(5)
Here, μ is the intercept, x
Semi-parametric models
Reproducing Kernel Hilbert Spaces (RKHS) regression is a general semiparametric method that models pairwise distances between samples by a Gaussian kernel and can therefore better capture nonlinear relationships than GBLUP. In fact, GBLUP is a special case of RKHS regression, with a linear kernel. 42 , 43 We used RKHS regression as a representative semi-parametric model, because it not only employs prior assumptions for random components in the LMM equation (6), but also learns hyperparameters from the data itself:
(6)
In contrast to the GBLUP model (4), the RKHS regression model has three random genetic components, such that g
(7)
The kernel
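The averaged multi-Gaussian kernel can be sketched as follows; the bandwidth grid (0.25, 1, 4) and the mean-distance scaling are illustrative assumptions, not BGLR's exact defaults:

```python
import numpy as np

def averaged_gaussian_kernel(X, bandwidths=(0.25, 1.0, 4.0)):
    # Squared-Euclidean distances between genotypes, scaled by their
    # mean so the bandwidth grid is unitless, then an average of
    # Gaussian kernels over the grid (multi-kernel averaging).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    D = sq / sq.mean()
    return np.mean([np.exp(-h * D) for h in bandwidths], axis=0)

rng = np.random.default_rng(5)
X = rng.binomial(2, 0.4, size=(60, 200)).astype(float)
K = averaged_gaussian_kernel(X)
# K can then replace the genomic relationship matrix in the
# mixed-model equations.
```

Averaging over several bandwidths avoids committing to a single smoothness level and is what makes RKHS regression with a multi-kernel more flexible than a single fixed kernel.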
Ensemble machine learning models
The Random Forest (RF) regressor uses an ensemble of decision trees (DTs), each grown on a bootstrap sample (random sampling with replacement) and a random subset of SNPs. The prediction for a test sample is made by averaging over all unpruned DTs:
(8)
Here
We used XGBoost, a specific implementation of the Gradient Boosting (GB) method. Like Random Forest, Gradient Boosting is an ensemble method using weak learners such as DTs. The main difference is that an RF aggregates independent DTs trained on random subsets of data (bagging), whereas GB grows the ensemble iteratively (boosting): each subsequent DT focuses on the samples that the previous DTs predicted poorly, via sample weights updated after each iteration.
Hyperparameters were tuned using a grid search through five-fold cross-validation on each training data fold. We searched over max_depth = {2, 3, 4, 50, 100, 500}, colsample_bytree = {0.1, 0.2, 0.3, 0.5, 0.7, 0.9} and subsample = {0.7, 0.8, 0.9}.
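A reduced version of this search can be sketched with scikit-learn. The study used XGBoost via h2o4gpu; here GradientBoostingRegressor stands in, with max_features playing the role of colsample_bytree, and the tiny data and shrunken grid are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(6)
X = rng.binomial(2, 0.4, size=(120, 30)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(0.0, 1.0, size=120)

# Much-reduced grid for illustration; the study searched
# max_depth = {2, 3, 4, 50, 100, 500},
# colsample_bytree = {0.1, 0.2, 0.3, 0.5, 0.7, 0.9} and
# subsample = {0.7, 0.8, 0.9}.
grid = {
    "max_depth": [2, 3],
    "subsample": [0.7, 0.9],
    "max_features": [0.3, 0.7],
}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    grid,
    cv=5,               # five-fold CV within the training fold
    scoring="r2",
)
search.fit(X, y)
best = search.best_params_
```

Nesting the grid search inside each outer training fold, as described above, keeps the test fold untouched during tuning and so avoids optimistic bias.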
Performance evaluation
Model performance was evaluated based on prediction accuracy, which was measured as the Pearson correlation coefficient (
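In code, this evaluation amounts to the following (the fold data are invented for illustration):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Prediction accuracy as the Pearson correlation coefficient r
    # between observed and predicted phenotypes.
    return np.corrcoef(y_true, y_pred)[0, 1]

# Averaged over cross-validation folds, with one
# (truth, prediction) pair collected per test fold:
folds = [
    (np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.1, 1.9, 3.2, 3.8])),
    (np.array([0.5, 1.5, 2.5, 3.5]), np.array([0.7, 1.2, 2.9, 3.1])),
]
mean_r = float(np.mean([accuracy(t, p) for t, p in folds]))
```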
Assessment of trait non-additivity
To link GP performance in simulation scenarios with performance on real data, an assessment of the nature of real traits (i.e. additive or epistatic) was used. To obtain a proxy for additivity of the trait, we assumed that if a trait has a higher proportion of additive variance compared to other traits, estimated with the same model, it will be more additive. To verify this on our simulated dataset scenarios ( Table 1) for epistatic phenotypes, we used the linear mixed model:
(9)
Here
g
Results
ML outperforms traditional methods for GP
Previously, numerous GP methods were tested for different traits of varying genetic architectures using low or high density marker sets, but it is still unclear for which (class of) GP problems applying machine learning (ML) can be beneficial. 9 To investigate the role of underlying characteristics ( Figure 1), we generated an extensive set of simulated genotype-phenotype data (simdata: see Section ‘Simulations’). This data was analysed using the linear parametric methods GBLUP, BayesA and BayesB; the nonlinear semi-parametric regression method RKHS, using a Gaussian multi-kernel evaluated as average squared-Euclidean distance between genotypes 42 ; and popular nonlinear ML methods, i.e. support vector regressor (SVR), random forest regressor (RF), extreme gradient boosting (XGBoost) regression trees and a fully-connected feed-forward artificial neural network, i.e. Multilayer Perceptron (MLP). The simulations covered a variety of trait scenarios (from simple to more complex), as shown in Table 1. Simple oligogenic traits correspond to simulation scenarios with larger heritabilities, additive allele effects and small numbers of QTNs; complex traits can have both additive and non-additive allele effects (only epistatic here) with small heritabilities and large numbers of QTNs. For additive phenotypes, narrow-sense heritability was set equal to broad-sense heritability; for the epistatic phenotypes, the sum of narrow-sense and epistatic heritability was set equal to the broad-sense heritability. The extent of phenotypic additivity in both simulations and real datasets was calculated using the ratio of additive genetic variance to the epistatic genetic variance (
ML methods perform well for simple traits
Many non-Mendelian plant traits are fairly simple, in the sense that only one or a few QTLs explain a large proportion of phenotypic variance; these are called oligogenic traits. If these QTLs are identified by the GP model, prediction performance can be quite high. In our simulations ( Table 1), this scenario is investigated using additive phenotypes with narrow-sense heritability (
The results in Figure 2A, Figure 2B and Figure 2C illustrate that the performance of Bayesian methods and ML was significantly better (p value < 0.01; Extended Data, Table S1 48 ) than that of genomic relationship-based methods (GBLUP, RKHS). The performance of ML methods was slightly poorer than that of Bayesian methods when all QTN effects were equal ( Figure 2A) or sampled from a Gaussian distribution ( Figure 2C), but comparable when one of them had a larger effect size ( Figure 2B). Therefore, although not outperforming the other methods, ensemble ML methods seem to be reasonable choices for simple traits.
Figure 2.
Comparison of prediction performances using simulated simple and complex phenotypes.
Performance of parametric (GBLUP), semi-parametric regression (RKHS), parametric Bayesian (BayesA, BayesB) and nonparametric ML (RF and XGBoost) methods as average accuracy over
ML methods outperform parametric methods for complex traits
Complex polygenic traits may contain a large effect QTL along with many small to medium effect QTLs. 49 Even assuming perfect LD between SNPs and the corresponding QTLs, detecting such QTLs remains challenging with conventional univariate regression models followed by strict multiple-testing corrections. Moreover, shrinkage of random effects towards zero in multivariate regression models restricts them from growing too large. Thus, many true small effects may be ignored in the analysis. SNPs may also have non-additive effects, which could cause a large amount of variance to remain unexplained and narrow-sense heritabilities to be low, when modelled by their additive action only.
This genetic complexity was simulated by increasing the number of QTNs, decreasing the narrow-sense heritability and keeping overall effect sizes equal, thereby letting the effect sizes per QTN become proportionally smaller. The QTNs were randomly chosen from the simulated SNPs pool by setting
ML methods are generally suitable for epistatic phenotypes
For complex phenotypes, we observed that ML outperformed LMMs for highly polygenic phenotypes with epistatic effects and was equivalent to Bayesian LMMs when at least one QTN had a larger effect ( Figure 2D and E). To explore further, we investigated a range of additive and non-additive fractions of heritability, with or without a large effect QTN and with effect sizes drawn from a Gaussian distribution, as defined in our simulation scenarios ( Table 1).
For additive phenotypes with equal QTN effect sizes, performance of ML methods was poorer than that of Bayesian methods under all scenarios; with an increase in genetic complexity (lowering
For the phenotypes explained by a large effect QTN and many small effect QTNs (Extended Data: Figures S3A and S3B 27 ), Bayesian methods perform comparably to ML methods for both additive and epistatic phenotypes under all simulation scenarios, although RF gave slightly better performance for epistatic phenotypes with large epistatic heritability (for
For both additive and epistatic phenotypes (Extended Data: Figures S4A and S4B 27 ), the ensemble ML methods were still superior to BLUPs and comparable to Bayesian methods when effect sizes were sampled from a Gaussian distribution for a small number of QTNs (e.g.
In conclusion, our simulation results indicate that ML works well when a fair proportion of broad-sense heritability is contributed by allele interaction effects or a few large effect QTNs.
ML performance is robust to high-dimensional GP
Genomic prediction is usually employed on a genome-wide set of markers to yield a total genomic value, while the training population size is limited, i.e. a high-dimensional problem. Having many SNPs in LD with a QTL increases the statistical power to detect it, but the genetic variance gets spread over these correlated SNPs when their contributions are added together. Consequently, this can lead to an overestimation of allelic variances or genomic relationships, overfitting on training samples and reduced performance on unseen data. To investigate the susceptibility of different GP methods to this issue, we analysed how prediction accuracy varied depending on the ratio of markers vs samples (c =
In general, the results with different simulation settings of ‘simdata’ for additive phenotypes show that performance is negatively related to an increase in dimensionality when main effects became smaller due to decreasing heritability or an increasing total number of QTNs (Extended Data: Figure S2A, Figure S3A and Figure S4A 27 ). This implies that for simple traits having one or a few large effect QTNs ( Figure 2A to C), performance degradation is not a severe issue for Bayesian and ML methods, but it can still be a potential problem for genetic distance-based methods, i.e. GBLUP and RKHS, presumably because of the increased number of uninformative markers in calculating the genetic kinships. For the epistatic phenotypes, high dimensionality still does not affect ML as long as main effects are sufficiently large ( Figure 2A and 2B; Extended Data: Figure S2B, Figure S3B and Figure S4B 27 ). Here, for the case when main effects were sampled from a Gaussian distribution, increasing polygenicity is analogous to having many small main effects; so, despite the presence of epistatic effects, performance goes down for all methods. In a nutshell, this shows that the conclusions drawn in Section ‘ML methods perform well for simple traits’ and Section ‘ML methods outperform parametric methods for complex traits’ hold under high dimensionality.
Case study in wheat
To see whether our simulation results hold on real traits, we used a dataset of 13 wheat traits 24 for a fairly large number of samples (10,375 lines) and 17,181 markers (
Figure 3.
Prediction accuracies of wheat traits.
Top: prediction accuracies for GP models on wheat traits, reported as the mean Pearson correlation coefficient (
ML methods are sensitive to population structure
Population structure (PS) is a well-known confounding factor that results in decreased diversity in training populations 25 and unrealistically inflated parameter estimates, e.g. for (co)variances of random effects in LMMs. 51 Parametric and nonparametric ML methods, based on their modelling assumptions and approaches, may be differently sensitive to PS. To assess the impact of population structure on ML methods, we used real genotype data with a known population structure and combined it with both simulated (STRUCT-simdata) and real phenotypes (STRUCT-realdata). Only additive phenotypes were simulated, with varying complexity and dimensionality scenarios, as described earlier in Section ‘Simulations’. The STRUCT-simdata contains all 1,307
Correction for PS was carried out by including the top ten principal components, corresponding to the largest eigenvalues, as fixed effects in the mixed model equations or as additional features for the ML methods. For the simulated phenotypes (Extended Data, Figure S6 27 ), the average pairwise difference of test accuracies before and after correcting for PS was slightly higher for ML methods (RF: 0.03 and XGBoost: 0.04) than for LMMs (GBLUP: 0.01, RKHS: 0.01, BayesA: 0.01 and BayesB: 0.00). Moreover, the correction resulted in relatively elevated accuracies for the scenarios with larger numbers of QTNs or low heritabilities. This illustrates that with smaller #QTNs and larger heritabilities (
To further explore this behaviour, we used real phenotypes of the sodium accumulation trait in
Figure 4.
Effect of correction for population structure for the sodium accumulation trait in
Boxplots present Pearson correlation coefficients (
ML methods can tackle low SNP-QTN LD
The utility of GP in genomic selection is based on the assumption that a densely genotyped marker set contains ample markers in LD with the QTLs. 1 The actual QTNs are generally unknown, but SNPs in LD can be used to (partially) capture their effect, depending on the actual correlation and allele frequencies. Therefore, it is worthwhile to investigate the impact of SNP-QTN correlation levels on GP performance. 52 We used two settings, one with real genotypes and simulated phenotypes (LD-simdata), a second with real genotypes and real traits (LD-soy).
In simulations, GP model performance is evaluated based on the difference in prediction accuracies between a model trained on the actual QTNs and a model trained on SNPs in LD (QTN-linked SNPs). Our results show that when SNPs are highly correlated to QTNs (which is likely the case for densely genotyped marker sets and
Figure 5.
Effect of SNP-QTN LD on prediction accuracy.
Prediction accuracy of different GP methods on simulated (A) and real soybean (B) datasets for high and low LD between SNPs and actual QTNs. The difference in median accuracies between these scenarios is indicated as Δ
As a real genotype and phenotype dataset, we used three soybean traits, i.e. height, time to R8 developmental stage and yield (LD-soy). The motivation was to choose a low-dimensional real dataset with highly correlated SNPs, to understand the impact of SNP-QTL LD only. The complete set of markers (4,235 SNPs) had many correlated SNPs, such that only 261 were left with low LD (
In conclusion, GP methods that model SNP-QTN or SNP-SNP relation as a nonlinear function (RKHS, RF, XGBoost) were more stable under low SNP-QTN LD compared to other methods (GBLUP, BayesA, BayesB). Moreover, RF seems to couple good prediction performance with reliability under low SNP-QTN LD.
Discussion
There is room for ML in genomic prediction
Genomic prediction has long been the realm of parametric methods, but recently nonlinear supervised ML methods have become increasingly popular. Yet the literature is unclear on the characteristics of GP problems that warrant application of ML methods. This study fills this gap and concludes that nonlinear tree-based ensemble ML methods, especially Random Forests, can be a safe choice alongside traditional methods for simple as well as complex polygenic traits where epistatic allele interaction effects are present. We simulated different scenarios mimicking reality at a broader level, e.g. the case of simple oligogenic traits ( Figure 2, panels A, B and C), where they outperformed BLUPs but not Bayesian LMMs. A similar trend can be observed in the real data of the sodium accumulation trait ( Figure 4), where we studied the impact of population structure. On the other hand, for complex trait scenarios ( Figure 2, panels D, E and F), Random Forests either outperformed the other methods when again a large effect was present (panels D and E) or were inferior when all effects followed a Gaussian distribution (panel F). The latter (panel F) is prevalently observed for many complex traits, where accuracies are roughly comparable for all methods. Moreover, ML methods are robust to high dimensionality, although further improvements, e.g. statistical or prior-knowledge-driven regularization, may improve performance. ML methods are particularly useful compared to the frequently used GBLUP and RKHS regression given their higher performance. While Bayesian methods often perform on par with ML models, this is mainly when there are large effect QTNs and/or additive phenotypes. Moreover, Bayesian methods are prone to overfitting in case of small sample sizes (
Tree-based ensemble ML methods are a reasonable choice for GP
A wide range of parametric, semi-parametric and nonparametric methods can be used for GP, but it is impractical to test all of them for a particular application. The choice of a suitable method strongly depends on the GP problem characteristics described in Figure 1. While GP methodology can be compared using various model evaluation metrics (BIC, AIC, log likelihoods), we focused on their utility from a breeder’s perspective, so we compared only their prediction accuracies. We found that GP methods based on modelling the distance between genotypes using covariance structure(s) inferred from genomic markers (GBLUP and RKHS) were generally inferior to Bayesian and ML methods and less robust to high-dimensional problems, likely because all of the
The parametric LMM equations can be solved using a Bayesian framework. Bayesian methods define prior SNP effects distributions to model different genetic architectures. Instead of a single distribution for all marker effects (e.g. BRR), it could be defined for each individual marker (e.g. BayesA). Mixture distributions have also been proposed (e.g. BayesC, BayesB). From the Bayesian alphabet, we used BayesA and BayesB as representatives because the first scenario, i.e. a single distribution for all markers, has been covered by GBLUP. Our results illustrate that these methods outperform GBLUP and RKHS regression when large effect QTNs are present, for both additive and epistatic phenotypes. On the other hand, tree-based ensemble ML methods had either comparable performance to Bayesian methods (for simple traits) or superior performance (for complex traits). Capitalising on the results from Appendix-I (Extended data 27 ) that these ML methods had better performances than other ML methods (SVR and MLP), we can argue that these tree-based ML methods are a reasonable choice to conduct GP.
Population structure analysis
Population structure can affect GP performance. Our results show that test accuracies were lower without correction for population structure than with correction, for all methods. However, ML seems to be slightly more sensitive, because the average pairwise difference in test accuracies before and after correction was higher than for the other methods on the simulated data.
Confounding due to population structure can also arise from the frequently employed random cross-validation strategy for predictive modelling. 25 In random cross-validation, the reference population is randomly divided into subsets, one of which is iteratively selected for testing while the remaining subsets are used to train the model. While every sample is part of a test set exactly once, under population structure some subpopulations may be over- or under-represented in the training set. As a result, the model may overfit. A solution could be to use stratified sampling instead. Moreover, parameter estimation may be misled by within-subgroup allele frequency differences rather than the overall true phenotype-associated variance.
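The representation problem and the stratified fix can be sketched with a toy example. The two subpopulations and their sizes below are hypothetical, not the study's data:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical reference population: two subpopulations of unequal size.
samples = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]

def random_folds(samples, k=5):
    """Plain random cross-validation: fold composition varies by chance."""
    idx = samples[:]
    random.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def stratified_folds(samples, k=5):
    """Stratified cross-validation: each fold preserves subpopulation ratios."""
    folds = [[] for _ in range(k)]
    by_pop = {}
    for s in samples:
        by_pop.setdefault(s[0], []).append(s)
    for members in by_pop.values():
        random.shuffle(members)
        for i, s in enumerate(members):
            folds[i % k].append(s)
    return folds

for fold in stratified_folds(samples):
    print(Counter(pop for pop, _ in fold))  # every fold keeps the 80:20 ratio
```

With plain random folds, a small subpopulation can by chance end up concentrated in one fold, so the model trained on the remaining folds barely sees it; stratification guarantees each fold mirrors the population composition.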
The impact of population structure can be dealt with in many ways. Conventionally, principal components of the SNP dosages or the genomic relationship matrix are introduced as fixed effects in the mixed model equations. 54 , 55 Alternatively, phenotypes and genotypes can be adjusted by the axes of variation before predictive modelling. 5 Nevertheless, some residual structure often remains in the datasets, so it is important to check the sensitivity of GP models to this confounding factor. Since the ML methods (RF and XGBoost) do not employ any statistical prior and learn association patterns from the data itself, they may be more sensitive to structure, as we found in our simulation results. However, this was not clearly evident for the real phenotypes, so we cannot generalize this conclusion beyond our simulations.
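The PC-based adjustment of phenotypes mentioned above can be sketched as follows. The SNP dosages and phenotype are random stand-in data, and retaining five axes of variation is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n lines x m SNP dosages (0/1/2) and a phenotype vector.
n, m = 100, 500
X = rng.integers(0, 3, size=(n, m)).astype(float)
y = rng.normal(size=n)

# Leading principal components of the centred SNP matrix capture structure.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :5] * S[:5]          # scores on the first five axes of variation

# Adjust the phenotype: regress y on the PCs and keep the residual.
P = np.column_stack([np.ones(n), pcs])
beta, *_ = np.linalg.lstsq(P, y, rcond=None)
y_adj = y - P @ beta            # structure-corrected phenotype
```

The residual `y_adj` is orthogonal to the retained axes, so downstream GP models fit on it cannot exploit that part of the structure; equivalently, the PCs can be carried as fixed effects inside the mixed model itself.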
Effect of SNP-QTN linkage disequilibrium
Despite technological improvements, low density SNP panels are usually more cost-effective for routine genomic selection. Increasing marker density does not necessarily increase prediction accuracy, since accuracy is not a function of SNP density alone. 56 – 58 Instead, many GP problem characteristics ( Figure 1) jointly affect performance. However, using low density SNP panels can negatively affect prediction performance, since relevant SNPs in LD with the QTLs can either be missing completely or be present only in low LD. As a result, allele frequencies of SNPs and QTNs can differ substantially, resulting in incorrect predictions. 52 Despite this, low SNP density can still be sufficient for populations with large LD blocks, e.g. F2 populations, where QTL detection power is highest; in this case, we should not expect much improvement from increasing marker density. Density becomes an important consideration, however, when LD starts to decay and population relatedness decreases in the subsequent crosses of the breeding cycle. In this context, our study addresses the question of whether certain GP methods, especially ML, are more sensitive to low SNP-QTL LD. The results for both simulated and real traits indicate that SNP-QTL LD can be an important determinant of the choice of GP method and that ML is robust against low LD.
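The SNP-QTN LD discussed here is commonly quantified as the squared allelic correlation r². A minimal sketch with toy phased haplotypes (illustrative data, not from this study):

```python
# r^2 LD between a SNP and a QTN from phased haplotypes (0/1 allele indicators).
def ld_r2(hap_snp, hap_qtn):
    """Squared correlation between allele indicators at two loci."""
    n = len(hap_snp)
    pA = sum(hap_snp) / n                      # SNP allele frequency
    pB = sum(hap_qtn) / n                      # QTN allele frequency
    pAB = sum(a * b for a, b in zip(hap_snp, hap_qtn)) / n
    D = pAB - pA * pB                          # disequilibrium coefficient
    return D * D / (pA * (1 - pA) * pB * (1 - pB))

# Perfect LD: the SNP tags the QTN on every haplotype.
print(ld_r2([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
# No LD: the SNP carries no information about the QTN allele.
print(ld_r2([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0
```

When r² is low, the SNP explains only a fraction r² of the QTN's variance at best, which is the mechanism behind the accuracy losses described above.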
A weak SNP-QTN correlation implies that the SNP is a weak predictor of the phenotype and that there is an imperfect match between the genotypic distribution and the actual underlying genetic distribution of the phenotype. In penalized regressions, this can result in a different shrinkage for the SNP than the actual QTN would require, leading to a low genetic variance attribution to that SNP. We may therefore expect better predictions from nonparametric ML methods, as they may better learn weak genetic signals and be more robust to low SNP-QTL LD. On the other hand, the semiparametric RKHS regression method, which measures genetic similarity between individuals by a nonlinear Gaussian kernel on SNP markers, also performed better than GBLUP and the Bayesian methods under low SNP-QTN LD. The reason could be that under low SNP-QTN LD, estimation of the true pairwise genetic covariance becomes less accurate, because many important markers are lost and the remaining markers are all treated as equal contributors to the total genetic covariance. In the case of RKHS regression, a Gaussian kernel defines each SNP's probable contribution towards the total genetic covariance, which becomes more realistic in this scenario because fewer important SNPs remain than in the high SNP-QTN LD case. The Bayesian methods (BayesA and BayesB) showed the largest decrease in test performance under low compared to high SNP-QTN LD. This could be due to the penalties applied to individual marker effects, which shrink weak SNP-QTN associations towards zero.
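The contrast between GBLUP's linear genomic relationship matrix and RKHS's nonlinear Gaussian kernel can be sketched as follows. The dosage matrix is random stand-in data, G follows VanRaden's first method, and the mean-distance bandwidth is one common heuristic, not necessarily the one used in this study:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical SNP dosage matrix: n lines x m markers coded 0/1/2.
n, m = 50, 200
X = rng.integers(0, 3, size=(n, m)).astype(float)
p = X.mean(axis=0) / 2.0                      # allele frequencies

# Linear GBLUP-style genomic relationship matrix (VanRaden's first method):
Z = X - 2.0 * p
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

# Nonlinear RKHS-style Gaussian kernel on squared Euclidean SNP distances:
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
theta = sq[sq > 0].mean()                     # a common bandwidth heuristic
K = np.exp(-sq / theta)

print(G.shape, K.shape)  # both are n x n covariance-like matrices
```

G sums marker contributions linearly, so every marker enters the covariance additively; the Gaussian kernel downweights pairs of lines exponentially in their overall SNP distance, which is one way to read the "probable contribution" argument above.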
ML outperformed parametric methods for predicting complex wheat traits
Bread wheat breeding has a huge impact on worldwide food security and socio-economic development. 59 Therefore, even minor improvements in GP methodology that increase overall genetic gain can have a high impact. In this study, we used a large (10,375 lines) Australian germplasm panel, genotyped with a high quality custom Axiom TM Affymetrix SNP array and phenotyped for multiple traits with varying complexity levels. 24 The authors showed that genomic selection was superior to marker-assisted selection (MAS) by employing GBLUP with two random genetic components (referred to as the full model in their text). Our results clearly indicate that ML can perform well for complex bread wheat traits, e.g. grain yield, yellows, greenness, biomass and NDVI. For NDVI, however, the larger difference between LMMs and ML could be due to the low phenotypic variance and heritability of this trait in this dataset. All of these traits except grain yield can be measured using high-throughput automated phenotyping. 60 This is an interesting finding since, with the rapid advances in low cost high-throughput phenotyping systems, attention is shifting towards measuring component traits, e.g. vegetative indices, rather than final yields. As evident from our analysis, ML methods can predict these traits more accurately.
Conclusions and outlook
Based on simulated and real data, we conclude that tree-based ensemble ML methods can be useful for GP of both simple and complex traits. Moreover, these methods work for both low- and high-density genotyped populations and can be a competitive choice for practical plant breeding. In practice, which method works best depends on the particular problem, i.e. the genetic architecture of the trait, population size and structure, and data dimensionality. Between bagged (Random Forest) and boosted (XGBoost) decision tree ensembles, Random Forest seems to be a good first choice for GP given its generalization performance. Furthermore, population structure should be properly corrected for to obtain stable performance. It would be interesting to investigate to what extent these ML methods can benefit from statistical or prior knowledge-based regularization techniques.
Copyright: © 2023 Farooq M et al. This work is published under the Creative Commons Attribution 4.0 License ( https://creativecommons.org/licenses/by/4.0/).
Abstract
Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture.
Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs) and heritability (h2).
Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods.
Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.