Abbreviations
- BLUE: best linear unbiased estimate
- BLUP: best linear unbiased prediction
- GS: genomic selection
- NIRS: near-infrared spectroscopy
- PCA: principal component analysis
- PLSR: partial least squares regression
- PS: phenomic selection
- RF: random forest
- SNP: single nucleotide polymorphism
- SVMR: support vector machine radial
- TTA: total titratable acidity
- UFBBP: University of Florida Blueberry Breeding Program
INTRODUCTION
Blueberry fruits (Vaccinium corymbosum) have garnered recognition as a superfood due to their high contents of antioxidants and other health-promoting compounds (Dias et al., 2023; Kalt et al., 2020). This status has increased the demand for both fresh and frozen fruits within the global market over the past decade (Protzman, 2021). As a result, blueberry growers and marketers require new cultivars that meet current market standards for quality, yield, and shelf life. In this context, tools that can maximize genetic gains by increasing selection accuracy and/or reducing the length of breeding cycles are valuable for supporting breeding decisions.
Conventional blueberry breeding has relied on phenotypic recurrent selection methods (Lyrene, 2005), which result in gradual genetic gains over many generations. With recent advancements in molecular breeding tools, such as marker-assisted selection and genomic selection (GS), genetic improvement can be accelerated. Recent studies have suggested that these tools can reduce the time needed to release a new blueberry cultivar by more than 30% when compared to traditional methods (de Bem Oliveira et al., 2020; Ferrão et al., 2021). Despite these benefits, applying GS is not straightforward. The costs associated with genotyping a large number of seedlings are still too high for routine implementation (Adunola, Ferrao, et al., 2024; de Bem Oliveira et al., 2020). Additionally, the time required to obtain genotypic data can be disadvantageous, particularly when time-sensitive breeding decisions must be made. Phenomic-assisted selection has emerged as a viable alternative to circumvent such challenges. To date, there have been several successful examples of phenomic-assisted breeding across multiple crop species, achieving both cost-effective and high-throughput phenotyping (Adunola, Tavarez, et al., 2024; Biswas et al., 2021; Parmley et al., 2019; Rincent et al., 2018). These advances position phenomics to effectively complement molecular markers in applied breeding populations.
Phenomic selection (PS) can be implemented in different forms using various technologies, such as near-infrared spectroscopy (NIRS), remote sensing, and RGB, multispectral, and hyperspectral imaging. Among these technologies, the use of NIRS to measure the degree of relationship between individuals in a population has gained prominence for prediction in the plant literature due to the non-destructive nature of data collection (Costa et al., 2019; Tsuchikawa & Kobori, 2015). NIRS leverages the near-infrared region of the electromagnetic spectrum to detect the physical and chemical properties of samples (Cozzolino, 2021; Khuwijitjaru, 2018). Key applications of NIRS in agriculture include the quantification of nutrient composition, quality certification, species fingerprinting, and the characterization of chemical bonds (Begley et al., 1984; Lang et al., 2017; Nicolaï et al., 2007).
In the context of PS, NIRS can be used to measure endophenotypic variation among samples (Brault et al., 2022; Robert et al., 2022). Endophenotypes are defined as the different molecular layers between the genotype and phenotype, and they can be used analogously to genomic information to infer the relationship between individuals (Rincent et al., 2018). With NIRS, the spectral profile of an individual can be considered the equivalent of a molecular marker profile. Thus, the motivation is to use NIRS information as predictors of target phenotypes, and then use such information to assess the phenotypic merit of non-phenotyped individuals. Several empirical studies have shown the potential of PS to predict complex traits, with predictive abilities comparable to GS methods (Adunola, Tavarez, et al., 2024; Robert et al., 2022; Winn et al., 2023). For instance, Rincent et al. (2018) found that NIRS-based prediction was equal or superior to GS for wheat and poplar traits. In soybean, similar results were reported by Zhu et al. (2022). In perennial species, a study with grapevine showed comparable predictive results when GS and PS were compared across several traits and years (Brault et al., 2022). Despite these promising results, the implementation of NIRS for PS in fruit crops is lacking. Furthermore, several important questions remain unanswered, including which plant tissues are best to collect NIRS from to predict fruit traits, how the predictive abilities obtained by PS compare to genomic-based methods, and the subsequent impact of these selection methods on the rate of genetic gain.
In blueberry, we hypothesize that PS could offer a cost-effective and high-throughput alternative to GS methods. However, evaluating the efficacy of PS in mature plants is an essential initial step toward understanding its potential for seedling selection. If successful, this approach could be particularly beneficial for early selection in the breeding program. Using the University of Florida Blueberry Breeding Program (UFBBP) as an example, approximately 20,000 seedlings are moved from the greenhouse to a high-density nursery every year (Ferrão et al., 2021). Reducing the number of seedlings advanced to field evaluation and increasing selection intensity are both important for enhancing genetic gains across generations. To this end, we compared the use of NIRS and genomic information in the prediction of fruit quality traits to determine the utility of PS in blueberries. Our objectives are fourfold: (i) conduct a comparative assessment of phenomic, pedigree, and genomic predictive abilities for blueberry fruit quality traits using different statistical models; (ii) compare the impact of using different biological tissues (fruits, leaves, and juice) for phenomic prediction; (iii) optimize PS by evaluating different training set sizes; and (iv) test the predictive performance over different environmental conditions. Altogether, we highlight how genomic and phenomic information can be used within an integrated predictive breeding framework to guide breeding decisions in a perennial and polyploid species.
Core Ideas
- Near-infrared spectroscopy for phenomics showed good predictive performance for fruit quality in blueberry.
- Performance of phenomic selection depends on the tissue material and statistical model.
- Phenomic and genomic selection can be integrated in the same framework to assist blueberry breeders.
MATERIALS AND METHODS
Plant materials
In this study, we used two breeding cycles (hereafter referred to as “breeding populations”) from the UFBBP, named Waldo-2016 (WA-2016) and Waldo-2019 (WA-2019). They consisted of 149 and 223 genotypes, which originated from 101 and 153 full-sib families, respectively. The data collection took place in the UFBBP varietal trial field located in Waldo, FL (29.78° N, 82.16° W), between April and May during the 2022 and 2023 seasons. As part of our experimental checks, nine elite genotypes were evaluated in both years. Each population comprised mature plants, with WA-2016 and WA-2019 evaluated at 6 and 4 years of age, respectively.
In this study, the term “phenotypic” refers to the target trait, while we use the term “phenomics” to refer to the endophenotypes collected via NIRS. For the phenotypic data collection, ripe fruits were collected into two 4.4 oz clamshells. Each clamshell accommodated 25 fruit samples. All samples were stored in a cooling room at 1°C for 1 day before data collection. Phenotypic data, including waxy bloom, Brix, firmness, size, and total titratable acidity (TTA), were collected from all genotypes following the methods outlined in Ferrão et al. (2018). Bloom was evaluated using a visual score as follows: 1 for 0%–25% bloom, 2 for 25%–50%, 3 for 50%–75%, and 4 for 75%–100% bloom coverage (Casorzo et al., 2024). For the determination of total soluble solids (Brix) and TTA, we extracted juice from the fruit and used a digital refractometer (Atago U.S.A., Inc.) and an automatic titrator (Mettler-Toledo, Inc.), respectively. Additionally, we calculated the Brix:TTA ratio for each genotype. Fruit size (mm) and firmness (g mm−1) were determined using a FruitFirm 1000 machine from BioWorks Inc. after the samples had reached room temperature. The firmness measurement employed a force threshold ranging from 50 to 350 g mm−1. Firmness was calculated as the mean force required to deflect the surface of the fruit by 1 mm using a compression probe.
The experiment was designed as an augmented block design with common checks connecting different years. Therefore, blueberry phenotypic data were pre-corrected using a linear model: $y_{ij} = \mu + t_j + g_{ij} + c_{ij} + e_{ij}$, where $y_{ij}$ is the phenotype of response variable $i$ in year $j$; $\mu$ is the overall mean; $t_j$ is the fixed-year effect; $g_{ij}$ and $c_{ij}$ are the genetic effect $i$ of regular individuals and experimental checks at year $j$, respectively; and $e_{ij}$ is the random residual effect. The best linear unbiased estimates (BLUEs) of each genotype were used as the response variable in subsequent analyses. Univariate models were fitted using the ASReml-R package (Butler et al., 2009).
NIRS data
NIRS data were collected from three different blueberry tissues (fruit, leaves, and juice). For the WA-2016 population, spectral data were collected from all tissues after one day of storage at 1°C, while for the WA-2019 population, spectra were collected only from fruits and leaves under the same conditions. For fruit samples, six random fruits were collected per genotype. After collecting the NIRS from the berries, we blended the fruits and extracted their juice, from which new NIR spectra data were collected. For leaf samples, we selected four leaves per genotype. The leaf samples were kept at 1°C and spectra data were collected within 24 h. For all samples, three technical replicates of absorbance data were collected.
All spectra data were measured using the VIAVI MicroNIR OnSite-W device along with its corresponding software. The device was calibrated by the manufacturer to provide wavelengths within the range of 906–1676 nm with a 6 nm step. In total, 123 wavelength data points were obtained per scan. To ensure data quality, we checked for outliers and applied pre-treatment techniques. The spectra data were normalized (centered and scaled), and their second derivative was computed using a Savitzky–Golay filter with a window size of 15 data points, implemented with the pretreat_spectra() function in the waves R package (Hershberger et al., 2021). From the transformed NIR spectra data, we estimated variance components for genotypes, environments, and residuals at each wavelength. Briefly, we used a repeatability model since technical replicates of the NIR spectra were collected from the same individual. The following linear mixed model was used: $\mathbf{y} = \mathbf{Z}_l\mathbf{l} + \mathbf{Z}_g\mathbf{g} + \mathbf{e}$, where $\mathbf{y}$ is the vector of NIRS data for each wavelength; $\mathbf{l}$ is the vector of random effects of the environment defined as $\mathbf{l} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_l)$, where $\mathbf{I}$ is the identity matrix and $\sigma^2_l$ is the location variance; $\mathbf{g}$ is a vector of random genetic effects defined as $\mathbf{g} \sim N(\mathbf{0}, \mathbf{A}\sigma^2_g)$, where $\mathbf{A}$ is the pedigree relationship matrix and $\sigma^2_g$ is the genetic variance; $\mathbf{Z}_l$ and $\mathbf{Z}_g$ are incidence matrices relating the random effects to the observations; and $\mathbf{e}$ is the vector of residual effects following a normal distribution with zero mean and (co)variance matrix $\mathbf{I}\sigma^2_e$.
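The pretreatment step described above can be illustrated with a minimal Python sketch. It is not the waves R implementation used in the study: it assumes that "centered and scaled" means standardizing each spectrum (row) separately, and the polynomial order of 2 and the unit step are assumptions, since only the window size of 15 is stated. The function names are hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter


def standardize_spectra(spectra):
    """Center and scale each spectrum (row) to zero mean and unit variance."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def second_derivative(spectra, window=15, polyorder=2, step=1.0):
    """Savitzky-Golay second derivative along the wavelength axis.

    polyorder and step are assumptions; the text only gives the window size.
    """
    return savgol_filter(
        np.asarray(spectra, dtype=float),
        window_length=window,
        polyorder=polyorder,
        deriv=2,
        delta=step,
        axis=1,
    )


def pretreat(spectra, window=15, polyorder=2, step=1.0):
    """Standardize each spectrum, then take its second derivative."""
    return second_derivative(standardize_spectra(spectra), window, polyorder, step)
```

For a matrix with one row per scan and 123 wavelength columns, `pretreat(spectra)` returns a matrix of the same shape that can be passed to the downstream models.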
Genotyping
Young leaf samples were collected from each genotype. DNA extraction and genotyping were performed by LGC/RAPiD Genomics (Gainesville). The methodology is based on the CaptureSeq technology, in which 10,000 probes were designed for blueberries and sequenced using the Illumina HiSeq platform, with 150-cycle paired-end runs (Benevenuto et al., 2019; Ferrão et al., 2018). For the single nucleotide polymorphism (SNP) calling, sequencing reads were aligned to the Vaccinium corymbosum cv. ‘Draper’ genome (Colle et al., 2019), and SNPs were called using FreeBayes v.1.0.1 (Garrison & Marth, 2012). Because blueberry is an autotetraploid species, allele dosages were estimated using the empirical Bayes method implemented in the updog R package (Gerard et al., 2018). To this end, we first extracted the read counts from the VCF file and used the default “norm” prior density described in the updog R package, as discussed by Ferrão et al. (2020). Quality control excluded SNPs with a high proportion of missing data (call rate below 50%) or low variability (minor allele frequency <1%). After filtering, 38,909 markers remained and were used for subsequent analyses.
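The two quality filters can be sketched as follows. This is an illustrative Python example, not the authors' pipeline: it assumes a dosage matrix with individuals in rows, SNPs in columns, tetraploid allele counts from 0 to 4, and `np.nan` for missing calls. The function name and the allele-frequency estimate (mean dosage divided by ploidy) are assumptions.

```python
import numpy as np


def filter_snps(dosage, ploidy=4, max_missing=0.5, min_maf=0.01):
    """Return a boolean mask of SNP columns that pass both filters.

    dosage: (n_individuals, n_snps) array of allele counts in 0..ploidy,
            with np.nan marking missing calls.
    """
    dosage = np.asarray(dosage, dtype=float)
    missing_rate = np.isnan(dosage).mean(axis=0)
    called = (~np.isnan(dosage)).sum(axis=0)
    allele_count = np.nansum(dosage, axis=0)
    # Allele frequency among called genotypes; NaN when nothing was called.
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = allele_count / (called * ploidy)
        maf = np.minimum(freq, 1.0 - freq)
        passes_maf = maf >= min_maf
    return (missing_rate <= max_missing) & passes_maf
```

Applying the mask as `dosage[:, filter_snps(dosage)]` keeps only markers with at most 50% missing data and a minor allele frequency of at least 1%.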
Prediction models
For prediction, we tested four different statistical approaches. To this end, we considered NIRS and SNP information either as predictors or as kernel matrices to predict the genetic merit. Here we present the full linear model used for prediction and describe how the terms were modified to accommodate (i) different assumptions about the predictor effects and (ii) information from multiple sources, including phenomics and genomics:
Regarding the NIRS and SNP information in each regression model, different priors were assumed for the vectors of wavelength and marker effects. Writing $\beta_j$ for the effect of the $j$th predictor (wavelength or marker), the distribution of the regression parameter was modeled as follows:
Bayesian ridge regression: $\beta_j$ follows a normal distribution, described as $\beta_j \sim N(0, \sigma^2_{\beta})$, where $\sigma^2_{\beta}$ is a common variance associated with each marker effect (or wavelength effect).
BayesA: In this approach, $\beta_j$ also follows a normal distribution, described as $\beta_j \sim N(0, \sigma^2_{\beta_j})$, where $\sigma^2_{\beta_j}$ represents the variance associated with each marker effect (or wavelength effect).
BayesB: $\beta_j$ was modeled as a mixture distribution, $\beta_j \sim \pi N(0, \sigma^2_{\beta_j}) + (1 - \pi)\,\mathrm{DG}(0)$, where $\pi$ represents the proportion of non-null effects and follows a beta distribution, and DG(0) is a degenerate distribution centered at zero.
Bayesian LASSO (BL): In this approach, $\beta_j$ follows a double-exponential distribution, which can be expressed as $\beta_j \sim \mathrm{DE}(0, \lambda)$, where $\lambda^2$ follows a gamma distribution.
The variance terms $\sigma^2_{\beta}$ and $\sigma^2_{\beta_j}$ follow a scaled-inverse chi-square density.
In a mixed model context, phenomic and genomic approaches were formulated by incorporating different relationship matrices. The assumptions and details for each approach are outlined in Table 1. All models were implemented within a Bayesian framework using the BGLR R package (Pérez & de Los Campos, 2014). A total of 120,000 iterations were used, including a burn-in period of 20,000 iterations and thinning every five iterations. The Markov Chain Monte Carlo samples were used to estimate the posterior mean of the breeding values.
TABLE 1 Phenomic-genomic prediction approaches and their assumptions for the parameters of each model (1).
Approaches | Assumption |
NIRS-BLUP model | , and |
Genomic model | , and |
NIRS-Bayesian model | , and |
Genomic + NIRS model | and |
Pedigree + NIRS model | and |
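To illustrate how spectra can enter a mixed model as a relationship matrix, the sketch below builds a linear kernel from pretreated spectra, by analogy with a marker-based genomic relationship matrix. The construction $K = WW'/p$, with $W$ the column-standardized spectra and $p$ the number of wavelengths, is a common choice in the PS literature and is an assumption here; it is not confirmed as the exact form used for the NIRS-BLUP model. The function name is hypothetical.

```python
import numpy as np


def nirs_relationship_matrix(spectra):
    """Linear kernel K = W W' / p from column-standardized spectra W.

    spectra: (n_individuals, n_wavelengths) array, one mean spectrum per
             individual (technical replicates averaged beforehand).
    """
    spectra = np.asarray(spectra, dtype=float)
    centered = spectra - spectra.mean(axis=0)
    scale = spectra.std(axis=0)
    keep = scale > 0  # drop wavelengths without variation
    w = centered[:, keep] / scale[keep]
    return w @ w.T / w.shape[1]
```

The resulting matrix is symmetric with an average diagonal of 1, so it can be supplied to mixed-model software in the same way as a pedigree or genomic relationship matrix.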
In addition to the Bayesian models, we also explored alternative approaches, including both traditional statistical methods and machine learning techniques. These approaches included partial least squares regression (PLSR), which treats all effects within the model as fixed, employs latent variables, and uses the ordinary least squares method to estimate these effects. We also explored machine learning models such as random forest (RF), a method based on tree algorithms, and support vector machine radial (SVMR), a supervised learning algorithm that aims to find a regression hyperplane to minimize the distance from the sample point farthest from the hyperplane using a radial basis function as the kernel function. Specifically, RF was implemented using the randomForest R package, PLSR with the pls R package (Mevik & Wehrens, 2007), and SVMR using the e1071 R package.
Cross-validation scheme
The predictive performance of each model was assessed using a 10-fold cross-validation scheme, repeated five times. Further cross-validation involved training the model using one population to predict the other population. We also tested the effect of training population size on predictive ability, by training each model with a random subset of 50%, 60%, 70%, 80%, and 90% of samples from the calibration population, while the remaining individuals were used for testing. Predictive abilities were computed as the Pearson's correlation between observed and predicted values.
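A minimal sketch of the repeated 10-fold scheme is given below. It is illustrative only: ridge regression with an arbitrary penalty stands in for the Bayesian and machine learning models of the study, and computing the Pearson correlation within each test fold (rather than over pooled predictions) is an assumption. All function names are hypothetical.

```python
import numpy as np


def ridge_fit_predict(x_train, y_train, x_test, lam=1.0):
    """Ridge regression with an unpenalized intercept (stand-in model)."""
    x_mean, y_mean = x_train.mean(axis=0), y_train.mean()
    xc = x_train - x_mean
    gram = xc.T @ xc + lam * np.eye(xc.shape[1])
    beta = np.linalg.solve(gram, xc.T @ (y_train - y_mean))
    return y_mean + (x_test - x_mean) @ beta


def repeated_kfold_ability(x, y, k=10, repeats=5, lam=1.0, seed=0):
    """Pearson correlation between observed and predicted values, per fold."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    abilities = []
    for _ in range(repeats):
        # New random partition of the individuals for every repetition.
        folds = np.array_split(rng.permutation(len(y)), k)
        for test_idx in folds:
            train = np.ones(len(y), dtype=bool)
            train[test_idx] = False
            pred = ridge_fit_predict(x[train], y[train], x[test_idx], lam)
            abilities.append(np.corrcoef(y[test_idx], pred)[0, 1])
    return np.array(abilities)
```

With `k=10` and `repeats=5`, the function returns 50 correlations whose mean summarizes the predictive ability of the model for one trait.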
Estimate of genetic gain
To further compare GS and PS, we computed the genetic gains using the Breeder's equation, defined as $\Delta G = (i \, r_{xg} \, h^2 \, \sigma_p)/L$, where $i$ is the selection intensity; $r_{xg}$ is the predictive ability computed using cross-validation; $h^2$ is the narrow-sense heritability estimated using the historical database from the UFBBP, which contains ∼6000 data points collected over seven seasons in five locations across Florida (de Bem Oliveira et al., 2020; Ferrão et al., 2021); $\sigma_p$ is the phenotypic standard deviation; and $L$ is the cycle length. The selection intensity $i$ was assumed constant for both GS and PS. Additionally, the cycle length $L$ for PS was defined based on feasibility, that is, the time taken to collect spectra data from fruit and leaf tissues. All source codes and phenotypic and genomic data sets are available as Supporting Information Data S1.
RESULTS
NIRS characterization
For the NIRS data collection, three blueberry tissues (leaves, fruits, and juice) were sampled for comparison. Throughout this study, the discussion focuses only on the results obtained from leaves and fruits. Overall, NIRS collected from juice samples resulted in poor predictive performance (Table S1). In addition, preparing the blueberry juice required extra laboratory work and a lengthy centrifugation step to separate the liquid from the solid phase, which ultimately made the process far less efficient for practical implementation with large numbers of samples.
For fruits and leaves, the wavelength values were first corrected for the season and location effects using the check information across the years. Using these pre-corrected values, the NIRS profiles of both tissues exhibited similar patterns, but at different levels of absorbance (Figure 1a). Larger absorbance values were observed for NIRS spectra collected from the fruits. Also, between 1300 and 1676 nm, greater variability was observed in fruits than in leaves. The resulting Pearson's correlations among wavelengths also differed across tissues (Figure 1b). For fruits, small blocks of large correlation values were observed across the entire spectrum, while in the leaves this pattern was sparser. These differences ultimately resulted in different variance projections (Figure 1c) and principal component analysis (PCA) results (Figure 1d). It is important to acknowledge that previous studies have reported variability in reflectance and absorbance values from different genotypes and tissues, particularly within the wavelength range of 1700–2500 nm, which was not covered in this study (Brault et al., 2022; Rincent et al., 2018; Zhu et al., 2022).
[IMAGE OMITTED. SEE PDF]
Regarding the analysis of variance projection, we found that the environment contributed a large proportion of the variance observed for fruit tissues, while for leaves, much of the variance was attributed to the residual. Notably, in both cases, we could see some peaks with high genetic variance (Figure 1c). For the fruit tissue, this peak was found around 1500 nm, while for leaves we noticed a peak at the lower end of the spectrum, around 900 nm. The PCA results for both tissues did not show any evidence of population stratification. This is expected since strong population structure has not been reported in the UFBBP populations using genomic and pedigree information. Importantly, for leaves, the first two principal components explained more than 85% of the variance (Figure 1d).
Phenomic prediction: Model comparison and predictive ability across biological tissues
For phenomic prediction, a total of 123 wavelengths, from two different biological tissues (fruits and leaves) were used to predict fruit quality traits. Before carrying out prediction across locations, we tested different statistical methods in a cross-validation scheme, in which we combined information from both populations after correcting for environmental effects and predicted the empirical BLUEs.
Overall, varying levels of predictive ability were observed across models and tissues. Greater predictive abilities were obtained with NIR spectra collected from fruit tissues (Figure 2a). Averaged over all traits and models, predictive abilities of 0.382 and 0.117 were observed for fruits and leaves, respectively. For fruits, Brix showed the highest predictive ability, reaching 0.70 with the RF method. For leaves, the highest value was obtained for the Brix:TTA ratio, with a predictive ability of 0.19 using the BayesB approach. When comparing models, BayesB, NIRS-best linear unbiased prediction (BLUP), and RF demonstrated the best predictive performance across traits and tissues.
[IMAGE OMITTED. SEE PDF]
Next, the BayesB method was employed to predict fruit traits within each population using both tissues (Figure 2b). In general, similar predictive abilities were observed in both populations (WA-2016 and WA-2019). As presented previously, predicting Brix via NIRS using fruit tissue showed high predictive abilities, while the use of leaf material resulted in lower predictive performance. To further investigate the role of each wavelength in the phenotypic variation of fruit quality traits, we computed the Pearson's correlation between each trait and each wavelength across the entire NIRS spectrum (Figure 2c). In general, slightly different profiles were observed across both tissues. For Brix, the correlation peaks close to 1100 nm, indicating that a single NIRS band is positively correlated (∼0.3) with soluble solids in blueberry. Large correlation values (negative and positive) were also identified within the same range of wavelengths for firmness.
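The wavelength-by-wavelength correlation profile can be computed in a few lines. The Python sketch below assumes a matrix of pretreated spectra with one row per genotype and a vector of trait BLUEs in the same order; the function name is hypothetical.

```python
import numpy as np


def wavelength_trait_correlations(spectra, trait):
    """Pearson correlation between a trait and each wavelength column."""
    spectra = np.asarray(spectra, dtype=float)
    trait = np.asarray(trait, dtype=float)
    xc = spectra - spectra.mean(axis=0)
    yc = trait - trait.mean()
    denom = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return xc.T @ yc / denom
```

Plotting the returned vector against the wavelength grid gives a profile of the kind shown in Figure 2c, and `np.argmax` on its absolute value locates the most informative band.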
Multi-omic prediction: Phenomic versus genomic selection
A central goal of this study was to compare phenomic and genomic predictions. To achieve this, NIRS and genomic information were combined in the BayesB model for predictions by tissue type in both populations (Figure 3a). This was complemented by prediction across the populations using only the fruit tissue (Figure 3b). Using a cross-validation scheme, we compared the predictive ability of four models: a phenomic model (NIRS), a genomic model (G), a model combining pedigree and phenomic information (A+NIRS), and a model combining genomic and phenomic information (G+NIRS).
[IMAGE OMITTED. SEE PDF]
In fruits, the A+NIRS model outperformed the G model for all traits (Figure 3a). When genomic and NIRS information were incorporated into the same model (G+NIRS), it yielded the best results across all traits. For leaves, the predictive ability of the NIRS-only model was low across traits, and the A+NIRS model was comparable to the G model only for Brix:TTA and TTA. Combining genomic and NIRS information in the G+NIRS model did not improve predictive ability, which highlights the importance of GS compared to phenomic prediction when evaluations are performed at the leaf level (Figure 3a).
The performance of across-population prediction has important implications for breeding decisions, as it enables the identification of potential biases toward environmental conditions associated with phenomic predictions. To evaluate the potential of PS in a blueberry breeding program, prediction models were developed using data from one population to predict traits in another. Given the unstable results obtained for NIRS prediction using leaf tissues, we report across-population predictions using only the fruit tissue (Figure 3b). Thus, models trained with the WA-2019 population were used to predict traits in the WA-2016 population, and vice versa. The findings showed that phenomic prediction outperformed genomic prediction for traits such as waxy bloom, Brix, and the Brix:TTA ratio. Moreover, incorporating pedigree information enhanced the predictive ability for Brix:TTA and TTA, surpassing the results of genomic prediction alone. When models were trained on the WA-2016 population to predict the WA-2019 population, phenomic predictions were on par with or better than genomic predictions for Brix, Brix:TTA, and size. Similar results were obtained for Brix and Brix:TTA, but not for size, when the WA-2019 population was used to predict the WA-2016 population. Additionally, the combined G+NIRS model showed increased predictive abilities for firmness and TTA.
Optimizing phenomic selection
The availability of resources significantly influences the population size used for calibrating predictive models, which, in turn, affects both the cost of breeding and the accuracy of predictions. Considering this, we evaluated the impact of the size of our training set for PS. To this end, we focused our comparison on the two biological tissues and their prediction ability measured via BayesB (Figure 4). A proportion of individuals ranging from 50% to 90% of total samples were randomly sampled to constitute the training population. We repeated this process 100 times and computed the predictive abilities across each fruit quality trait. As expected, we found predictive ability continually improved as the size of the training set increased for all traits and statistical methods. Nonetheless, the increase in predictive ability began to plateau at a 70% training set size in most scenarios.
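The resampling procedure described above can be sketched as follows. This is an illustrative Python example under stated assumptions: ordinary least squares is used as a placeholder for BayesB, and `fit_predict` is a hypothetical hook for plugging in any other model with the same signature.

```python
import numpy as np


def least_squares_predict(x_train, y_train, x_test):
    """Placeholder model: ordinary least squares with an intercept."""
    design = np.column_stack([np.ones(len(x_train)), x_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return np.column_stack([np.ones(len(x_test)), x_test]) @ coef


def training_size_curve(x, y, fractions=(0.5, 0.6, 0.7, 0.8, 0.9),
                        repeats=100, fit_predict=least_squares_predict,
                        seed=0):
    """Mean predictive ability for each training-set fraction."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    curve = {}
    for frac in fractions:
        n_train = int(round(frac * n))
        abilities = []
        for _ in range(repeats):
            # Random split: first n_train individuals train, the rest test.
            order = rng.permutation(n)
            train, test = order[:n_train], order[n_train:]
            pred = fit_predict(x[train], y[train], x[test])
            abilities.append(np.corrcoef(y[test], pred)[0, 1])
        curve[frac] = float(np.mean(abilities))
    return curve
```

The returned dictionary maps each training fraction to its average predictive ability, which is the quantity one would plot to look for a plateau.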
[IMAGE OMITTED. SEE PDF]
Genetic gains per cycle
In Table 2, we present the narrow-sense heritability and predictive accuracies using GS and PS models per trait. The narrow-sense heritability (h2) was estimated using the historical database from the UFBBP made up of ∼6000 data points collected over seven seasons in five locations across Florida. Overall, all traits showed low to moderate heritability values (0.27–0.48). As previously mentioned, predictive abilities varied by plant tissues (leaf or fruit) and across selection methods (PS or GS).
TABLE 2 Genetic gains computed for firmness, size, Brix, and total titratable acidity (TTA) using the Breeder's equation.
Trait | h2 | Predictive accuracy^a: PS_leaf | Predictive accuracy^a: PS_fruit | Predictive accuracy^a: GS | Genetic gain: leaf (%) (1 year) | Genetic gain: fruit (%) (2 years)^a | Genetic gain: fruit (%) (3 years)^a | Genetic gain: genomics (%) (1 year) |
Firmness | 0.40 | 0.21 | 0.33 | 0.30 | 6.3 | 4.9 | 3.3 | 9 |
Size | 0.27 | 0.21 | 0.44 | 0.24 | 4.3 | 4.5 | 3 | 4.9 |
Brix | 0.33 | 0.20 | 0.66 | 0.28 | 4.4 | 7.4 | 4.9 | 6.2 |
TTA | 0.48 | 0.36 | 0.42 | 0.37 | 30.2 | 17.6 | 11.7 | 31 |
The expected genetic gains estimated for each trait and model combination are also shown in Table 2. For simplicity, we assumed constant selection intensity and expressed the genetic gains as a percentage of the population average per trait. For the length of the breeding cycle, we considered the number of years required to make selections. Thus, for the genomic models and for the phenomic models fit using leaf tissue, we assumed that selections could be made during the seedling stage within 1 year after germination (L = 1). For fruits, we estimated that seedlings take 2 (L = 2) to 3 (L = 3) years after germination to produce the necessary amount of mature fruits to apply PS. Using PS, the largest gains were observed for TTA, with data collected from leaf tissues. The largest difference between tissues was observed for Brix, with a larger gain observed for phenomic models trained using the fruits. On average, GS resulted in larger genetic gains, because it combines high predictive performance with a short time to advance generations. We did not include the G+NIRS comparison, because the inclusion of the NIRS kernel in the GBLUP model did not significantly improve the predictive abilities (Figure 3a).
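The trade-off between predictive ability and cycle length can be made concrete with a short calculation. The sketch below assumes only that the expected gain per year is proportional to predictive ability divided by cycle length, with selection intensity, heritability, and phenotypic standard deviation shared by the methods being compared, so those terms cancel in the ratio. The function name is hypothetical and the inputs are the Brix values from Table 2.

```python
def relative_gain(r_a, cycle_a, r_b, cycle_b):
    """Ratio of expected gains per year, method A over method B.

    Assumes gain per year is proportional to predictive ability divided by
    cycle length, with every other factor shared by the two methods.
    """
    return (r_a / cycle_a) / (r_b / cycle_b)


# Brix, predictive abilities taken from Table 2.
ps_fruit_vs_gs = relative_gain(0.66, 2, 0.28, 1)  # fruit PS (L = 2) vs GS (L = 1)
ps_leaf_vs_gs = relative_gain(0.20, 1, 0.28, 1)   # leaf PS (L = 1) vs GS (L = 1)
```

Under these assumptions, fruit-based PS with a 2-year cycle gives about 1.18 times the GS gain for Brix and leaf-based PS about 0.71 times, in line with the 7.4%, 4.4%, and 6.2% reported in Table 2.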
DISCUSSION
GS is a popular tool to predict the genetic merit of non-phenotyped individuals and has gained wide relevance in modern breeding programs. Both real and simulation studies have demonstrated dramatic improvements in selection efficiency through increasing selection accuracy and reducing the length of the breeding cycle, which are both important parameters to maximize genetic gains (Adunola et al., 2024; Atanda et al., 2021; Gaynor et al., 2017; Lenz et al., 2020). Despite this, the implementation of GS requires extensive resource allocation and logistical effort. For example, the cost of genotyping remains high, particularly in large breeding populations that rely on many seedlings in the early stages. The development of an integrative breeding plan that balances cost and accuracy remains a barrier to implementation, driving the need for alternative solutions.
In this study, we investigated the use of NIRS for predicting the genetic merit of complex fruit quality traits. In multiple crops, PS and GS have shown comparable results (Adunola, Tavarez, et al., 2024; Jackson et al., 2023; Rincent et al., 2018; Zhu et al., 2022). Here, we determined the ability of PS using NIRS to predict six fruit quality traits in blueberry, which presents opportunities for the adoption of NIRS in other fruit breeding programs.
Diverse statistical methods are consistent for phenomic prediction
To test the relevance of PS, this study employed NIRS covering the near-infrared wavelength range from 906 to 1676 nm to predict the phenotypic merit of individuals for six fruit quality traits in blueberry. Typically, spectral data can range from a few hundred to thousands of data points, depending on the equipment used. In this study, the device measured 123 wavelengths, considerably fewer than the 465 to 1165 wavelengths retained after quality control by Rincent et al. (2018), Zhu et al. (2021, 2022), and Brault et al. (2022). This difference can impact the predictive ability, although Zhu et al. (2021) reported that retaining a smaller number of wavelengths with progressively lower correlations to the trait resulted in only a modest reduction in phenomic predictive ability in soybean. The fundamental issue is the trade-off between including more wavelengths (and parameters) and the resulting impact on labor and predictive accuracy. Here, we purposefully chose a portable device to measure NIR spectra, at the cost of exploring a reduced wavelength range. However, this choice increased our ability to evaluate multiple plants in a short period of time. While future studies testing a larger number of wavelengths to predict fruit quality traits are perfectly justifiable, we argue that the labor required for practical implementation must also be considered.
A key focus of this research was to explore the application of statistical and machine learning algorithms capable of handling the high dimensionality inherent to spectral data while simultaneously capturing the genetic variability associated with each trait. This approach aims to optimize predictive accuracy and minimize bias. PS follows the same principles as GS, in which fitting models with different statistical assumptions can lead to better predictive performance, depending on the trait of study. After testing eight statistical models, most approaches displayed comparable predictive performance across the traits. Overall, results obtained via BayesB, RF, and mixed models (NIRS-BLUP) showed the most stable results across the different traits and tissues. Similar results have also been reported in other crops, suggesting that these methods could be used for practical implementation (Brault et al., 2022; Gonçalves et al., 2021; Zhu et al., 2022). Therefore, the BayesB model was selected for subsequent analysis.
Biological tissue impacted predictive ability and genetic gains
Our next objective was to determine the ideal blueberry tissue type for collecting NIRS spectra. We considered three biological tissues: fruits, leaves, and juice. From a practical standpoint, leaves are the easiest tissue to sample. They can be collected multiple times throughout the year, assuming that NIRS spectra remain consistent across different plant phenological stages. Another advantage is that there is no need to wait for the plants to produce fruits, so selection can be made at the seedling stage. Fruits, in contrast, are more directly linked to the final phenotype of interest but can only be collected during the fruiting season of mature plants. Juice, on the other hand, is a byproduct of standard quality assessments for traits such as Brix and acidity. After comparing the predictive performance of these tissues, we excluded juice samples from this paper for two main reasons. First, collecting NIRS from the juice resulted in low and inconsistent predictive abilities for all fruit quality traits. Second, even as a byproduct of laboratory analyses, juice requires extra sample preparation, such as centrifugation to separate the liquid and solid phases before analysis. Simply stated, collecting NIRS from juice is not feasible when many samples must be processed.
The analysis of the NIRS wavelength spectrum for leaves and fruits revealed that certain wavelengths captured moderate to high genetic variability, which is consistent with previous research that sampled various tissues for such predictions (Brault et al., 2022; Rincent et al., 2018). Notably, the use of fruits resulted in higher predictive abilities than leaf tissues. This result is likely due to fruit tissue being closer to the metabolic state of the target phenotype.
It is noteworthy that predictions based on fruits and leaves require data collection from two different biological materials, which are sampled at different points throughout the season. For example, with leaves, PS could be performed in greenhouses at the seedling stage, just a few months after crosses are made, provided that spectral information does not change significantly as plants mature. This is comparable to GS, where DNA is also collected from the leaves of seedlings. In sharp contrast, the use of fruits requires plants to be maintained in the field until fruits can be harvested. For blueberry, this can take at least 1 year after planting and, in some cases, up to 2 or 3 years. For crops with a longer juvenility period, this barrier can preclude the use of fruits as material for PS via NIRS.
Balancing predictive accuracy and cycle length: Implications for projected genetic gains
To answer our last question regarding the balance between predictive accuracy and cycle length, we projected the genetic gains by applying the Breeder's equation. The predictive abilities revealed that genomics and phenomics presented comparable results, particularly for NIRS collected at the fruit level. This pattern has been observed in several studies for multiple crops and traits. In maize, for example, phenomic prediction using seed kernel NIRS outperformed genomic predictions for traits like grain dry matter, grain yield, and phosphorus concentration (Weiß et al., 2022). Similarly, phenomic predictions using dry sugarcane stalks were more accurate than genomic predictions for feedstock quality traits in sugarcane (Gonçalves et al., 2021).
For the breeding cycle, we considered that NIRS collected at the fruit stage requires a cycle two to three times longer than NIRS collected at the leaf level. When framed in the context of genetic progress, this resulted in lower gains per generation using fruits than using leaves. In fact, GS yielded the greatest projected gains because it combines higher accuracy with the possibility of early selection. The trade-off is cost, which complicates its practical application when a large number of samples is targeted.
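To make the trade-off between accuracy and cycle length concrete, the sketch below applies the breeder's equation per unit time, ΔG = i · r · σ_a / L, where i is the selection intensity, r the selection accuracy, σ_a the additive genetic standard deviation, and L the cycle length in years. All numerical inputs and scenario labels are hypothetical values chosen for illustration; they are not the estimates obtained in this study, and the ranking they produce depends entirely on those assumed values.

```python
def gain_per_year(intensity, accuracy, sigma_a, cycle_years):
    """Breeder's equation per unit time: i * r * sigma_a / L."""
    if cycle_years <= 0:
        raise ValueError("cycle_years must be positive")
    return intensity * accuracy * sigma_a / cycle_years


# Hypothetical inputs, for illustration only (not estimates from this study).
INTENSITY = 1.755  # standardized selection differential when ~10% are selected
SIGMA_A = 1.0      # additive genetic standard deviation, in trait units

# scenario name -> (assumed accuracy, assumed cycle length in years)
scenarios = {
    "GS, seedling leaf DNA": (0.50, 1.0),
    "PS, leaf NIRS": (0.30, 1.0),
    "PS, fruit NIRS": (0.45, 3.0),
}

gains = {
    name: gain_per_year(INTENSITY, accuracy, SIGMA_A, years)
    for name, (accuracy, years) in scenarios.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:.3f} trait units per year")
```

With these assumed inputs, the higher accuracy of fruit spectra does not compensate for a cycle that is three times longer, which is the qualitative pattern described above.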
In agreement with other studies, we found that fruit tissue was the best material for the prediction of soluble solids (Brix) (Basile et al., 2020; Ino et al., 2023). Promising results were also observed in the prediction of firmness and TTA. However, estimates of genetic gain provided more support for PS using leaf tissue, owing to the shorter breeding cycle. Given the possibility that NIRS can be collected from leaves during the seedling stage, more crosses (and siblings per cross) can be derived and screened before moving elite material to the field, thereby resulting in greater selection intensity and, subsequently, higher genetic gains.
Future of PS in blueberry: Perspective and challenges
Our last objective was to project the use of PS at the UFBBP. Briefly, the breeding program performs ∼200 crosses every year, using parents selected among cultivars, elite material, and wild germplasm. From these crosses, 20,000 seedlings are planted in non-replicated high-density nurseries. After 1 year, the best 10% of plants are visually selected, and the resulting 2000 plants are genotyped. At this point, breeding decisions are divided in two main directions: (i) population improvement, where the best plants are included in a mating allocation design to be selected as parents for a new recombination cycle, and (ii) product development, in which the best plants are clonally selected and evaluated across different regions for cultivar development. These steps are discussed in detail by Ferrão et al. (2021).
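The selection funnel described above can be translated into a selection intensity, the term that enters the breeder's equation. The sketch below assumes truncation selection on a normally distributed criterion in a large population, which is a simplification of visual selection on several traits; the function name is hypothetical.

```python
from statistics import NormalDist


def selection_intensity(proportion_selected):
    """Standardized selection differential under truncation selection.

    Assumes a normally distributed selection criterion and a large
    population: i = pdf(z) / p, where z is the truncation point that
    leaves a fraction p of the population above it.
    """
    if not 0 < proportion_selected < 1:
        raise ValueError("proportion_selected must be between 0 and 1")
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - proportion_selected)
    return std_normal.pdf(threshold) / proportion_selected


seedlings = 20_000                  # seedlings planted per year (from the text)
selected = round(seedlings * 0.10)  # best 10% retained after 1 year
intensity_10 = selection_intensity(selected / seedlings)
intensity_05 = selection_intensity(0.05)  # a stricter, hypothetical cut-off

print(f"{selected} plants selected, intensity = {intensity_10:.3f}")
print(f"selecting 5% instead, intensity = {intensity_05:.3f}")
```

Under these assumptions, retaining 10% of the seedlings corresponds to an intensity of about 1.75, and any tool that allows more seedlings to be screened for the same number of retained plants raises this value.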
Within this framework, we do not argue that PS should replace GS. First, it is still unclear how well NIRS information can predict additive genetic effects and therefore guide parental selection. Thus, we emphasize that adopting PS for long-term population improvement might be risky. Nevertheless, our results open opportunities to investigate the long-term impact of integrating PS as a breeding tool. Second, several studies have highlighted potential issues related to overfitting of PS models, because there is a significant risk of model bias when phenomic information from certain tissues is used to predict traits. For example, Brault et al. (2022) used wood and leaf NIRS for prediction in grapevine and described variable results across populations and traits. Similarly, in wheat, Rincent et al. (2018) reported high variability in phenomic prediction depending on the trait and tissue measured. Recent findings by Dallinger et al. (2023) demonstrated that phenomic predictions can outperform genomic predictions for certain wheat traits, though the predictions tend to be biased toward the information captured in the predictor data.
An important value of PS in breeding programs would be its application for selection across populations and environments. This potential was partially addressed in this study by training our models in one population to predict a new population. We found reasonable predictive abilities even though each breeding population was evaluated in a different year and the plants were in different phenological stages during NIRS collection. This result is comparable to the findings of DeSalvio et al. (2024), who reported that models containing phenomic information performed comparably to genomic models for predicting new genotypes (CV2) and new genotypes in new environments (CV00) for maize grain yield and kernel weight. However, further validation studies are required, including a more diverse set of environments and populations.
To address some of the questions highlighted in this section, a true validation study is required to confirm whether leaf NIRS from seedlings is predictive and whether leaf spectra change across the maturity stages of the plant. This is relevant because, in this study, spectral data were collected from mature leaves, whereas the true advantage of leaf NIRS lies in the possibility of implementing PS at the seedling stage. Validating the stability of spectral information across maturity stages would also provide insight into the optimal timing for leaf spectra collection to maximize phenomic prediction. Additionally, while a hand-held NIR device was used to collect leaf spectral data in this study, unmanned aerial vehicles could also be employed for phenotyping. Drones can collect hyperspectral images covering ultraviolet, visible, and infrared wavebands in high throughput. Consequently, repeated leaf NIR spectral information could be collected uniformly and quickly.
Therefore, considering the aforementioned challenges, we project the greatest long-term gains when GS and PS are integrated into the same breeding pipeline. Here, we demonstrated that multi-omic predictions can increase the predictive ability for all traits and therefore should be adopted, particularly if both types of data are collected from the same materials. However, in cases where GS cannot be justified for screening thousands of seedlings, we demonstrate the viability of PS as a cost-effective alternative to support early breeding activities. In the case of blueberry, PS via NIRS could potentially be applied to leaf tissue in the greenhouse while the plants are still seedlings. Provided that spectra remain stable across plant maturity, this would allow fewer, more elite genotypes to be planted in the field.
CONCLUSIONS
Altogether, we have demonstrated that genetic improvement in blueberry can be accelerated using a combination of phenomic and genomic prediction. Our contributions are fourfold: (i) we draw attention to the balance between predictive ability and breeding cycle length in the form of genetic gains and show that NIR spectra collected from blueberry leaves can result in high gains; (ii) we emphasize the importance of using phenomic and genomic prediction as an integrated framework to guide breeders’ decisions in modern breeding programs; (iii) for PS optimization, we tested and discussed the use of multiple statistical models for predicting fruit quality traits and showed that BayesB, RF, and BLUP-based models yielded good results; and finally (iv) positive prediction across environments indicates minimal environmental bias in blueberry. However, strong validation of seedling-stage prediction is required before integrating PS with NIRS in a breeding program for trait improvement. Overall, the results reported here for blueberry are promising, and we hope they encourage the use of multi-omic tools to support fruit breeding programs, where higher genetic gains would accelerate the timely release of better cultivars to the market.
AUTHOR CONTRIBUTIONS
Paul Adunola: Data curation; formal analysis; methodology; writing—original draft; writing—review and editing. Estefania Tavares Flores: Data curation. Camila Azevedo: Formal analysis; writing—original draft. Gonzalo Casorzo: Formal analysis; investigation; methodology. Lushan Ghimire: Data curation. Luis Felipe V. Ferrão: Conceptualization; data curation; formal analysis; investigation; methodology; supervision; visualization; writing—original draft; writing—review and editing. Patricio R. Munoz: Conceptualization; funding acquisition; project administration; resources; supervision; writing—review and editing.
ACKNOWLEDGMENT
This work was supported by the University of Florida royalty fund procured through licensing of blueberry cultivars from the Blueberry Breeding and Genomics Lab. Many thanks to Juliana Cromie for reviewing and providing valuable suggestions for the final manuscript.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
DATA AVAILABILITY STATEMENT
Data sets and codes used to fit the phenomic and genomic selection models for blueberry fruit quality traits are available in the supplemental material.
Adunola, P., Ferrao, L. F. V., Azevedo, C., & Munoz, P. R. (2024). Genomic selection optimization in blueberry: Data-driven methods for marker and training population design. The Plant Genome, 17, e20488. https://dx.doi.org/10.1002/tpg2.20488
© 2024. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
The integration of high‐throughput technologies such as near‐infrared spectroscopy (NIRS) for phenomic‐assisted selection in plant breeding has gained relevance in recent years. In blueberry, the use of phenomic selection could enable selection in the early stages, where thousands of seedlings are visually selected, and the use of genomic selection (GS) is cost‐prohibitive. In this study, we compared phenomic and GS in 372 genotypes, which were phenotyped for multiple fruit quality traits across 2 years. Our contribution is fourfold: (i) phenomic and GS methods have comparable predictive performances for multiple traits; (ii) leaves can achieve the highest genetic gains in the long term among NIRS of different biological tissues (leaf and fruit); (iii) BayesB, mixed models, and random forest resulted in the best predictive results across traits for optimizing phenomic prediction; and finally (iv) attention was drawn to the possibility of using phenomic prediction across environments. Altogether, for the first time in the blueberry literature, the utility of NIRS for phenomic‐assisted selection is demonstrated. While the primary focus is on blueberries, this approach can be evaluated in other fruit trees.
1 Blueberry Breeding and Genomics Lab, Horticultural Sciences Department, University of Florida, Gainesville, Florida, USA
2 Blueberry Breeding and Genomics Lab, Horticultural Sciences Department, University of Florida, Gainesville, Florida, USA, Statistics Department, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil