Genotype-Driven Phenotype Prediction in Onion

1. Introduction

Onion (Allium cepa L.) is a highly valued vegetable known for its culinary and medicinal benefits. Onion bulbs contain compounds such as polyols, flavonoids (quercetin and anthocyanins), vitamins, sulfur compounds (like allicin), and essential minerals (potassium and selenium) [1]. These compounds offer health benefits, including anti-inflammatory, anti-cholesterol, anti-cancer, and antioxidant properties, which help in scavenging free radicals and preventing oxidative stress-related degenerative diseases [2,3]. Given these significant health benefits and the increasing global demand, it is essential to develop superior onion varieties through effective breeding programs. However, onion breeding currently faces several important challenges. Onions possess a large genome of approximately 16 gigabases (Gb), which makes genetic analysis and gene editing particularly challenging [4]. Additionally, their biennial growth cycle extends the duration of breeding programs, delaying the development of new varieties [5,6]. As a self-incompatible crop, onions cannot self-pollinate, complicating the selection and breeding of desired traits [7]. Moreover, onions are highly susceptible to various diseases, including fungal, viral, and bacterial infections, making the development of disease-resistant varieties crucial yet difficult [8]. Onions are also sensitive to environmental stresses such as temperature and humidity fluctuations, complicating the breeding of climate-resilient varieties [9]. This sensitivity means that onion phenotypic traits are significantly influenced by environmental factors, making it challenging to develop stable varieties [9]. Most important horticultural traits in onions are polygenic, involving complex interactions among multiple genes, thus further complicating the breeding process. 
Additionally, key productivity-related organs, such as roots, tubers, and bulbs, develop underground, limiting the ability to collect phenotypic data without destructive methods [10,11]. Furthermore, onion genetic resource collections are preserved in gene banks and research facilities worldwide [12,13,14,15,16,17,18]. These resources are vital for maintaining biodiversity and supporting breeding initiatives for this economically significant crop. They are also crucial for future breeding efforts aimed at the challenges in onion cultivation described above, such as disease resistance, pest tolerance, and enhanced quality characteristics [8].

The continuous progress in sequencing technologies and bioinformatics tools is anticipated to strengthen the significance of GWAS in elucidating the genetic basis of complex onion traits and informing breeding approaches. Researchers have employed various techniques to identify high-quality SNPs in the crop. For example, Lee et al. used ddRAD-seq to identify 1904 SNPs based on onion reference scaffolds [19], while Jo et al. utilized genotyping by sequencing (GBS) to generate 10,091 high-fidelity SNP markers and construct an onion genetic map spanning 1383 centimorgans across eight linkage groups. Baldwin et al. performed a QTL analysis to pinpoint genome regions associated with onion bolting [20]. Their study uncovered QTLs linked to bolting susceptibility on chromosomes 1, 3, and 6 [21]. Sudha et al. conducted a principal component analysis (PCA) to identify significant traits differentiating onion cultivars based on morphological and genetic data [22]. Sekine et al. demonstrated that genomic selection shows promise in onion breeding, with simulations indicating its potential to enhance genetic gain while mitigating inbreeding depression [23]. Baldwin et al. showed that genomic SSR markers obtained through skim sequencing provide a quantitative perspective on population genetic variation in onion, enabling DNA fingerprinting, inbreeding level assessment, and population structure analysis for association mapping [20]. However, machine learning has not been explicitly utilized for genome-wide association studies and related techniques in onion breeding research.

As a critical aspect of sustainable agriculture, breeding involves selecting superior varieties that are well suited to various agri-environmental conditions [24,25]. In particular, onion breeding programs prioritize traits such as high yields, early maturity, disease resistance, storability, bulb size, and weight [8]. Artificial intelligence is increasingly being utilized in breeding systems for crops in order to develop resilient varieties using advanced high-throughput techniques, such as genotyping and phenotyping [26,27]. However, root, tuber, and bulb crops like onion are mostly composed of below-ground tissues that are not easily amenable to high-throughput phenotyping compared to genotyping [10,11]. Moreover, the quantitative methods commonly used in crop genetics are linear and cannot readily associate the collective effect of minor-effect genetic markers with specific traits, whereas non-linear machine learning models can [28,29]. Machine learning models are better suited to identifying patterns in high-dimensional datasets by applying various feature engineering techniques [24,26]. Recently, experts have started to apply genomic selection (GS) and machine learning (ML) to overcome these breeding limitations [26,27]. In particular, the ability to predict phenotypes from genotypes has significantly improved in complex crops such as maize, wheat, and rice. In maize, ML algorithms like Random Forest (RF) and the support vector machine (SVM) have been employed to analyze vast genomic datasets, utilizing tens of thousands of single-nucleotide polymorphisms (SNPs) as markers [4,30]. These markers were pre-selected based on their association with key traits such as yield and disease resistance, thus allowing for accurate phenotype prediction. 
Similarly, in wheat, deep learning models have utilized over 100,000 SNP markers, selected using techniques such as genome-wide association studies (GWASs), to ensure that they capture the significant genetic variance associated with critical phenotypic traits [31]. In rice, it has been suggested that by performing a GWAS on elite lines to identify associations with specific phenotypic traits and subsequently validating GS effectiveness, rice breeding efficiency could be significantly improved as the costs of GWAS genotyping decrease [32]. Additionally, creating a breeding plan through genotype-based simulations before starting general combining ability-based selection has the advantage of preserving more initial genetic variation while saving time and resources [33,34]. These ML-driven approaches and genotype-based crossbreeding simulations not only accelerate breeding cycles but also improve the precision of superior genotype selection, thus contributing to the development of more resilient and productive crop varieties. Here, we aim to enhance the prediction accuracy of key traits for onion breeding by utilizing extensive onion genome data and various ML algorithms. Additionally, we intend to conduct in silico breeding simulations for the next generation of crosses, thereby proposing more efficient and effective breeding strategies.

This pioneering research forecasts phenotype based on genotype, laying the groundwork for future machine learning applications in onion genomics. The development of genomic resources like SNP markers, genetic maps, and QTL analyses provides data for future machine learning-based association studies in onion. Unlike previous studies focused on selecting individual plants using genomic estimated breeding values (GEBVs), our comprehensive method uses parental genotypes to simulate offspring and identify superior breeding combinations. By creating virtual progeny from parental haplotypes, we predicted the performance of potential offspring and ranked parental combinations for desired traits. This method offers a thorough framework for genomic selection and practical insights into optimal crossbreeding strategies, significantly improving breeding methodologies for complex traits such as bulb weight. Our study enhances traditional breeding by incorporating virtual cross simulations to propose optimal breeding combinations for the next generation, surpassing the mere selection of individuals based on GEBVs. Considering economic constraints in horticultural crop breeding, we developed a machine learning-based approach to predict traits using a minimal set of key SNPs, reducing the need for extensive genotyping and making genomic selection more cost-effective for crops like onions. These advancements mark significant progress in breeding strategies for complex traits.

2. Materials and Methods

2.1. Plant Materials

In total, 98 onion (Allium cepa L.) accessions were obtained from MIRACLE Co., Ltd. (Seoul, Republic of Korea). These included the MO100, MO300, MO400, MO700, MO800, and MO900 varieties and their segregating populations, which were obtained by backcross breeding with onion lines from Japan, Spain, China, and Türkiye. Seeds from each onion accession were sown in 406-cell trays. At 50 days old, seedlings were transplanted to the field to grow onion bulbs, which were then selected and used in this study. A total of 98 samples were genotyped using genotyping by sequencing (GBS), as described below, and 260 samples were genotyped using the Fluidigm chip.

2.2. Phenotype Data Collection and Normalization

Phenotype data were collected for three bulb traits: weight, width, and height (Supplementary Figure S9). To capture variation among individuals, each trait was measured 4 to 6 times per line. For normalization, outliers were removed and the remaining measurements were averaged to construct the final analysis dataset (Supplementary Figure S1).

2.3. Dataset for Training and Validation Population

In the training population, we used 98 specimens that were categorized into two groups (large and small) according to their bulb weight. The training population was sequenced using the GBS protocol. For the large group, the bulb weights ranged from 320.6 g to 429.0 g, whereas in the small group the bulb weights ranged from 100.0 g to 241.1 g. Furthermore, the large group was treated as the case group and the small group was treated as the control group for all subsequent statistical analyses. To verify the accuracy of the predictive model, a total of 260 samples not used in training were newly genotyped using the Fluidigm chip.

2.4. GBS Library Preparation and Sequencing

Total genomic DNA was extracted from 98 samples, quantified, and normalized to 20 ng/μL. The DNA was then digested with 8 U of high-fidelity PstI at 37 °C for 2 h and heated to 65 °C for 20 min to inactivate the enzyme. The resulting products were purified using a QIAquick PCR Purification Kit (Qiagen, Hilden, Germany), and the distribution of fragment sizes was analyzed using a BioAnalyzer 2100 instrument (Agilent Technologies, Santa Clara, CA, USA). The GBS libraries were then sequenced on the Illumina NextSeq500 platform (San Diego, CA, USA) with 150 bp single-end reads by egnome, an authorized service provider.

2.5. Variant Calling from GBS Data

The GBS reads generated on the Illumina platform were subjected to quality and adapter trimming with Trimmomatic 0.39 using default parameter settings [35]. The processed reads were mapped to the onion reference genome Sequon v1.2 (www.oniongenome.wur.nl, accessed on 7 November 2024) [36] using BWA v0.7.17 [37], and variant calling was performed with HaplotypeCaller in the Genome Analysis Toolkit (GATK) [38]. SNPs were selected with GATK v4.2.4.1 using the following thresholds: normalized quality score ≥ 2 and mapping quality ≥ 40. SNPs were annotated with SnpEff v.4.3 [39]. Finally, high-quality SNPs meeting the following criteria were selected using vcftools v0.1.16 [40] and PLINK v1.90b6.21 [41]: (1) bi-allelic sites, (2) genotyping rate of the samples at each variable site ≥ 0.9, (3) minor allele frequency (MAF) ≥ 0.03, and (4) Hardy–Weinberg equilibrium (HWE) test p-value ≥ 1 × 10⁻³.
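The site-level filters on genotyping rate and MAF can be illustrated with a minimal Python sketch. It is not the vcftools/PLINK pipeline used in this study: the function names, the toy genotypes, and the 0/1/2 alternate-allele coding with `None` for missing calls are assumptions made for illustration, and the HWE filter is omitted for brevity.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype (None marks a missing call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF of a bi-allelic site coded as alternate-allele counts 0, 1, or 2."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_qc(genotypes, min_call_rate=0.9, min_maf=0.03):
    """Apply the genotyping-rate and MAF thresholds stated in the text."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


# Hypothetical toy data: three sites genotyped in ten samples.
snps = {
    "snp_a": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],        # polymorphic, fully called
    "snp_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],        # monomorphic, MAF = 0
    "snp_c": [0, 1, None, None, 2, 1, 0, 1, 2, 0],  # genotyping rate 0.8
}
kept = [name for name, g in snps.items() if passes_qc(g)]
print(kept)
```

In this toy example only `snp_a` survives: `snp_b` fails the MAF threshold and `snp_c` fails the genotyping-rate threshold.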

2.6. Population Structure Analysis

Population structure was analyzed using Admixture software (v1.3.0) with 51,499 high-quality SNPs [42]. Based on the Bayesian model-based clustering method, default parameters were used with k-values ranging from 2 to 10 [43]. Principal component analysis (PCA) was conducted using the --pca option in PLINK software (v1.90b6.21) to confirm population stratification [44]. Eigenvectors and variances were calculated based on the high-quality SNPs. Identity-by-State (IBS) analysis was conducted using PLINK [45]. IBS values were computed pairwise for all individuals, producing an n × n matrix. The distribution of IBS values within the population was visualized as a histogram generated using R software (v4.1.2).
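To make the pairwise IBS computation concrete, the following Python sketch builds an n × n similarity matrix from genotypes coded as 0/1/2 alternate-allele counts. It is an illustration under stated assumptions, not PLINK's implementation: the function names and toy genotypes are hypothetical, and similarity is taken as the proportion of alleles shared identical by state over sites called in both samples.

```python
def ibs_similarity(g1, g2):
    """Proportion of alleles shared IBS over sites called in both samples."""
    shared, sites = 0, 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip sites with a missing call in either sample
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this site
        sites += 1
    return shared / (2 * sites) if sites else float("nan")


def ibs_matrix(samples):
    """Pairwise n x n IBS similarity matrix."""
    n = len(samples)
    return [[ibs_similarity(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy genotypes: three samples, four SNPs.
samples = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 2],
]
matrix = ibs_matrix(samples)
for row in matrix:
    print(row)
```

The first two samples are identical and therefore share all alleles, whereas the third shares alleles with them only at the heterozygous site.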

2.7. GBLUP Model for Genomic Evaluation

In this study, genomic evaluation was based on genome-wide SNP variation. A simple mixed model was fitted using the “rrBLUP” R package with the quality-controlled phenotypic and genotypic data, and SNP effects were estimated from the resulting breeding values and the genotype data [46].

$y = X b + Z u + e$

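where y is the vector of phenotypic observations; X is the design matrix for the fixed effects, b (line and phenotype measurement date); Z is the design matrix for the random effects, u; and e is the vector of residuals, which are assumed to be normally distributed with constant variance.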
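For reference, the standard ridge-regression BLUP solution of a mixed model of this form can be written out as below. This is a textbook sketch rather than a description of the exact rrBLUP call used here. It assumes $u \sim N(0, I\sigma_u^2)$ and $e \sim N(0, I\sigma_e^2)$; the symbols $V$, $\lambda$, and the hatted estimates are introduced only for illustration.

$\hat{b} = (X^{\top} V^{-1} X)^{-1} X^{\top} V^{-1} y, \quad V = Z Z^{\top} \sigma_u^2 + I \sigma_e^2$

$\hat{u} = Z^{\top} (Z Z^{\top} + \lambda I)^{-1} (y - X \hat{b}), \quad \lambda = \sigma_e^2 / \sigma_u^2$

$\mathrm{GEBV} = Z \hat{u}$

Under these assumptions, a larger $\lambda$ (lower marker variance relative to residual variance) shrinks the estimated marker effects more strongly toward zero.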

2.8. Marker Selection with GWAS and Feature Selection

A genome-wide association study (GWAS) was conducted using a categorical association approach (case vs. control) to investigate the genotype–phenotype relationship. SNPs were considered significantly associated with the trait if p < 0.01. All analyses were carried out using the --assoc option in PLINK. A total of 39 SNP markers were selected from the GBS dataset for machine learning model construction, using both GWAS results and machine learning-based variable importance. To reduce computational time, SNPs were first prioritized by the p-values (p < 0.1) obtained from association analyses between the large (case) and small (control) groups. The importance of each SNP was then calculated with the machine learning models using the genotypes of the prioritized SNPs. Because the variable importance scores varied substantially between models, standardized scores were used to prevent bias. SNPs with an average standardized importance score greater than or equal to zero across models were selected as key SNPs.
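The importance-based selection step can be sketched in Python as follows. This is a minimal illustration, not the caret-based code used in the study: the function names, the two model labels, and the raw importance values are hypothetical, and z-scoring within each model is assumed as the standardization.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of scores to mean 0 and unit (population) variance."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def select_key_snps(importance_by_model, threshold=0.0):
    """Average per-model standardized importance and keep SNPs at or above threshold.

    importance_by_model: {model name: {snp id: raw importance}};
    every model is assumed to score the same set of SNPs.
    """
    snps = sorted(next(iter(importance_by_model.values())))
    standardized = {}
    for model, scores in importance_by_model.items():
        z = zscores([scores[s] for s in snps])
        standardized[model] = dict(zip(snps, z))
    averaged = {s: mean(standardized[m][s] for m in standardized) for s in snps}
    return [s for s in snps if averaged[s] >= threshold], averaged


# Hypothetical raw importances on very different scales.
importance = {
    "rf": {"snp1": 80.0, "snp2": 10.0, "snp3": 30.0},
    "svm": {"snp1": 0.9, "snp2": 0.2, "snp3": 0.1},
}
key_snps, avg = select_key_snps(importance)
print(key_snps)
```

Standardizing within each model before averaging keeps a model that reports importance on a large numeric scale from dominating the selection.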

2.9. Machine Learning Model Construction

Supervised ML was used to construct models with greater predictive power from the high-dimensional datasets. Here, we constructed models using the 39 significant SNPs (as features) selected through GWAS and feature selection. For the purpose of further statistical analyses, the larger category was treated as the case group, while the smaller category functioned as the control group. Bulb weights in the larger category ranged from 320.6 g to 429.0 g, whereas those in the smaller category fell between 100.0 g and 241.1 g. The classification models were constructed from seven supervised learning models: AdaBoost (AB), Bagged Tree (BAG), Generalized Boosted Regression (GBM), Boosted Logistic Regression (LOGI), partial least squares (PLS), Random Forest (RF), and the support vector machine (SVM), using the caret v6.0.94 package in R v4.3.3 [47]. All of the above models were used with default parameters. To evaluate the prediction models’ effectiveness, we utilized confusion matrices and ROC curves. For a quantitative comparison of the latter, we calculated the AUC and employed a two-tailed Student’s t-test to determine significant differences between curves. We followed the evaluation metric procedures outlined by Manavalan et al. [48,49]. The plotROC R package [50] was used to generate the ROC curves and compute the AUC. Our resampling approach for learning involved repeated cross-validation. Specifically, we conducted 15 learning iterations with 10 folds using 60% of the total learning data, while the remaining 40% was reserved for validation purposes. To ensure result reproducibility, we set the seed value to “100” for both learning and testing processes.
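The evaluation metrics named above can be computed as in the following Python sketch. It is an illustration only and does not reproduce the caret or plotROC calls: the function names, labels, and predicted probabilities are hypothetical, the 0.5 decision threshold is an assumption, and the AUC is computed with the rank-based (Mann–Whitney) formulation.

```python
def confusion_counts(y_true, y_pred, positive="large"):
    """Return (TP, TN, FP, FN) with `positive` treated as the case class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def summarize(y_true, y_pred, positive="large"):
    """Accuracy, sensitivity, and specificity from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(y_true, scores, positive="large"):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities of the "large" class.
y_true = ["large", "large", "large", "small", "small", "small"]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = ["large" if s >= 0.5 else "small" for s in scores]
stats = summarize(y_true, y_pred)
area = auc(y_true, scores)
print(stats, area)
```

Unlike accuracy, the AUC does not depend on the chosen probability threshold, which is why both are usually reported together.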

2.10. Targeted Fluidigm Chip Design and Genotyping for Prediction Model Validation

The 96 markers selected for their strong association with onion bulb weight, through both GWAS analysis and feature selection using machine learning algorithms, were used to design a custom Fluidigm chip. This genotyping approach was implemented to validate the predictive accuracy of the machine learning model using these markers, ensuring precise confirmation of the genotypes in the 260 validation samples. The target SNP genotyping chip was constructed with a Fluidigm 96 dynamic array integrated fluidic circuit (IFC) using an Advanta sample ID genotyping panel. For chip design, the primers for each SNP were selected within the 100 bp flanking regions. Allele-specific primers (ASPs), locus-specific primers (LSPs), and specific target amplification primers (STAs) were designed using the Fluidigm SNP type assay protocol. The PCR cocktail was prepared with ASPs, LSPs, and STAs, according to the manufacturer’s protocol, along with the high-quality DNA prepared from each sample. The samples were loaded in a 96-well plate (12 columns × 8 rows) and subjected to the SNPtype 96 × 96 thermal cycling protocol with an FC1 PCR cycler. Fluorescence was detected with the Biomark HD and processed with the SNP Trace™ Panel Analysis tool in the SNP genotyping analysis software (https://www.standardbio.com, accessed on 7 November 2024). Genomic DNA (gDNA) quality was assessed based on the concentration (ng/μL) of each sample, which was measured with a Biotek Epoch spectrometer at 260/280 nm. The Fluidigm chip was designed following the method described by Noh et al. [51], and all experimental protocols and related services were performed by TNT Research Services in Anyang, South Korea.

2.11. In Silico Crossbreeding Simulations

We generated in silico offspring considering genetic distances across the genome. First, we obtained haplotype phasing information from the real sample genotypes at the 39 high-quality SNPs using Eagle v2.4.1 software [52]. These haplotypes were used to estimate the corresponding marginal distributions and pairwise correlation coefficients using the “hapsim” and “sim1000G” packages in R [53,54]. The estimated coefficients were used as recombination probabilities to generate chromosome-level genotypes of in silico offspring. The resulting genotype data were merged using PLINK v1.90b6.21 software, with the original samples that provided the haplotype information recorded as sire/dam samples.
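The idea of building offspring genotypes from phased parental haplotypes can be illustrated with the simplified Python sketch below. It does not reproduce the hapsim/sim1000G procedure: the function names and toy haplotypes are hypothetical, and a single recombination probability per adjacent marker pair is assumed to be supplied directly rather than estimated from haplotype correlations.

```python
import random


def make_gamete(hap_a, hap_b, recomb_probs, rng):
    """Build one gamete by walking along the markers and switching between the
    two parental haplotypes with the given probability per marker interval."""
    haps = (hap_a, hap_b)
    current = rng.choice([0, 1])  # starting haplotype chosen at random
    gamete = [haps[current][0]]
    for i, r in enumerate(recomb_probs, start=1):
        if rng.random() < r:
            current = 1 - current  # crossover between marker i-1 and marker i
        gamete.append(haps[current][i])
    return gamete


def simulate_offspring(sire, dam, recomb_probs, n, seed=100):
    """Return n offspring genotypes (0/1/2 allele counts) from two phased parents.

    sire and dam are (haplotype, haplotype) pairs of 0/1 allele lists;
    recomb_probs holds one probability per adjacent marker pair.
    """
    rng = random.Random(seed)
    offspring = []
    for _ in range(n):
        from_sire = make_gamete(sire[0], sire[1], recomb_probs, rng)
        from_dam = make_gamete(dam[0], dam[1], recomb_probs, rng)
        offspring.append([a + b for a, b in zip(from_sire, from_dam)])
    return offspring


# Hypothetical phased parents at four markers.
sire = ([1, 1, 0, 0], [0, 0, 1, 1])
dam = ([1, 0, 1, 0], [1, 0, 1, 0])
recomb_probs = [0.1, 0.3, 0.1]  # one value per marker interval
progeny = simulate_offspring(sire, dam, recomb_probs, n=5)
for genotype in progeny:
    print(genotype)
```

Because the dam in this toy example is homozygous at every marker, all variation among the simulated offspring comes from recombination between the two sire haplotypes.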

2.12. Crossing Efficiency Simulation

We used the constructed machine learning models to assess the expected outcome of each crossbred combination. The genotypes of the generated in silico offspring were used as input to obtain a prediction from each machine learning model. These predictions give the probability that an offspring's bulb weight is classified as large or small. We then proposed the optimal and worst crossbred combinations based on the classification probability distributions of the in silico offspring.
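Ranking crosses by the predicted probabilities of their simulated offspring can be sketched as follows. The example is illustrative only: the function name is hypothetical, the probabilities are invented, the parent labels merely reuse variety names from Section 2.1, and summarizing each cross by its mean probability is an assumption (other summaries of the distribution could be used instead).

```python
from statistics import mean


def rank_crosses(probabilities_by_cross):
    """Sort crosses by the mean predicted probability of the "large" class.

    probabilities_by_cross: {(sire, dam): [P(large) for each simulated offspring]}
    """
    summary = {cross: mean(p) for cross, p in probabilities_by_cross.items()}
    return sorted(summary.items(), key=lambda item: item[1], reverse=True)


# Hypothetical model outputs for the offspring of three crosses.
predicted = {
    ("MO100", "MO300"): [0.9, 0.8, 0.7],
    ("MO100", "MO400"): [0.4, 0.5, 0.3],
    ("MO300", "MO400"): [0.6, 0.7, 0.5],
}
ranking = rank_crosses(predicted)
best, worst = ranking[0][0], ranking[-1][0]
print(best, worst)
```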

3. Results

3.1. Population Structure and Genetic Diversity Analysis

Our research employed 98 DH (doubled haploid) genetic resources to enhance onion bulb size. We acquired genotype data through genotyping by sequencing (GBS), generating 137.09 Gb of sequencing data from 98 samples, with each sample yielding approximately 1.37 Gb. The reads were aligned to the Allium cepa v1.2 reference onion genome, achieving an 85.7% mapping rate (Supplementary Figure S2). The mapped reads covered about 17,600,957 bases, constituting 0.1% of the entire genome, as shown in the variant flowchart and results (Supplementary Figure S3). Initially, 4,358,417 SNPs were identified from the mapped reads. After applying stringent quality control filters (as shown in Supplementary Figure S4 and described in Section 2.5, “Variant Calling from GBS Data”), we identified a total of 51,499 high-quality single-nucleotide polymorphisms (HQ-SNPs). The majority of these SNPs are located in intergenic regions (Supplementary Figure S5). Based on these HQ-SNPs, the model-based sub-population analysis grouped the samples into three clusters (Figure 1A). Furthermore, IBS genetic affinity assessment revealed two additional clusters with 0.75 and 0.85 admixed scores, resulting in five sub-populations that showed strong associations with bulb weight (Figure 1B). This pattern was also evident in the PCA (Figure 1C).

3.2. Genomic Estimated Breeding Values (GEBVs) That Estimate BVs and Phenotypes from Genotype Information

In this study, we performed a comprehensive analysis to identify markers associated with quantitative traits such as onion bulb weight. Approximately 14,143 SNPs associated with these quantitative traits were explored to provide a basis for constructing genomic breeding values (GEBVs) using the Genomic Best Linear Unbiased Prediction (GBLUP) model. The GBLUP model can assign weights to these markers to more accurately predict breeding values. Considering that the traits under consideration are quantitative, regression analysis was used to estimate the effect of each SNP on the phenotype (heritability, G) and develop a prediction model. This model only used genotypic information to predict values for traits such as bulb weight without considering environmental factors. The results showed high accuracy, with R² values exceeding 0.97 for all traits, which confirmed the robustness of the model predicting phenotypic values based only on genomic data (Figure 2A). The estimation of breeding value varied depending on the number of SNPs used in the GBLUP model. When 96 SNPs were used for bulb weight, R² was 0.9784, and when 39 were used, R² was 0.7293 (Figure 2B,C). However, when the number of SNPs increased to 14,143, the R² value reached 0.9794, indicating that the prediction accuracy increased as the number of markers increased. To evaluate the predictive accuracy of the GBLUP model for bulb weight estimation, 96 SNPs strongly associated with the trait were genotyped using a Fluidigm chip across 260 new samples. The genotype-based predicted values generated by the GBLUP model were compared to the actual phenotypic measurements. The validation results revealed a discrepancy of 20–40 g between the predicted and observed weights for most samples, with the majority of predictions deviating by less than 100 g (Supplementary Figure S6). 
The coefficient of determination (R²) between the predicted and actual values using 96 and 39 SNPs was 0.22, indicating that the model’s predictive performance diminishes substantially when fewer than 100 markers are used, especially when applied to novel genetic backgrounds (Supplementary Figure S7).

3.3. Advanced SNP Selection and Machine Learning (ML) for Onion Bulb Weight Prediction

To enhance the predictive modeling of onion bulb weight, we implemented a machine learning-driven approach for refining SNP marker selection. Initially, samples from a genome-wide association study (GWAS) on quantitative trait association were stratified into case–control groups based on bulb weight, representing the top 30% and bottom 30% of phenotypes (Figure 3A). This transformation of continuous traits into discrete categories enabled more robust association analysis, leading to the identification of 1000 significant bulb weight-related SNPs (Supplementary Figure S8). These markers were then subjected to a feature selection process within a machine learning framework, further refining them to 39 key SNPs that demonstrated strong associations with bulb weight and were highly informative for predictive modeling. The predictive power of these 39 SNP markers was validated by their ability to effectively distinguish between high and low bulb weight groups in a case–control format, underscoring their potential for integration into machine learning models for genomic selection and breeding programs. Subsequently, several advanced machine learning algorithms, such as AdaBoost, Bagged Tree, Gradient Boosting, Logistic Regression, partial least squares (PLS), Random Forest, and the support vector machine (SVM), were employed to develop predictive models, with data split into 70% training and 30% testing sets to evaluate model performance in terms of accuracy, sensitivity, and specificity. Among these, AdaBoost, Random Forest, and PLS demonstrated the highest prediction accuracy, each achieving 93.2% (Figure 3B). The models were further validated using 121 (65 large; 66 small) new samples, which were classified as “high” or “low” based on a phenotypic threshold (275 g) established from the elite breeding lines (Figure 3C). 
The PLS model outperformed others, achieving the highest accuracy of 83.2% in distinguishing between “high” and “low” categories, underscoring its potential for robust phenotypic prediction in genomic selection and breeding programs (Figure 3D).
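The conversion of a continuous trait into case and control groups described in this section can be illustrated with a short Python sketch. The function name, line identifiers, and bulb weights are hypothetical; only the 30% tail fraction is taken from the text.

```python
def stratify_extremes(weights, fraction=0.3):
    """Split lines into large (case) and small (control) groups using the
    top and bottom `fraction` of the bulb-weight distribution."""
    ranked = sorted(weights.items(), key=lambda item: item[1])
    k = int(len(ranked) * fraction)
    small = [name for name, _ in ranked[:k]]
    large = [name for name, _ in ranked[-k:]] if k else []
    return large, small


# Hypothetical mean bulb weights (g) for ten lines.
weights = {
    "line01": 100.0, "line02": 150.0, "line03": 180.0, "line04": 220.0,
    "line05": 260.0, "line06": 290.0, "line07": 330.0, "line08": 360.0,
    "line09": 400.0, "line10": 429.0,
}
large, small = stratify_extremes(weights)
print(large, small)
```

Lines falling between the two tails are left out of the case–control comparison, which sharpens the contrast between groups at the cost of a smaller sample.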

3.4. Simulation of Crossing Using In Silico Offspring Genotypes

Using genomic data from 98 elite onion lines, we conducted an in silico simulation to generate 500,000 virtual offspring, leveraging the haplotype information derived from these elite lines. This enabled us to estimate the expected bulb weight for each cross using our previously established machine learning (ML) model, specifically designed around 39 key SNPs associated with bulb weight. Unlike traditional GBLUP models, which were not utilized in this analysis due to their limited predictive accuracy under marker number constraints, the ML model provided robust predictions by capturing complex genetic architectures. We utilized the genotypes of these virtual offspring to predict the probability of achieving a bulb weight of 250 g or more, using the ML model to assess their probability distributions. When examining the actual observed bulb weights of the parental lines for the top 5% and bottom 5% of these virtual offspring, we found that the parents of the top 5% were ranked higher based on their observed bulb weight, while those of the bottom 5% offspring were ranked lower (Figure 4). This validation confirms the reliability of our ML-driven predictions.

Based on these results, we were able to rank all possible breeding combinations to identify those most likely to produce high-performing offspring. The simulation outcomes provide a powerful framework for data-driven breeding strategies, demonstrating how the integration of genomic data and machine learning models can simulate and predict various breeding combination outcomes.

4. Discussion

Our research presents a novel approach to crop improvement by integrating machine learning (ML) with genomic selection (GS) in onion breeding (Figure 5). Like other horticultural crops, onions face significant breeding challenges due to their large genome size, biennial growth cycle, and complex polygenic traits [1,55]. Additionally, whole-genome re-sequencing for large genomes is costly and may not provide complete coverage when using short reads [56]. Our study offers a refined, data-driven breeding method that overcomes traditional limitations by employing state-of-the-art bioinformatics tools, genetic data, and computer simulations [26,27]. The conventional genomic selection-based model initially showed good predictive performance within the same dataset, but its accuracy decreased when applied to external data. The experimental results demonstrated that GBS data correlated well with estimates (0.979) for onion weight, while the Fluidigm data correlation dropped to 0.220. However, the classification-based ML model performed well across GBS (0.932) and the Fluidigm chip (0.832). After confirming the classification model’s superior performance against the conventional genomic selection predictive model, we addressed the challenge of feature selection for model deployment. A similar approach is used in animal breeding, since the sampling size is often limited [57]. We developed a workflow to create primary models using top-ranked GWAS SNPs, selected the best features based on the importance of the best models, and then chose the optimal features for secondary models. A similar approach has also been used for oil seed plant genome selection to ensure that minimal features can perform well in classification models [58]. This iterative process allows non-breeders to execute the computational pipeline to identify the best variants for a given trait [57,59]. 
In our study, we initially selected 1000 SNPs from GWAS and reduced them to 39 SNPs, achieving accuracies of 0.97 and 0.78. As we observed, the conversion of continuous traits into discrete categories enabled a more robust association analysis with classification machine learning models.

Another crucial finding from this study is the confirmation of ML models through in silico crossbreeding simulations. The ability to forecast the outcomes of potential breeding crosses before physical implementation has profound implications for expediting and enhancing breeding programs. Our simulations revealed that virtual progeny generated from superior onion lines displayed consistent results when ranked by bulb weight, validating the model’s predictive power in a practical breeding context. Ranking crossbred combinations based on genotype data enables breeders to prioritize crosses with the highest probability of producing desired phenotypes. This approach not only diminishes the time and expenses associated with traditional phenotypic selection but also offers a more targeted breeding strategy. From a bioinformatics standpoint, the incorporation of ML models such as AdaBoost, Random Forest, and PLS for phenotype prediction signifies a transition from conventional statistical models to more advanced methods capable of capturing non-linear relationships within genomic data. These models provide enhanced adaptability and resilience, particularly in managing the complex genetic architectures linked to polygenic traits. The success of these models in predicting bulb weight further emphasizes the potential of machine learning in plant breeding, especially for traits that are challenging to measure directly or are influenced by multiple environmental and genetic factors.

This research offers significant insights into the broader utilization of regression and classification machine learning models and genomic selection in plant breeding. The strong correlation between predicted and actual phenotype groups indicates that comparable methods could be applied to other economically significant traits, including disease resistance, stress tolerance, and yield. As the cost of genotyping decreases and computational resources become more readily available, ML-driven breeding strategies are likely to become more common across various crops, such as Castanea crenata and Platycodon grandiflorus [60,61], enhancing both the speed and accuracy of breeding efforts. Nevertheless, several obstacles persist. While we successfully reduced SNP markers to 39 for this study, the applicability of this approach across different environments and onion populations requires further investigation. Genotype-by-environment interactions, which we did not fully address in this study, could impact the accuracy of genotype-based predictions. Future studies should consider incorporating environmental data into the models, potentially using multi-environment trials to ensure reliability across diverse growing conditions. Furthermore, despite the good performance of the machine learning models, the computational expense of training these models, especially with larger datasets, should be considered for widespread implementation.

To summarize, this study presents a compelling argument for incorporating machine learning with genomic selection in onion breeding. By optimizing the process of SNP marker selection and validating predictions through in silico simulations, we showcase a powerful, data-driven method to improve crop breeding efficiency. The knowledge gained from this research has the potential to significantly influence future breeding strategies, not only for onions but also for a wide range of horticultural and agricultural crops. As genomic technologies continue to advance, the convergence of bioinformatics and plant breeding will undoubtedly lead to more precise, efficient, and sustainable approaches to crop improvement.

5. Conclusions

This study demonstrates the successful integration of machine learning and genomic selection techniques to enhance onion breeding efficiency. By combining genome-wide association studies, feature selection, and machine learning models, we identified a set of 39 SNP markers highly predictive of onion bulb weight. The partial least squares (PLS) model achieved 83.2% accuracy in distinguishing between high and low bulb weight categories using these markers, outperforming traditional genomic selection methods. By confirming that the top-performing simulated offspring originated from parents with higher actual bulb weights, we strengthened the reliability of our machine learning-based predictions. This approach allows breeders to prioritize crosses with the highest probability of producing desired phenotypes, potentially reducing the time and resources required in traditional breeding programs. Reducing the number of genetic markers while maintaining high predictive accuracy represents a significant advancement in crop genomics. This method not only improves the efficiency of onion breeding but also has potential applications in other crops with complex genetic architectures. While challenges remain, such as accounting for genotype-by-environment interactions and computational demands, this research provides a robust framework for data-driven breeding strategies. As genomic technologies continue to advance, the integration of bioinformatics and plant breeding promises more precise, efficient, and sustainable approaches to crop improvement, addressing global food security challenges in the face of climate change and increasing demand.
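
As an illustration of how a PLS model can classify lines into high and low categories from a small SNP panel, the following minimal sketch fits a single-response PLS regression (NIPALS algorithm) to a 0/1 class label and thresholds the prediction at 0.5. It is an assumption-laden toy and not the model reported here: the genotypes, marker effects, sample sizes, number of components, and function names (fit_pls1, predict_high) are all invented for the example.

```python
import numpy as np

def fit_pls1(X, y, n_components=3):
    """Fit single-response PLS regression with the NIPALS algorithm.

    Returns (coef, x_mean, y_mean) so that y_hat = (X - x_mean) @ coef + y_mean.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # weight vector for this component
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = (yk @ t) / tt               # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def predict_high(X, model, threshold=0.5):
    """Label a line 1 (high) when the PLS prediction reaches the threshold."""
    coef, x_mean, y_mean = model
    return ((X - x_mean) @ coef + y_mean >= threshold).astype(int)

# Synthetic data only: 300 lines, 39 SNPs coded 0/1/2, invented marker effects.
rng = np.random.default_rng(42)
n_lines, n_snps = 300, 39
X = rng.integers(0, 3, size=(n_lines, n_snps)).astype(float)
effects = rng.normal(size=n_snps)
score = X @ effects + rng.normal(scale=1.0, size=n_lines)
y = (score > np.median(score)).astype(float)   # 1 = high, 0 = low

train, test = slice(0, 200), slice(200, None)
model = fit_pls1(X[train], y[train])
accuracy = float((predict_high(X[test], model) == y[test]).mean())
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed by this sketch reflects the synthetic signal only and says nothing about the 83.2% reported for the onion validation set.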

Author Contributions

Conceptualization, S.C. (Sunghyun Cho) and M.J.; Data curation, S.C. (Sunghyun Cho), Y.-j.L., E.L. and J.L.; Formal analysis, S.C. (Sunghyun Cho), Y.-j.L., J.L. and Y.S.; Funding acquisition, J.C., S.C. (Subin Choi) and Y.S.; Investigation, S.C. (Sunghyun Cho) and H.Y.P.; Methodology, J.C., S.C. (Sunghyun Cho), S.C. (Subin Choi), M.J., H.Y.P. and Y.S.; Project administration, J.C., M.J. and Y.S.; Resources, J.C., S.C. (Subin Choi), H.Y.P. and Y.S.; Software, S.C. (Sunghyun Cho), Y.-j.L. and E.L.; Supervision, S.C. (Subin Choi), M.J. and Y.S.; Visualization, S.C. (Sunghyun Cho), E.L. and J.L.; Writing—original draft, J.C., S.C. (Sunghyun Cho) and Y.S.; Writing—review and editing, J.C., S.C. (Subin Choi), M.J., Y.-j.L., H.Y.P. and Y.S. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The complete sequences generated in this study were deposited in the NCBI Sequence Read Archive under accession no. PRJNA1182546.

Conflicts of Interest

S.C., M.J., Y.-j.L., E.L., J.L. and Y.S. are employed by Insilicogen Inc. J.C. is employed by MIRACLE Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures

Figure 1. Population structure and genetic diversity results. (A) Population structure results for optimized k-values. Onion lines could be grouped into three main genetic clusters. (B) Genetic similarity among the onion lines based on the IBS results; the overall mean value was 0.75. (C) PCA results with bulb weights overlaid. No clustering tendency according to phenotypic distribution was observed.

Figure 2. Correlation plots of conventional genomic selection-based estimates against the observed phenotypes: (A) GBS dataset with 14,143 SNP weight, (B) 96 SNP weight, and (C) 39 SNP weight.

Figure 3. (A) Distribution of onion bulb weight for the training dataset. (B) Machine learning prediction ROC curve along with sensitivity and specificity scores for the validation data. (C) Distribution of onion bulb weight phenotypes for the validation dataset. (D) Box plot of the validation dataset showing the prediction probabilities of the machine learning models for the two classes, namely, high (orange) and low (blue).

Figure 4. Comparison of top 5% prediction probability (A) and GEBV-estimated bulb weight (B). g: gram.

Figure 5. The proposed digital breeding approach for reaching the minimal feature set needed to construct a highly predictive machine learning model for desired traits. The onion bulb size trait was assessed with the proposed approach, yielding 39 SNPs to predict bulb size.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agriculture14122239/s1, Figure S1: The distribution of phenotype for (A) training dataset and (B) validation dataset. The observed onion bulb weight and QC is estimated bulb weight; Figure S2: A summary of the mapping of bases to the onion reference genome; Figure S3: A comprehensive summary of mapped bases based on genomic features; Figure S4: The variant calling workflow, including the outcomes; Figure S5: A comprehensive summary of called variants based on genomic features; Figure S6: Comparison of predicted and observed values of validation data; Figure S7: GBLUP line plot with validation dataset (A) 96 SNPs weight, (B) 39 SNPs weight; Figure S8: Association analysis with Manhattan plots of −log10(P) vs. chromosomal position of associated SNP markers and QQ plots for three horticultural traits, i.e., WT: weight (A), HT: height (B), WD: width (C); Figure S9: Correlation coefficient between the observed and GBLUP-estimated phenotypes for two horticultural traits: (a) onion width (WD), (b) onion height (HT).

References

1. Hao, F.; Liu, X.; Zhou, B.; Tian, Z.; Zhou, L.; Zong, H.; Qi, J.; He, J.; Zhang, Y.; Zeng, P. et al. Chromosome-level genomes of three key Allium crops and their trait evolution. Nat. Genet.; 2023; 55, pp. 1976-1986. [DOI: https://dx.doi.org/10.1038/s41588-023-01546-0] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37932434]

2. Stoica, F.; Rațu, R.N.; Veleșcu, I.D.; Stănciuc, N.; Râpeanu, G. A comprehensive review on bioactive compounds, health benefits, and potential food applications of onion (Allium cepa L.) skin waste. Trends Food Sci. Technol.; 2023; 141, 104173. [DOI: https://dx.doi.org/10.1016/j.tifs.2023.104173]

3. Elattar, M.M.; Darwish, R.S.; Hammoda, H.M.; Dawood, H.M. An ethnopharmacological, phytochemical, and pharmacological overview of onion (Allium cepa L.). J. Ethnopharmacol.; 2024; 324, 117779. [DOI: https://dx.doi.org/10.1016/j.jep.2024.117779] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38262524]

4. Alemu, A.; Astrand, J.; Montesinos-Lopez, O.A.; Isidro, Y.S.J.; Fernandez-Gonzalez, J.; Tadesse, W.; Vetukuri, R.R.; Carlsson, A.S.; Ceplitis, A.; Crossa, J. et al. Genomic selection in plant breeding: Key factors shaping two decades of progress. Mol. Plant; 2024; 17, pp. 552-578. [DOI: https://dx.doi.org/10.1016/j.molp.2024.03.007] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38475993]

5. Dong, Y.; Cheng, Z.; Meng, H.; Liu, H.; Wu, C.; Khan, A.R. The effect of cultivar, sowing date and transplant location in field on bolting of Welsh onion (Allium fistulosum L.). BMC Plant Biol.; 2013; 13, 154. [DOI: https://dx.doi.org/10.1186/1471-2229-13-154]

6. Gedam, P.A.; Thangasamy, A.; Shirsat, D.V.; Ghosh, S.; Bhagat, K.P.; Sogam, O.A.; Gupta, A.J.; Mahajan, V.; Soumia, P.S.; Salunkhe, V.N. et al. Screening of Onion (Allium cepa L.) Genotypes for Drought Tolerance Using Physiological and Yield Based Indices Through Multivariate Analysis. Front. Plant Sci.; 2021; 12, 600371. [DOI: https://dx.doi.org/10.3389/fpls.2021.600371]

7. Singh, H.; Sekhon, B.S.; Kumar, P.; Dhall, R.K.; Devi, R.; Dhillon, T.S.; Sharma, S.; Khar, A.; Yadav, R.K.; Tomar, B.S. et al. Genetic Mechanisms for Hybrid Breeding in Vegetable Crops. Plants; 2023; 12, 2294. [DOI: https://dx.doi.org/10.3390/plants12122294]

8. Cramer, C.S.; Mandal, S.; Sharma, S.; Nourbakhsh, S.S.; Goldman, I.; Guzman, I. Recent Advances in Onion Genetic Improvement. Agronomy; 2021; 11, 482. [DOI: https://dx.doi.org/10.3390/agronomy11030482]

9. Nourbakhsh, S.S.; Cramer, C.S. Onion Plant Size Measurements as Predictors for Onion Bulb Size. Horticulturae; 2022; 8, 682. [DOI: https://dx.doi.org/10.3390/horticulturae8080682]

10. Song, P.; Wang, J.; Guo, X.; Yang, W.; Zhao, C. High-throughput phenotyping: Breaking through the bottleneck in future crop breeding. Crop J.; 2021; 9, pp. 633-645. [DOI: https://dx.doi.org/10.1016/j.cj.2021.03.015]

11. Sun, Y.; Shang, L.; Zhu, Q.H.; Fan, L.; Guo, L. Twenty years of plant genome sequencing: Achievements and challenges. Trends Plant Sci.; 2022; 27, pp. 391-401. [DOI: https://dx.doi.org/10.1016/j.tplants.2021.10.006] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/34782248]

12. Ochar, K.; Kim, S.-H. Conservation and Global Distribution of Onion (Allium cepa L.) Germplasm for Agricultural Sustainability. Plants; 2023; 12, 3294. [DOI: https://dx.doi.org/10.3390/plants12183294] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37765458]

13. Mallor, C.; Arnedo-Andrés, M.S.; Garcés-Claver, A. Assessing the genetic diversity of Spanish Allium cepa landraces for onion breeding using microsatellite markers. Sci. Hortic.; 2014; 170, pp. 24-31. [DOI: https://dx.doi.org/10.1016/j.scienta.2014.02.040]

14. Rouamba, A.; Robert, T.; Sarr, A.; Ricroch, A. A preliminary germplasm evaluation of onion landraces from West Africa. Genome; 1996; 39, pp. 1128-1132. [DOI: https://dx.doi.org/10.1139/g96-142]

15. Suojala-Ahlfors, T.; Heinonen, M.; Tanhuanpää, P.; Antonius, K. Rich diversity in cultivated Finnish potato onions (Allium cepa var. aggregatum G. Don). Genet. Resour. Crop Evol.; 2022; 69, pp. 1547-1555. [DOI: https://dx.doi.org/10.1007/s10722-021-01317-y]

16. Arena, D.; Ben Ammar, H.; Major, N.; Kovačević, T.K.; Goreta Ban, S.; Al Achkar, N.; Rizzo, G.F.; Branca, F. Diversity of the Morphometric and Biochemical Traits of Allium cepa L. Varieties. Plants; 2024; 13, 1727. [DOI: https://dx.doi.org/10.3390/plants13131727]

17. Villano, C.; Esposito, S.; Carucci, F.; Iorizzo, M.; Frusciante, L.; Carputo, D.; Aversano, R. High-throughput genotyping in onion reveals structure of genetic diversity and informative SNPs useful for molecular breeding. Mol. Breed.; 2018; 39, 5. [DOI: https://dx.doi.org/10.1007/s11032-018-0912-0]

18. Shukla, S.; Iquebal, M.A.; Jaiswal, S.; Angadi, U.B.; Fatma, S.; Kumar, N.; Jasrotia, R.S.; Fatima, Y.; Rai, A.; Kumar, D. The Onion Genomic Resource: A genomics and bioinformatics driven resource for onion breeding. Plant Gene; 2016; 8, pp. 9-15. [DOI: https://dx.doi.org/10.1016/j.plgene.2016.09.003]

19. Lee, J.-H.; Natarajan, S.; Biswas, M.K.; Shirasawa, K.; Isobe, S.; Kim, H.-T.; Park, J.-I.; Seong, C.-N.; Nou, I.-S. SNP discovery of Korean short day onion inbred lines using double digest restriction site-associated DNA sequencing. PLoS ONE; 2018; 13, e0201229. [DOI: https://dx.doi.org/10.1371/journal.pone.0201229]

20. Baldwin, S.; Pither-Joyce, M.; Wright, K.; Chen, L.; McCallum, J. Development of robust genomic simple sequence repeat markers for estimation of genetic diversity within and among bulb onion (Allium cepa L.) populations. Mol. Breed.; 2012; 30, pp. 1401-1411. [DOI: https://dx.doi.org/10.1007/s11032-012-9727-6]

21. Baldwin, S.; Revanna, R.; Pither-Joyce, M.; Shaw, M.; Wright, K.; Thomson, S.; Moya, L.; Lee, R.; Macknight, R.; McCallum, J. Genetic analyses of bolting in bulb onion (Allium cepa L.). Theor. Appl. Genet.; 2014; 127, pp. 535-547. [DOI: https://dx.doi.org/10.1007/s00122-013-2232-4] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24247236]

22. Sudha, G.S.; Ramesh, P.; Sekhar, A.C.; Krishna, T.S.; Bramhachari, P.V.; Riazunnisa, K. Genetic diversity analysis of selected Onion (Allium cepa L.) germplasm using specific RAPD and ISSR polymorphism markers. Biocatal. Agric. Biotechnol.; 2019; 17, pp. 110-118. [DOI: https://dx.doi.org/10.1016/j.bcab.2018.11.007]

23. Sekine, D.; Yabe, S. Simulation-based optimization of genomic selection scheme for accelerating genetic gain while preventing inbreeding depression in onion breeding. Breed. Sci.; 2020; 70, pp. 594-604. [DOI: https://dx.doi.org/10.1270/jsbbs.20047] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33603556]

24. Hayes, B.J.; Chen, C.; Powell, O.; Dinglasan, E.; Villiers, K.; Kemper, K.E.; Hickey, L.T. Advancing artificial intelligence to help feed the world. Nat. Biotechnol.; 2023; 41, pp. 1188-1189. [DOI: https://dx.doi.org/10.1038/s41587-023-01898-2] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37524959]

25. Wallace, J.G.; Rodgers-Melnick, E.; Buckler, E.S. On the Road to Breeding 4.0: Unraveling the Good, the Bad, and the Boring of Crop Quantitative Genomics. Annu. Rev. Genet.; 2018; 52, pp. 421-444. [DOI: https://dx.doi.org/10.1146/annurev-genet-120116-024846]

26. van Dijk, A.D.J.; Kootstra, G.; Kruijer, W.; de Ridder, D. Machine learning in plant science and plant breeding. iScience; 2021; 24, 101890. [DOI: https://dx.doi.org/10.1016/j.isci.2020.101890]

27. Tong, H.; Nikoloski, Z. Machine learning approaches for crop improvement: Leveraging phenotypic and genotypic big data. J. Plant Physiol.; 2021; 257, 153354. [DOI: https://dx.doi.org/10.1016/j.jplph.2020.153354]

28. John, M.; Haselbeck, F.; Dass, R.; Malisi, C.; Ricca, P.; Dreischer, C.; Schultheiss, S.J.; Grimm, D.G. A comparison of classical and machine learning-based phenotype prediction methods on simulated data and three plant species. Front. Plant Sci.; 2022; 13, 932512. [DOI: https://dx.doi.org/10.3389/fpls.2022.932512]

29. Silva, P.P.; Gaudillo, J.D.; Vilela, J.A.; Roxas-Villanueva, R.M.L.; Tiangco, B.J.; Domingo, M.R.; Albia, J.R. A machine learning-based SNP-set analysis approach for identifying disease-associated susceptibility loci. Sci. Rep.; 2022; 12, 15817. [DOI: https://dx.doi.org/10.1038/s41598-022-19708-1]

30. Crossa, J.; Perez-Rodriguez, P.; Cuevas, J.; Montesinos-Lopez, O.; Jarquin, D.; de Los Campos, G.; Burgueno, J.; Gonzalez-Camacho, J.M.; Perez-Elizalde, S.; Beyene, Y. et al. Genomic Selection in Plant Breeding: Methods, Models, and Perspectives. Trends Plant Sci.; 2017; 22, pp. 961-975. [DOI: https://dx.doi.org/10.1016/j.tplants.2017.08.011]

31. Norman, A.; Taylor, J.; Edwards, J.; Kuchel, H. Optimising Genomic Selection in Wheat: Effect of Marker Density, Population Size and Population Structure on Prediction Accuracy. G3 Genes Genomes Genet.; 2018; 8, pp. 2889-2899. [DOI: https://dx.doi.org/10.1534/g3.118.200311] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29970398]

32. Spindel, J.; Begum, H.; Akdemir, D.; Virk, P.; Collard, B.; Redona, E.; Atlin, G.; Jannink, J.L.; McCouch, S.R. Correction: Genomic Selection and Association Mapping in Rice (Oryza sativa): Effect of Trait Genetic Architecture, Training Population Composition, Marker Number and Statistical Model on Accuracy of Rice Genomic Selection in Elite, Tropical Rice Breeding Lines. PLoS Genet.; 2015; 11, e1005350. [DOI: https://dx.doi.org/10.1371/journal.pgen.1005350]

33. Peixoto, M.A.; Coelho, I.F.; Leach, K.A.; Lubberstedt, T.; Bhering, L.L.; Resende, M.F.R., Jr. Use of simulation to optimize a sweet corn breeding program: Implementing genomic selection and doubled haploid technology. G3 Genes Genomes Genet.; 2024; 14, jkae128. [DOI: https://dx.doi.org/10.1093/g3journal/jkae128]

34. Krenzer, D.; Frisch, M.; Beckmann, K.; Kox, T.; Flachenecker, C.; Abbadi, A.; Snowdon, R.; Herzog, E. Simulation-based establishment of base pools for a hybrid breeding program in winter rapeseed. Theor. Appl. Genet.; 2024; 137, 16. [DOI: https://dx.doi.org/10.1007/s00122-023-04519-3]

35. Bolger, A.M.; Lohse, M.; Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics; 2014; 30, pp. 2114-2120. [DOI: https://dx.doi.org/10.1093/bioinformatics/btu170]

36. Finkers, R.; van Kaauwen, M.; Ament, K.; Burger-Meijer, K.; Egging, R.; Huits, H.; Kodde, L.; Kroon, L.; Shigyo, M.; Sato, S. et al. Insights from the first genome assembly of Onion (Allium cepa). G3 Genes Genomes Genet.; 2021; 11, jkab243. [DOI: https://dx.doi.org/10.1093/g3journal/jkab243]

37. Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics; 2009; 25, pp. 1754-1760. [DOI: https://dx.doi.org/10.1093/bioinformatics/btp324]

38. McKenna, A.; Hanna, M.; Banks, E.; Sivachenko, A.; Cibulskis, K.; Kernytsky, A.; Garimella, K.; Altshuler, D.; Gabriel, S.; Daly, M. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res.; 2010; 20, pp. 1297-1303. [DOI: https://dx.doi.org/10.1101/gr.107524.110]

39. Cingolani, P.; Platts, A.; Wang, L.L.; Coon, M.; Nguyen, T.; Wang, L.; Land, S.J.; Lu, X.; Ruden, D.M. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly; 2012; 6, pp. 80-92. [DOI: https://dx.doi.org/10.4161/fly.19695]

40. Danecek, P.; Auton, A.; Abecasis, G.; Albers, C.A.; Banks, E.; DePristo, M.A.; Handsaker, R.E.; Lunter, G.; Marth, G.T.; Sherry, S.T. et al. The variant call format and VCFtools. Bioinformatics; 2011; 27, pp. 2156-2158. [DOI: https://dx.doi.org/10.1093/bioinformatics/btr330]

41. Purcell, S.; Neale, B.; Todd-Brown, K.; Thomas, L.; Ferreira, M.A.; Bender, D.; Maller, J.; Sklar, P.; de Bakker, P.I.; Daly, M.J. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet.; 2007; 81, pp. 559-575. [DOI: https://dx.doi.org/10.1086/519795] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/17701901]

42. Alexander, D.H.; Novembre, J.; Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res.; 2009; 19, pp. 1655-1664. [DOI: https://dx.doi.org/10.1101/gr.094052.109] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/19648217]

43. Hubisz, M.J.; Falush, D.; Stephens, M.; Pritchard, J.K. Inferring weak population structure with the assistance of sample group information. Mol. Ecol. Resour.; 2009; 9, pp. 1322-1332. [DOI: https://dx.doi.org/10.1111/j.1755-0998.2009.02591.x]

44. Chen, Z.L.; Meng, J.M.; Cao, Y.; Yin, J.L.; Fang, R.Q.; Fan, S.B.; Liu, C.; Zeng, W.F.; Ding, Y.H.; Tan, D. et al. A high-speed search engine pLink 2 with systematic evaluation for proteome-scale identification of cross-linked peptides. Nat. Commun.; 2019; 10, 3404. [DOI: https://dx.doi.org/10.1038/s41467-019-11337-z]

45. Saito, Y.A. The role of genetics in IBS. Gastroenterol. Clin. N. Am.; 2011; 40, pp. 45-67. [DOI: https://dx.doi.org/10.1016/j.gtc.2010.12.011]

46. Endelman, J.B. Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Genome; 2011; 4, pp. 250-255. [DOI: https://dx.doi.org/10.3835/plantgenome2011.08.0024]

47. Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw.; 2008; 28, 26. [DOI: https://dx.doi.org/10.18637/jss.v028.i05]

48. Manavalan, B.; Subramaniyam, S.; Shin, T.H.; Kim, M.O.; Lee, G. Machine-learning-based prediction of cell-penetrating peptides and their uptake efficiency with improved accuracy. J. Proteome Res.; 2018; 17, pp. 2715-2726. [DOI: https://dx.doi.org/10.1021/acs.jproteome.8b00148]

49. Boopathi, V.; Subramaniyam, S.; Malik, A.; Lee, G.; Manavalan, B.; Yang, D.-C. mACPpred: A Support Vector Machine-Based Meta-Predictor for Identification of Anticancer Peptides. Int. J. Mol. Sci.; 2019; 20, 1964. [DOI: https://dx.doi.org/10.3390/ijms20081964]

50. Sachs, M.C. plotROC: A Tool for Plotting ROC Curves. J. Stat. Softw.; 2017; 79, 19. [DOI: https://dx.doi.org/10.18637/jss.v079.c02]

51. Noh, E.S.; Subramaniyam, S.; Cho, S.; Kim, Y.-O.; Park, C.-J.; Lee, J.-H.; Nam, B.-H.; Shin, Y. Genotyping of Haliotis discus hannai and machine learning models to predict the heat resistant phenotype based on genotype. Front. Genet.; 2023; 14, 1151427. [DOI: https://dx.doi.org/10.3389/fgene.2023.1151427] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37065481]

52. Loh, P.R.; Danecek, P.; Palamara, P.F.; Fuchsberger, C.; Yakir, A.R.; Hilary, K.F.; Schoenherr, S.; Forer, L.; McCarthy, S.; Abecasis, G.R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet.; 2016; 48, pp. 1443-1448. [DOI: https://dx.doi.org/10.1038/ng.3679] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27694958]

53. Dimitromanolakis, A.; Xu, J.; Krol, A.; Briollais, L. sim1000G: A user-friendly genetic variant simulator in R for unrelated individuals and family-based designs. BMC Bioinform.; 2019; 20, 26. [DOI: https://dx.doi.org/10.1186/s12859-019-2611-1] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30646839]

54. Montana, G. HapSim: A simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients. Bioinformatics; 2005; 21, pp. 4309-4311. [DOI: https://dx.doi.org/10.1093/bioinformatics/bti689]

55. Amir, A.; Sharangi, A.B.; Bal, S.; Upadhyay, T.K.; Khan, M.S.; Ahmad, I.; Alabdallah, N.M.; Saeed, M.; Thapa, U. Genetic Variability and Diversity in Red Onion (Allium cepa L.) Genotypes: Elucidating Morpho-Horticultural and Quality Perspectives. Horticulturae; 2023; 9, 1005. [DOI: https://dx.doi.org/10.3390/horticulturae9091005]

56. Chat, V.; Ferguson, R.; Morales, L.; Kirchhoff, T. Ultra Low-Coverage Whole-Genome Sequencing as an Alternative to Genotyping Arrays in Genome-Wide Association Studies. Front. Genet.; 2022; 12, 790445. [DOI: https://dx.doi.org/10.3389/fgene.2021.790445]

57. Piles, M.; Bergsma, R.; Gianola, D.; Gilbert, H.; Tusell, L. Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning. Front. Genet.; 2021; 12, 611506. [DOI: https://dx.doi.org/10.3389/fgene.2021.611506]

58. Shahsavari, M.; Mohammadi, V.; Alizadeh, B.; Alizadeh, H. Application of machine learning algorithms and feature selection in rapeseed (Brassica napus L.) breeding for seed yield. Plant Methods; 2023; 19, 57. [DOI: https://dx.doi.org/10.1186/s13007-023-01035-9]

59. Montesinos-López, O.A.; Kismiantini; Montesinos-López, A. Two simple methods to improve the accuracy of the genomic selection methodology. BMC Genom.; 2023; 24, 220. [DOI: https://dx.doi.org/10.1186/s12864-023-09294-5]

60. Kang, M.J.; Shin, A.Y.; Shin, Y.; Lee, S.A.; Lee, H.R.; Kim, T.D.; Choi, M.; Koo, N.; Kim, Y.M.; Kyeong, D. et al. Identification of transcriptome-wide, nut weight-associated SNPs in Castanea crenata. Sci. Rep.; 2019; 9, 13161. [DOI: https://dx.doi.org/10.1038/s41598-019-49618-8]

61. Yu, G.E.; Shin, Y.; Subramaniyam, S.; Kang, S.H.; Lee, S.M.; Cho, C.; Lee, S.S.; Kim, C.K. Machine learning, transcriptome, and genotyping chip analyses provide insights into SNP markers identifying flower color in Platycodon grandiflorus. Sci. Rep.; 2021; 11, 8019. [DOI: https://dx.doi.org/10.1038/s41598-021-87281-0]

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Onions (Allium cepa L.) are a globally significant horticultural crop, ranking second only to tomatoes in terms of cultivation and consumption. However, due to the crop’s complex genome structure, lengthy growth cycle, self-incompatibility, and susceptibility to disease, onion breeding is challenging. To address these issues, we implemented digital breeding techniques utilizing genomic data from 98 elite onion lines. We identified 51,499 high-quality variants and employed these data to construct a genomic estimated breeding value (GEBV) model and apply machine learning methods for bulb weight prediction. Validation with 260 new individuals revealed that the machine learning model achieved an accuracy of 83.2% and required only thirty-nine SNPs. Subsequent in silico crossbreeding simulations indicated that offspring from the top 5% of elite lines exhibited the highest bulb weights, aligning with traditional phenotypic selection methods. This approach demonstrates that early-stage selection based on genotypic information followed by crossbreeding can achieve economically viable breeding results. This methodology is not restricted to bulb weight and can be applied to various horticultural traits, significantly improving the efficiency of onion breeding through advanced digital technologies. The integration of genomic data, machine learning, and computer simulations provides a powerful framework for data-driven breeding strategies, accelerating the development of superior onion varieties to meet global demand.

Details

Title

Genotype-Driven Phenotype Prediction in Onion Breeding: Machine Learning Models for Enhanced Bulb Weight Selection

Author

Choi, Junhwa¹; Cho, Sunghyun²; Choi, Subin³; Jung, Myunghee²; Yu-jin, Lim²; Lee, Eunchae²; Lim, Jaewon²; Han Yong Park³; Shin, Younhee²

¹ Institute of Breeding Research, MIRACLE Co., Ltd., Jeju 63022, Republic of Korea
² Research and Development Center, Insilicogen Inc., 13, Yongin-si 16954, Republic of Korea
³ Department of Bioresource Engineering, Sejong University, Seoul 05006, Republic of Korea

First page

2239

Publication year

2024

Publication date

2024

Publisher

MDPI AG

e-ISSN

20770472

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/agriculture14122239

ProQuest document ID

3149497097