Hyper-seq Technology and Genome-Wide Selection

1. Introduction

Soybeans (Glycine max (L.) Merr.) are an important food crop that originated in China, which was also the first country to domesticate and grow this crop [1,2]. Although China has made some achievements in soybean breeding, its soybean self-sufficiency rate is still lower than that of other developed countries, and it mainly relies on imports [3,4]. To fully harness the potential of soybean breeding and elevate its standards is an imperative objective that breeders must strive to achieve. Among the many traits in soybean breeding, the number of main stem nodes plays a crucial role in increasing yield and quality. The importance of the number of main stem nodes lies in its direct correlation with the plant architecture, which affects the number of branches and inflorescences, ultimately influencing the number of pods per plant and the total yield. Consequently, examining the quantity of main stem nodes is crucial for attaining enhanced productivity and advancing towards increased self-sufficiency in soybean production.

Innovation in breeding technology has been driven primarily by significant advances in basic theory. At each historical stage, the key technological tools in use have become an important measure of progress in breeding practice [5]. The evolution of plant breeding to date can be divided into three main stages. The first is the primitive domestication stage, also known as Breeding 1.0, in which wild plants were gradually domesticated into cultivated plants and further selected for superior germplasm, mainly through artificial selection based on people’s experience [6]. The second is the conventional breeding stage, or Breeding 2.0, during which the fundamental theoretical framework of breeding started to take shape despite drawbacks such as costly field trials, lengthy breeding cycles, and low genetic improvement efficiency [6,7]. The third is Breeding 3.0, often known as the molecular breeding stage, in which breeding techniques advanced with the advent of genetic markers and the creation of genotype databases. An example is the progression from molecular marker-assisted backcrossing to lineage confirmation and later to linkage mapping for complex trait analysis [8]. In 2018, Wallace et al. [5], summarizing the development from Breeding 1.0 to Breeding 3.0, proposed the concept of Breeding 4.0, which is considered to be imminent. The core of Breeding 4.0 is a multidisciplinary and interdisciplinary applied technology system. Currently, China mainly adopts traditional soybean breeding methods, but with the rapid development of the biological sciences, the long cycle times and low efficiency of traditional breeding are becoming increasingly prominent, and traditional methods may not be able to meet the needs of industrial development.
However, traditional soybean breeding and molecular plant breeding are not mutually exclusive, but rather can complement each other. Traditional breeding methods can provide basic data and validation for molecular plant breeding, while molecular techniques can improve the efficiency and precision of traditional breeding.

Currently, one of the most advanced breeding technologies is genomic selection (GS), a selective breeding method that uses high-density genetic markers spanning the entire genome [9,10]. By building a prediction model for early individual prediction and selection based on genomic estimated breeding values, this technique can dramatically increase breeding efficiency and shorten the breeding cycle, offering strong technical support for contemporary breeding [11,12,13,14]. Researchers have carried out GS experiments in maize [15], wheat [16], soybean [17], barley [18], and oats [19], as well as in maize haploid breeding [20], cross-breeding [21,22], and doubled-haploid breeding. However, genotyping a large number of samples is expensive because of the size of the soybean genome. It is therefore crucial to use a low-cost, effective, adaptable, and high-throughput method for DNA sequencing library preparation and genotyping in order to cut the time and expense of library construction and sequencing. Hyper-seq technology [23,24] is such a method. The technology has a certain gene-region enrichment effect and is scalable and widely applicable. By using different Hyper-seq primers to flexibly adjust the marker density, and by using a special PCR method that requires no additional digestion or ligation steps, libraries for a large number of samples can be constructed simultaneously and massive genotypic data generated. This can meet the needs of large-scale genotyping and sequencing of various species at a low cost. Because library construction relies on a new, economical, automated, and flexible polymerase chain reaction, Hyper-seq lowers the cost per sample by about 50%.
When used in large-scale genotyping, Hyper-seq enhances sequencing efficiency and offers a viable high-throughput genotyping alternative for a variety of breeders and researchers, particularly when resources are limited.

Differences in prediction accuracy among statistical models are mainly due to their different assumptions about the distribution of marker effects [25]. Statistical models are central to GS research: how well they are trained on phenotypic and genotypic data determines whether marker effects are estimated accurately, which in turn affects subsequent breeding programs. Many studies have proposed new models and compared existing ones [26,27,28]. These models adopt different prior assumptions about the distribution of marker effects and can be divided into parametric and nonparametric types; their ultimate purpose is to achieve effective dimensionality reduction and obtain accurate estimates of marker effects. General statistical models can effectively capture additive genetic effects. However, when populations containing heterozygous individuals or multi-year, multi-location trials are analyzed, non-additive and non-genetic effects, including dominance effects, epistatic effects, and genotype-by-environment interactions, also need to be considered. These are difficult for general models to capture efficiently. To achieve higher prediction accuracy, appropriate models that accurately estimate marker effects should be selected. A variety of GS prediction models are available. The first is the Genomic Best Linear Unbiased Prediction (GBLUP) [29] model, which is based on the mixed linear model framework and predicts the breeding values of individuals from the kinship among individuals estimated from genome-wide markers. The GBLUP model is computationally efficient and stable when dealing with large amounts of genomic data and has been widely used in breeding practice.
Next is the Ridge Regression Best Linear Unbiased Prediction (RRBLUP) model, which introduces a ridge regression penalty term into the mixed model to improve the fit to genomic data, especially when genetic markers are highly correlated. RRBLUP is effective in reducing multicollinearity, thus improving prediction accuracy. However, its ability to capture complex genetic interactions and non-additive effects may be limited, which can affect its overall predictive power. Bayesian methods, including Bayesian_A, Bayesian_B, Bayesian_C, Bayesian_RR, Bayesian_LASSO, and Bayesian_RKHS, are also employed. These models provide a flexible framework for interpreting genetic variation and predicting breeding values by introducing prior knowledge and probabilistic inference. For example, the Bayesian_A model assumes that marker effects follow normal distributions with marker-specific variances. The Bayesian_B model additionally allows a proportion of markers to have no effect, thus capturing the complexity of genetic variation. The Bayesian_RR model combines Bayesian methods with ridge regression and improves predictive stability through shrinkage estimation, and the Bayesian_LASSO model places a double-exponential prior on marker effects so that small effects are shrunk strongly toward zero. Meanwhile, the Bayesian_RKHS model draws on the theory of reproducing kernel Hilbert spaces, can handle nonlinear relationships, and provides a refined tool for prediction. Each model has its own statistical assumptions and algorithmic advantages and sheds light on prediction accuracy from a different perspective.
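
For orientation, the two BLUP-type models described above are commonly written as follows. This is the standard textbook formulation, given only as a sketch: the symbols (y for phenotypes, Z for the marker genotype matrix, u for marker effects, g for genomic breeding values, G for the genomic relationship matrix, and λ for the ridge parameter) are introduced here for illustration, and the exact parameterisation used by the analysis platform is not stated in this study.

```latex
% RR-BLUP: every marker effect is drawn from the same normal distribution
\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{u} + \mathbf{e}, \qquad
\mathbf{u} \sim N(\mathbf{0}, \mathbf{I}\sigma_u^2), \quad
\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)

% Ridge-type solution for the marker effects, with penalty \lambda
\hat{\mathbf{u}} = \left(\mathbf{Z}^{\top}\mathbf{Z} + \lambda\mathbf{I}\right)^{-1}
\mathbf{Z}^{\top}\left(\mathbf{y} - \mathbf{1}\hat{\mu}\right), \qquad
\lambda = \sigma_e^2 / \sigma_u^2

% GBLUP: breeding values are modelled directly through the relationship matrix G
\mathbf{y} = \mathbf{1}\mu + \mathbf{g} + \mathbf{e}, \qquad
\mathbf{g} \sim N(\mathbf{0}, \mathbf{G}\sigma_g^2)
```

The Bayesian models keep the first equation but replace the single shared variance of the marker effects with the model-specific priors described above.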

2. Materials and Methods

2.1. Research Materials

This study used 420 soybean accessions grown at the Batou Base, Yazhou District, Sanya City, Hainan Province (109° E, 18° N). The model prediction dataset consisted of imputed genotype data for the 420 accessions, phenotypic data for the same 420 accessions, and deep resequencing data for 153 accessions.

2.2. DNA Extraction and Hyper-seq Library Construction

A total of 420 soybean accessions were genotyped using Hyper-seq technology. First, young leaves were ground to extract DNA, and DNA quality was assessed by agarose gel electrophoresis. DNA of satisfactory quality was diluted to the working concentration (10 ng/µL), after which a library was constructed using Hyper-seq technology. Hyper-seq is an effective, flexible, and potentially powerful tool for marker-assisted selection and genotyping. For each sample, library preparation started from DNA, which was amplified in a 20-µL PCR reaction. Each reaction consisted of 2.0 µL of DNA, 17.0 µL of Mix, and 1.0 µL of each primer. Hyper-seq library preparation is designed to require only a single polymerase chain reaction step. The annealing temperature was 35 °C for the first 5 cycles and was then increased to 55 °C for the subsequent 30 cycles. This ensures that DNA products are efficiently amplified during the first 5 cycles and consistently amplified in an exponential fashion throughout the remaining cycles. The Hyper-seq PCR products were then pooled in equal amounts for each sample, with up to 384 samples in a single library.

2.3. Sequencing and Data Quality Control

The sequencing service was provided by Personal Biotechnology Company, Shanghai, China and the Sanya UW Institute of Life Sciences. The raw data were processed using the specialized tool fastp-0.23.4 [30] for quality control and filtration of next-generation sequencing (NGS) data. During the processing stage, fastp employed the following parameters for filtering: Firstly, a minimum average quality threshold was set, resulting in the discarding of reads with an average quality score below 5. Secondly, a minimum read length criterion was established, leading to the exclusion of reads that were shorter than 50 base pairs. Lastly, reads containing more than 10 undefined nucleotide bases (denoted by ‘N’) were also rejected. These configurations ensured that the final dataset maintained a high degree of quality and integrity.
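
The three read-level criteria above can be expressed as a single pass/fail check per read. The following is a minimal Python sketch of that logic, not the fastp implementation: the function names are hypothetical, and Phred+33 quality encoding is assumed.

```python
def mean_phred(quality_string, offset=33):
    """Average Phred score of a FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores) if scores else 0.0


def passes_filters(sequence, quality_string,
                   min_mean_quality=5, min_length=50, max_n_bases=10):
    """Return True if a read survives the three filters described in the text."""
    # Reads shorter than 50 bp are excluded.
    if len(sequence) < min_length:
        return False
    # Reads with more than 10 undefined bases ('N') are rejected.
    if sequence.upper().count("N") > max_n_bases:
        return False
    # Reads with an average quality score below 5 are discarded.
    return mean_phred(quality_string) >= min_mean_quality
```

A read is kept only when all three conditions hold, so the order of the checks affects speed but not the result.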

2.4. SNP Identification and Filtering

The clean data were mapped to the soybean reference genome (GCA_030864155.1), obtained from the National Center for Biotechnology Information (NCBI), using BWA-0.7.17 [31]. Samtools-1.20 [32] was employed to remove duplicate reads from the BAM files. SNPs were then called using GATK-4.2.6.1 [33] and filtered using vcftools-0.1.16 [34]. SNP sites were filtered to retain only those with a missing rate of no more than 50% and a sequencing depth of at least 3. Beagle-5.5 [35] was used to impute missing genotypes in the Hyper-seq data, yielding a complete, imputed genotype dataset. For the 153 resequencing datasets and the 153 Hyper-seq datasets, the same processing pipeline was applied, with reads mapped to the soybean reference genome GCA_000004515.4.
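
The site-level filter can be illustrated with a small Python sketch. It is an illustration under stated assumptions rather than the vcftools logic: the function name is hypothetical, a missing genotype is represented as None, and "sequencing depth of at least 3" is interpreted as the mean depth across called samples, which the text does not specify.

```python
def keep_site(sample_depths, max_missing_rate=0.5, min_depth=3):
    """Decide whether one SNP site passes the missing-rate and depth filters.

    sample_depths: per-sample read depth at the site; None marks a missing genotype.
    """
    if not sample_depths:
        return False
    called = [d for d in sample_depths if d is not None]
    missing_rate = 1 - len(called) / len(sample_depths)
    # Retain only sites with a missing rate of no more than 50%.
    if missing_rate > max_missing_rate or not called:
        return False
    # Assumption: the depth threshold applies to the mean depth of called samples.
    mean_depth = sum(called) / len(called)
    return mean_depth >= min_depth
```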

2.5. Establishment of Different Data Sets

Using Bcftools-1.20 [36], random screenings for SNP loci were conducted on 420 Hyper-seq datasets, successfully producing five datasets with varying numbers of SNPs: 1000_SNP.VCF, 5000_SNP.VCF, 10,000_SNP.VCF, 15,000_SNP.VCF, and 20,000_SNP.VCF. Taking the 1000_SNP.VCF dataset as an example, the creation process was as follows: First, all variant chromosome numbers and position information were extracted from the VCF file; then, these variants were randomly sorted, followed by selecting the first 1000 variants from the sorted list; finally, these 1000 variants were sorted by chromosome number and position, with the results saved as the 1000_SNP.VCF file. The creation of the other datasets (5000_SNP.VCF, 10,000_SNP.VCF, 15,000_SNP.VCF, and 20,000_SNP.VCF) followed the same procedure.
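
The subsetting procedure described above (extract positions, shuffle, take the first n, sort again) can be sketched in Python as follows. This is an illustrative sketch only: the function name and the (chromosome number, position) tuple representation are assumptions, and the study itself used Bcftools for this step.

```python
import random


def subsample_variants(variants, n, seed=None):
    """Randomly pick n variants and return them in genomic order.

    variants: list of (chromosome_number, position) tuples taken from a VCF file.
    """
    rng = random.Random(seed)
    shuffled = list(variants)   # copy so the input list is left untouched
    rng.shuffle(shuffled)       # randomly sort the variants
    chosen = shuffled[:n]       # select the first n variants
    return sorted(chosen)       # sort by chromosome number, then position
```

Passing a fixed seed makes a subset reproducible, which matters when the same 1000-SNP or 5000-SNP panel has to be reused across all eight models.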

The 420 Hyper-seq samples were also randomly subsampled to generate four datasets of different sizes: 100, 200, 300, and 400 samples.

2.6. Phenotypic Data Analysis Methods

Phenotypic data were collected from the 420 soybean accessions over 3 consecutive years. Each trait was recorded in three replicates per year to obtain a stable measure of performance, and best linear unbiased prediction (BLUP) values calculated from the 3-year data were used for the subsequent GS studies.

2.7. Genome-Wide Selection Prediction

We performed a thorough sample correspondence verification using the genomic-selection-dataset module of the Alibaba Cloud Genomic Analysis Platform https://easygene.console.aliyun.com (accessed on 14 October 2024) [37]. The accuracy and consistency of the data for upcoming analysis were guaranteed by this step. We used the platform’s genomic-selection-train module to choose the dataset produced in the preceding stage as the foundation for a new running job after finishing the sample correspondence check. According to the study criteria, we set the number of cross-validation folds to five and chose phenotype and genotype data that were pertinent to the research subjects.
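
Fivefold cross-validation of this kind can be sketched in plain Python. The sketch below is not the platform's implementation: the function names are hypothetical, `fit_predict` stands in for whichever model is being trained, and prediction accuracy is assumed to be the Pearson correlation between predicted and observed phenotypes, a common definition that this study does not state explicitly.

```python
import random
from statistics import mean


def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cross_validate(phenotypes, fit_predict, k=5, seed=0):
    """Average prediction accuracy over k folds.

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains a
    model on the training samples and returns predictions for the test samples.
    """
    accuracies = []
    for test_idx in kfold_indices(len(phenotypes), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(phenotypes)) if i not in held_out]
        predicted = fit_predict(train_idx, test_idx)
        observed = [phenotypes[i] for i in test_idx]
        accuracies.append(pearson(predicted, observed))
    return mean(accuracies)
```

With 420 samples and five folds, each fold holds out 84 samples and trains on the remaining 336, which matches the 80%/20% split used in this study.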

3. Results and Analysis

3.1. Analysis of Phenotypic and Genotypic Data

We performed basic statistical analyses of main stem node number in a natural population of 420 soybean accessions (Table 1). The data from all three years conformed to a normal distribution (Figure 1), and correlation analyses showed positive correlations among the three years of main stem node number data (Figure 2a). For phenotyping, we used BLUP values to remove, as far as possible, the phenotypic differences caused by different environments and years. After preprocessing of the genotype data, 104,728 high-quality single nucleotide polymorphisms (SNPs) were retained for the 420 accessions (Figure 2b). We randomly divided the 420 samples into two parts: a training set consisting of 336 (80%) randomly selected samples, and a prediction set consisting of the remaining 84 (20%) samples. These two datasets were used for the subsequent comparative modelling analyses.

3.2. Analysis of Different Model Prediction Results

We first mapped reads to the Williams82 soybean reference genome (GCA_030864155.1), which yielded 104,728 SNP markers for the 420 accessions. For the experiment, we randomly divided the 420 accessions into two groups: a training set with 80% of the samples and a prediction set with 20%. We used eight models for prediction; each model was evaluated with fivefold cross-validation, and the results were averaged (Table 2). GBLUP, BAYES_A, BAYES_B, BAYES_LASSO, and BAYES_RKHS all had prediction accuracies above 56%, with BAYES_A achieving the highest (57.24%). RRBLUP performed relatively poorly, about 6 percentage points below BAYES_A. In terms of model runtime, the BAYES models all had long runtimes; even the fastest of them, BAYES_RKHS (9 min 3 s), took about 15 times longer than GBLUP (35 s). Prediction accuracy varied little among the models applied to the Hyper-seq dataset (Figure 3). Depending on the actual population size and the available computational resources, an appropriate model can therefore be selected flexibly for prediction with Hyper-seq data.

3.3. Effect of Different Number of SNPs on Model Accuracy

We constructed six datasets: five containing 1000, 5000, 10,000, 15,000, and 20,000 SNP markers, and one containing the full set of SNPs (104,728). We analyzed each of these datasets with the eight prediction models and recorded the results. The results obtained as the number of SNP markers increased are shown in Table 3. Once the number of SNP markers reaches 15,000, its effect on model accuracy gradually levels off (Figure 4). This finding indicates that increasing the number of SNP markers improves prediction accuracy only within a certain range; beyond that range, prediction accuracy plateaus.

3.4. Effect of Different Sample Sizes on Model Accuracy

We used the natural population of 420 soybean accessions as experimental material and employed eight predictive models for GS analysis. To assess the effect of training set size on prediction accuracy, we increased the sample size stepwise from 100 to 420, with the samples at each step randomly assigned to a training set (80%) or a prediction set (20%). We then performed fivefold cross-validation and averaged the validation results to obtain a reliable estimate of predictive accuracy. Across the eight prediction models (GBLUP, RRBLUP, Bayesian_A, Bayesian_B, Bayesian_C, Bayesian_RR, Bayesian_LASSO, and Bayesian_RKHS), prediction accuracies ranged from approximately 0.24 to 0.57, as shown in Table 3. As the training set size increased, almost all the models showed some improvement in prediction accuracy. To show the prediction accuracy of the models under different training set sizes intuitively, we plotted the prediction accuracy for main stem node number as line graphs (Figure 5). For GBLUP, the average prediction accuracy increased gradually from 0.3421 to 0.5605 as the training set grew. This finding suggests that increasing the size of the training set improves the accuracy with which the models predict the number of main stem nodes in soybean.

3.5. Impact of Different Sequencing Types on Model Accuracy

From the 420 soybean accessions used as experimental material, we selected 153 accessions with both whole-genome resequencing data and Hyper-seq data. All eight models were run on both datasets to assess the differences in prediction accuracy (Table 4). The absolute difference in prediction accuracy between Hyper-seq data and resequencing data was below 0.02 for the Bayesian_A, Bayesian_B, and Bayesian_C models, between 0.02 and 0.04 for the GBLUP and Bayesian_RKHS models, and between 0.04 and 0.05 for the RRBLUP, Bayesian_LASSO, and Bayesian_RR models. To show the prediction accuracy of Hyper-seq data versus resequencing data under the different models intuitively, we plotted the prediction accuracy for main stem node number as bar charts (Figure 6). These results indicate that the difference in predictive accuracy between Hyper-seq data and resequencing data is small.
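
The per-model gaps between the two data types can be recomputed directly from the accuracies reported in Table 4. The short Python sketch below does this; the variable names are chosen for illustration, and the values are copied from Table 4.

```python
models = ["GBLUP", "RRBLUP", "BAYES_A", "BAYES_B",
          "BAYES_C", "BAYES_LASSO", "BAYES_RKHS", "BAYES_RR"]
# Prediction accuracies from Table 4 (153 samples per data type).
hyper_seq = [0.3960, 0.3577, 0.3879, 0.3943, 0.3793, 0.3914, 0.3733, 0.3749]
resequencing = [0.4307, 0.4031, 0.4059, 0.3987, 0.3634, 0.4321, 0.3959, 0.4195]

# Positive values mean resequencing data gave the higher accuracy.
differences = {
    model: round(wgrs - hs, 4)
    for model, hs, wgrs in zip(models, hyper_seq, resequencing)
}
largest_gap = max(abs(d) for d in differences.values())
mean_hyper_seq = sum(hyper_seq) / len(hyper_seq)
mean_resequencing = sum(resequencing) / len(resequencing)

for model, diff in differences.items():
    print(f"{model}: {diff:+.4f}")
```

Keeping the sign of each difference shows which data type performed better in each model, which an absolute range alone would hide.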

4. Discussion

With their growing use and significance in crop breeding, GS strategies are essential to improving crop yield, quality, and resistance, as well as to the creation of future breeding technologies. According to Xu et al. [38], GS is a primary technological pillar driving the new era of breeding. By assessing genetic variation directly, this method gives breeders a quick and accurate way to screen for and enhance desired traits. In this work, we used Hyper-seq data to conduct a comprehensive comparison of eight prediction models, comprising GBLUP, RRBLUP, and six BAYES models, for the number of main stem nodes in soybean, an important agronomic trait. We found that the GBLUP and BAYES approaches have comparatively high prediction accuracy. RRBLUP performs somewhat worse in terms of prediction accuracy, while GBLUP exceeds the BAYES models in computing efficiency. The model prediction results based on the Hyper-seq dataset show that BAYES_A has the highest accuracy, GBLUP is the fastest, and RRBLUP has the worst performance. This may be because the number of main stem nodes in soybean is influenced by various genetic and environmental factors, among which there may be complex interactions. The RRBLUP model, based on linear assumptions, may not accurately capture these nonlinear relationships and interactions, leading to inaccurate predictions. Additionally, the genetic structure of soybean may be very complex, including additive, dominance, and epistatic effects. RRBLUP mainly considers additive effects and may ignore other types of genetic effects, which limits the model’s predictive power. The BAYES_A model allows each marker to have a different effect size, meaning it can better capture the genetic diversity of the number of main stem nodes in soybean, including a few key genes that may have larger effects.
BAYES_A may also be more capable of integrating information on non-additive effects (such as dominance and epistasis), which is very important for describing the genetic variation of the number of main stem nodes in soybean. Although GBLUP is also based on a linear model, it remains a valid choice for very large breeding populations when computational resources are limited, provided that it delivers sufficient predictive accuracy. It also meets the time and economic cost requirements of research. Therefore, when selecting a model, one should consider the specific objectives of the research, the characteristics of the data, and the available computational resources.

In this study, we found that an increase in the number of SNPs does not always bring a proportional increase in model prediction accuracy. Specifically, with the Hyper-seq data, the prediction accuracy of the different genomic selection models begins to plateau when the number of SNPs reaches approximately 15,000; i.e., adding further SNPs does not significantly improve the predictive power of the models. This finding is consistent with the theoretical predictions of Meuwissen et al. [39], who proposed an equilibrium between marker density and population size. Beyond the optimal marker density, increasing the number of markers has a limited effect on prediction accuracy. This phenomenon may be due to the structure of linkage disequilibrium between genetic markers: once marker density reaches a certain level, additional markers may not provide new genetic information. A related study showed that in wheat populations, model accuracy stabilized at 128 markers and did not increase further when 485 markers were used for genomic prediction. By contrast, in maize, at least 800 markers were needed to reach a similar plateau [40]. These findings illustrate the differences in the number of markers required by different species and datasets. In practical genomic selection applications, multiple factors need to be considered to develop effective strategies. The results of this study emphasise the importance of the rational selection and utilisation of SNP markers in GS and provide directions for future research on how to allocate resources optimally while ensuring prediction accuracy. Furthermore, population size has a significant effect on the accuracy of GS predictions. Large training populations favour accurate model estimation [41]. We used eight different models with fivefold cross-validation and set the sample gradients to 100, 200, 300, and 420.
Taking the GBLUP model as an example, the accuracy of the model reached 0.3421, 0.4298, 0.4774, and 0.5605 as the sample size increased from 100 to 420, showing an upward trend. This finding further confirms the important effect of sample size on prediction accuracy. Therefore, when resources permit, breeders should increase the size of training populations as much as possible to improve the efficiency and reliability of genomic predictions.

Through a comparative analysis of eight models on 153 Hyper-seq and 153 resequencing datasets, we found that Hyper-seq technology offers certain benefits for GS breeding applications. Specifically, the difference in prediction accuracy between Hyper-seq data and resequencing data was small, no more than about 0.05 in any model. This result further confirms the potential of Hyper-seq technology in GS breeding. However, the average prediction accuracy of Hyper-seq data across all models was slightly lower than that of whole-genome resequencing data. This indicates that, although Hyper-seq technology may provide higher-resolution genotype data, this does not necessarily translate into improved prediction performance in all models. Whole-genome resequencing data showed higher accuracy in most models, which may be related to its ability to capture more genetic variation. Different data types may require specific models to maximize prediction accuracy. For instance, Hyper-seq data may contain more noise, necessitating models that can accommodate this characteristic. Whole-genome resequencing data, owing to its comprehensiveness and depth, may be more readily compatible with various models, thus maintaining higher prediction accuracy across them. Hyper-seq technology is a DNA sequencing library preparation and genotyping method characterised by very low cost and high efficiency, flexibility, and throughput. Its cost advantage is particularly significant when processing samples with a genome size of approximately 1 gigabase (Gb), reducing costs by approximately 50%. This economic advantage makes Hyper-seq an efficient tool for processing large-scale breeding data. The entire process from DNA extraction to library construction has also been automated, significantly reducing the complexity and time cost of experimental operations and improving the accuracy and repeatability of experimental data.
The introduction of automated processes provides stable and reliable experimental support for GS breeding, thus enhancing the efficiency and credibility of the entire breeding process.

In conclusion, our findings show that Hyper-seq technology performs well in GS breeding and offer a strong theoretical underpinning and technical backing for its eventual use in breeding practice. Currently, big data mining technology is complex and limited in its versatility, and genotype detection technology is expensive and time-consuming. In contrast, the automated, high-efficiency, and low-cost nature of Hyper-seq technology points to wide application prospects in fostering the growth of the contemporary seed industry, speeding up the breeding process, and increasing breeding efficiency. As a result, Hyper-seq technology is anticipated to play a significant role in genetic engineering breeding and to support the long-term growth of global agriculture and food security.

5. Conclusions

This study indicates that the Hyper-seq technology is suitable for genomic selection (GS) breeding, and the Hyper-seq model constructed with 15,000 SNP markers demonstrates stability. As an innovative technology for DNA sequencing library preparation and genotyping, it boasts low cost, high efficiency, and high throughput, showing great potential in the field of breeding. The predictive accuracy of the Hyper-seq model is close to that of whole-genome resequencing data, further validating its effectiveness. Currently, the technology has achieved full-process automation from DNA extraction to library construction, which not only reduces costs and shortens time, but also enhances the accuracy and reproducibility of experiments.

Author Contributions

Conceptualization, M.Z.; Methodology, Q.W. and M.H.; Software, Q.W. and M.H.; Validation, Q.W., M.H., R.X., S.P., L.Y., Y.X., X.L. and M.Z.; Formal analysis, Q.W., M.H., Y.Z., R.X., and T.L.; Investigation, Q.W., Y.Z., R.X., T.L., J.C., L.Y., Y.X., X.L. and M.Z.; Resources, Y.Z. and S.P.; Data curation, Q.W., M.H., Y.Z., R.X., T.L., S.P., J.C., L.Y., Y.X. and X.L.; Writing—original draft, Q.W. and J.C.; Writing—review & editing, M.H. and M.Z.; Visualization, Q.W.; Supervision, M.H. and M.Z.; Project administration, H.L., Z.X. and M.Z.; Funding acquisition, H.L., Z.X. and M.Z. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The data reported in this paper have been stored in the National GeneBank Life Big Data Platform (CNGBdb) with the accession number CNP0006515 and can be accessed at https://db.cngb.org/cnsa/ (accessed on 18 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. Histogram of phenotypic data for the number of main stem nodes over three years (a) Number of main stem nodes in 2021 (b) Number of main stem nodes in 2022 (c) Number of main stem nodes in 2023.

Figure 2. (a) Correlation of phenotypic data for the number of main stem nodes over three years (b) SNP density distribution.

Figure 3. Prediction accuracy of the model.

Figure 4. Predictive accuracy of models with different numbers of SNPs.

Figure 5. Predictive accuracy of the models under sample gradients of 100, 200, 300, and 420 (a) GBLUP, (b) RRBLUP, (c) BAYES_A, (d) BAYES_B, (e) BAYES_C, (f) BAYES_LASSO, (g) BAYES_RKHS, and (h) BAYES_RR.

Figure 6. Comparison of model differences between the WGRS dataset and the Hyper-seq dataset.

Table 1

Statistical Information on Soybean Traits.

Trait	Mean	Standard Deviation	Coefficient of Variation (%)	Max	Min
Number of main stem nodes	10.87	3.03	27.92	26.33	4

Table 2

Comparison of the prediction accuracy of the models.

Model	GBLUP	RRBLUP	BAYES_A	BAYES_B	BAYES_C	BAYES_LASSO	BAYES_RKHS	BAYES_RR
Prediction accuracy	56.05%	51.04%	57.24%	56.46%	55.21%	56.49%	56.08%	55.21%
Time	6′21″	4 h 8′48″	4 h 55′26″	7 h 34′41″	7 h 3′24″	8 h 11″	7′22″	4 h 11′47″

Table 3

Comparative analysis of prediction accuracy for eight models based on SNP markers and sample size variations.

Number of SNPs	Sample Size	GBLUP	RRBLUP	BAYES_A	BAYES_B	BAYES_C	BAYES_LASSO	BAYES_RKHS	BAYES_RR
1000-SNP	420	0.318	0.2823	0.321	0.2881	0.3216	0.3229	0.3103	0.3178
5000-SNP	420	0.4432	0.4387	0.4245	0.4156	0.4349	0.4505	0.4232	0.4327
10,000-SNP	420	0.4638	0.4513	0.4803	0.4676	0.4727	0.4911	0.4629	0.4665
15,000-SNP	420	0.5249	0.4756	0.5065	0.5213	0.5269	0.5186	0.5091	0.5187
20,000-SNP	420	0.5323	0.4911	0.5427	0.5291	0.5561	0.5413	0.5251	0.5355
ALL-SNP	420	0.5605	0.5104	0.5724	0.5646	0.5521	0.5649	0.5608	0.5521
ALL-SNP	300	0.4774	0.4572	0.4702	0.4628	0.4846	0.4702	0.4688	0.4654
ALL-SNP	200	0.4298	0.3855	0.3815	0.4116	0.3953	0.4182	0.4004	0.3887
ALL-SNP	100	0.3421	0.2376	0.2906	0.2718	0.2391	0.3408	0.2888	0.2717
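
To make the trends in Table 3 easier to read, the following sketch averages the eight model accuracies in each row and reports the best-performing model per row. The numbers are copied from the table; the dictionary layout and the `summarize` helper are illustrative assumptions, not part of the original analysis.

```python
from statistics import mean

# Column order of the eight models, as in Table 3.
MODELS = ["GBLUP", "RRBLUP", "BAYES_A", "BAYES_B",
          "BAYES_C", "BAYES_LASSO", "BAYES_RKHS", "BAYES_RR"]

# (SNP subset, sample size) -> accuracies copied from Table 3.
TABLE3 = {
    ("1000", 420): [0.318, 0.2823, 0.321, 0.2881, 0.3216, 0.3229, 0.3103, 0.3178],
    ("5000", 420): [0.4432, 0.4387, 0.4245, 0.4156, 0.4349, 0.4505, 0.4232, 0.4327],
    ("10000", 420): [0.4638, 0.4513, 0.4803, 0.4676, 0.4727, 0.4911, 0.4629, 0.4665],
    ("15000", 420): [0.5249, 0.4756, 0.5065, 0.5213, 0.5269, 0.5186, 0.5091, 0.5187],
    ("20000", 420): [0.5323, 0.4911, 0.5427, 0.5291, 0.5561, 0.5413, 0.5251, 0.5355],
    ("ALL", 420): [0.5605, 0.5104, 0.5724, 0.5646, 0.5521, 0.5649, 0.5608, 0.5521],
    ("ALL", 300): [0.4774, 0.4572, 0.4702, 0.4628, 0.4846, 0.4702, 0.4688, 0.4654],
    ("ALL", 200): [0.4298, 0.3855, 0.3815, 0.4116, 0.3953, 0.4182, 0.4004, 0.3887],
    ("ALL", 100): [0.3421, 0.2376, 0.2906, 0.2718, 0.2391, 0.3408, 0.2888, 0.2717],
}


def summarize(key):
    """Return (mean accuracy across models, name of the best model) for one row."""
    values = TABLE3[key]
    best_model = MODELS[values.index(max(values))]
    return mean(values), best_model


for (snps, n), _ in TABLE3.items():
    avg, best = summarize((snps, n))
    print(f"SNPs={snps:>5}  n={n}  mean accuracy={avg:.4f}  best={best}")
```

Averaging across models is only one possible summary; it hides per-model differences, which remain visible in the table itself.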

Table 4

Comparison of prediction accuracy of eight models under different datasets.

Data Set	GBLUP	RRBLUP	BAYES_A	BAYES_B	BAYES_C	BAYES_LASSO	BAYES_RKHS	BAYES_RR
Hyper-seq	0.3960	0.3577	0.3879	0.3943	0.3793	0.3914	0.3733	0.3749
Whole Genome Resequencing	0.4307	0.4031	0.4059	0.3987	0.3634	0.4321	0.3959	0.4195
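
The gap between the two datasets in Table 4 can be tabulated per model. The sketch below subtracts the Hyper-seq accuracy from the whole-genome resequencing accuracy for each model and averages the gaps; the values are copied from the table, while the variable names are illustrative assumptions.

```python
# Column order of the eight models, as in Table 4.
MODELS = ["GBLUP", "RRBLUP", "BAYES_A", "BAYES_B",
          "BAYES_C", "BAYES_LASSO", "BAYES_RKHS", "BAYES_RR"]

# Accuracies copied from Table 4.
HYPER_SEQ = [0.3960, 0.3577, 0.3879, 0.3943, 0.3793, 0.3914, 0.3733, 0.3749]
WGRS = [0.4307, 0.4031, 0.4059, 0.3987, 0.3634, 0.4321, 0.3959, 0.4195]

# Positive gap: resequencing is more accurate; negative gap: Hyper-seq is.
gaps = {m: round(w - h, 4) for m, h, w in zip(MODELS, HYPER_SEQ, WGRS)}
mean_gap = sum(gaps.values()) / len(gaps)

for model, gap in gaps.items():
    print(f"{model:<12} {gap:+.4f}")
print(f"Mean gap across models: {mean_gap:+.4f}")
```

The per-model gaps differ in size and, for one model, in sign, so the mean gap is best read as a summary across models rather than as a bound on each one.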

References

1. Hartman, G.L.; West, E.D.; Herman, T.K. Crops that feed the World 2. Soybean—Worldwide production, use, and constraints caused by pathogens and pests. Food Secur.; 2011; 3, pp. 5-17. [DOI: https://dx.doi.org/10.1007/s12571-010-0108-x]

2. Tan, H.; Sun, H.X.; Kong, M.M.; Wang, J.J.; Li, C.; Li, Y.Q. The Origin of Soybean in China, Development of Breeding, and Cultivation Techniques. Mol. Plant Breed.; 2024; 4, pp. 1-9.

3. Barabaschi, D.; Tondelli, A.; Desiderio, F.; Volante, A.; Vaccino, P.; Valè, G.; Cattivelli, L. Next generation breeding. Plant Sci.; 2016; 242, pp. 3-13. [DOI: https://dx.doi.org/10.1016/j.plantsci.2015.07.010]

4. Wu, Y.; Wang, E.; Gong, W.; Xu, L.; Zhao, Z.; He, D.; Yang, F.; Wang, X.; Yong, T.; Liu, J. et al. Soybean yield variations and the potential of intercropping to increase production in China. Field Crops Res.; 2023; 291, 108771. [DOI: https://dx.doi.org/10.1016/j.fcr.2022.108771]

5. Wallace, J.G.; Rodgers-Melnick, E.; Buckler, E.S. On the road to breeding 4.0: Unraveling the good, the bad, and the boring of crop quantitative genomics. Annu. Rev. Genet.; 2018; 52, pp. 421-444. [DOI: https://dx.doi.org/10.1146/annurev-genet-120116-024846] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30285496]

6. Jing, H.; Tian, Z.; Chong, K.; Li, J. Progress and perspective of molecular design breeding. Sci. Sin. Vitae; 2021; 51, pp. 1356-1365. [DOI: https://dx.doi.org/10.1360/SSV-2021-0214]

7. Hickey, L.T.; Hafeez, A.N.; Robinson, H.; Jackson, S.A.; Leal-Bertioli, S.C.; Tester, M.; Gao, C.; Godwin, I.D.; Hayes, B.J.; Wulff, B.B. Breeding crops to feed 10 billion. Nat. Biotechnol.; 2019; 37, pp. 744-754. [DOI: https://dx.doi.org/10.1038/s41587-019-0152-9]

8. Lander, E.S.; Botstein, D. Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics; 1989; 121, pp. 185-199. [DOI: https://dx.doi.org/10.1093/genetics/121.1.185] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/2563713]

9. Dou, T.; Wu, J.; Wu, Z.; Bai, L.; Li, X.; Han, X.; Qiao, R.; Wang, K.; Yang, F.; Wang, Y. et al. Application of Genomic Selection and Mating Design Techniques in Pig Breeding. Acta Vet. Zootech. Sin.; 2024; 55, pp. 2795-2808.

10. Desta, Z.A.; Ortiz, R. Genomic selection: Genome-wide prediction in plant improvement. Trends Plant Sci.; 2014; 19, pp. 592-601. [DOI: https://dx.doi.org/10.1016/j.tplants.2014.05.006] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24970707]

11. Heffner, E.L.; Lorenz, A.J.; Jannink, J.; Sorrells, M.E. Plant breeding with genomic selection: Gain per unit time and cost. Crop Sci.; 2010; 50, pp. 1681-1690. [DOI: https://dx.doi.org/10.2135/cropsci2009.11.0662]

12. Jannink, J.-L.; Lorenz, A.J.; Iwata, H. Genomic selection in plant breeding: From theory to practice. Brief. Funct. Genom.; 2010; 9, pp. 166-177. [DOI: https://dx.doi.org/10.1093/bfgp/elq001]

13. Millet, E.J.; Kruijer, W.; Coupel-Ledru, A.; Prado, S.A.; Cabrera-Bosquet, L.; Lacube, S.; Charcosset, A.; Welcker, C.; van Eeuwijk, F.; Tardieu, F. Genomic prediction of maize yield across European environmental conditions. Nat. Genet.; 2019; 51, pp. 952-956. [DOI: https://dx.doi.org/10.1038/s41588-019-0414-y] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31110353]

14. Yu, X.; Li, X.; Guo, T.; Zhu, C.; Wu, Y.; Mitchell, S.E.; Roozeboom, K.L.; Wang, D.; Wang, M.L.; Pederson, G.A. et al. Genomic prediction contributing to a promising global strategy to turbocharge gene banks. Nat. Plants; 2016; 2, 16150. [DOI: https://dx.doi.org/10.1038/nplants.2016.150]

15. Bernardo, R.; Yu, J. Prospects for genomewide selection for quantitative traits in maize. Crop Sci.; 2007; 47, pp. 1082-1090. [DOI: https://dx.doi.org/10.2135/cropsci2006.11.0690]

16. Heffner, E.L.; Jannink, J.; Iwata, H.; Souza, E.; Sorrells, M.E. Genomic selection accuracy for grain quality traits in biparental wheat populations. Crop Sci.; 2011; 51, pp. 2597-2606. [DOI: https://dx.doi.org/10.2135/cropsci2011.05.0253]

17. Stewart-Brown, B.B.; Song, Q.; Vaughn, J.N.; Li, Z. Genomic selection for yield and seed composition traits within an applied soybean breeding program. G3 Genes|Genomes|Genet.; 2019; 9, pp. 2253-2265. [DOI: https://dx.doi.org/10.1534/g3.118.200917] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31088906]

18. Lorenz, A.J.; Smith, K.; Jannink, J. Potential and optimization of genomic selection for fusarium head blight resistance in six-row barley. Crop Sci.; 2012; 52, pp. 1609-1621. [DOI: https://dx.doi.org/10.2135/cropsci2011.09.0503]

19. Asoro, F.G.; Newell, M.A.; Beavis, W.D.; Scott, M.P.; Tinker, N.A.; Jannink, J.L. Genomic, marker-assisted, and pedigree-BLUP selection methods for β-glucan concentration in elite oat. Crop Sci.; 2013; 53, pp. 1894-1906. [DOI: https://dx.doi.org/10.2135/cropsci2012.09.0526]

20. Xu, L.Y.; Wang, L.M.; Ren, Z.P.; She, N.A.; Wang, L.M.; Liu, F.; Ma, X.C.; Li, H.L.; Cui, M.L. Current development and prospects for efficient application of maize haploid breeding technology. China Seed Ind.; 2024; 9, pp. 33-36.

21. Larièpe, A.; Moreau, L.; Laborde, J.; Bauland, C.; Mezmouk, S.; Décousset, L.; Mary-Huard, T.; Fiévet, J.B.; Gallais, A.; Dubreuil, P. et al. General and specific combining abilities in a maize (Zea mays L.) test-cross hybrid panel: Relative importance of population structure and genetic divergence between parents. Theor. Appl. Genet.; 2017; 130, pp. 403-417. [DOI: https://dx.doi.org/10.1007/s00122-016-2822-z]

22. Xiao, Y.; Jiang, S.; Cheng, Q.; Wang, X.; Yan, J.; Zhang, R.; Qiao, F.; Ma, C.; Luo, J.; Li, W. et al. The genetic mechanism of heterosis utilization in maize improvement. Genome Biol.; 2021; 22, 148. [DOI: https://dx.doi.org/10.1186/s13059-021-02370-7] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33971930]

23. Zou, M.; Xia, Z. Hyper-seq: A novel, effective, and flexible marker-assisted selection and genotyping approach. Innovation; 2022; 3, 100254. [DOI: https://dx.doi.org/10.1016/j.xinn.2022.100254] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35602119]

24. Zhang, C.; Jiang, S.; Tian, Y.; Dong, X.; Xiao, J.; Lu, Y.; Liang, T.; Zhou, H.; Xu, D.; Zhang, H. et al. Smart breeding driven by advances in sequencing technology. Mod. Agric.; 2023; 1, pp. 43-56. [DOI: https://dx.doi.org/10.1002/moda.8]

25. Habier, D.; Fernando, R.L.; Kizilkaya, K.; Garrick, D.J. Extension of the bayesian alphabet for genomic selection. BMC Bioinform.; 2011; 12, 186. [DOI: https://dx.doi.org/10.1186/1471-2105-12-186] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21605355]

26. Cuevas, J.; Pérez-Elizalde, S.; Soberanis, V.; Pérez-Rodríguez, P.; Gianola, D.; Crossa, J. Bayesian genomic-enabled prediction as an inverse problem. G3 Genes|Genomes|Genet.; 2014; 4, pp. 1991-2001. [DOI: https://dx.doi.org/10.1534/g3.114.013094] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/25155273]

27. Ceron-Rojas, J.J.; Crossa, J.; Arief, V.N.; Basford, K.; Rutkoski, J.; Jarquín, D.; Alvarado, G.; Beyene, Y.; Semagn, K.; DeLacy, I. A genomic selection index applied to simulated and real data. G3 Genes|Genomes|Genet.; 2015; 5, pp. 2155-2164. [DOI: https://dx.doi.org/10.1534/g3.115.019869] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26290571]

28. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature; 2015; 521, pp. 436-444. [DOI: https://dx.doi.org/10.1038/nature14539] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26017442]

29. VanRaden, P. Efficient methods to compute genomic predictions. J. Dairy Sci.; 2008; 91, pp. 4414-4423. [DOI: https://dx.doi.org/10.3168/jds.2007-0980] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/18946147]

30. Chen, S.; Zhou, Y.; Chen, Y.; Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics; 2018; 34, pp. i884-i890. [DOI: https://dx.doi.org/10.1093/bioinformatics/bty560] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30423086]

31. Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics; 2009; 25, pp. 1754-1760. [DOI: https://dx.doi.org/10.1093/bioinformatics/btp324]

32. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N. The sequence alignment/map format and SAMtools. Bioinformatics; 2009; 25, pp. 2078-2079. [DOI: https://dx.doi.org/10.1093/bioinformatics/btp352] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/19505943]

33. McKenna, A.; Hanna, M.; Banks, E.; Sivachenko, A.; Cibulskis, K.; Kernytsky, A.; Garimella, K.; Altshuler, D.; Gabriel, S.; Daly, M. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res.; 2010; 20, pp. 1297-1303. [DOI: https://dx.doi.org/10.1101/gr.107524.110]

34. Danecek, P.; Auton, A.; Abecasis, G.; Albers, C.A.; Banks, E.; DePristo, M.A.; Handsaker, R.E.; Lunter, G.; Marth, G.T.; Sherry, S.T. et al. The variant call format and VCFtools. Bioinformatics; 2011; 27, pp. 2156-2158. [DOI: https://dx.doi.org/10.1093/bioinformatics/btr330] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/21653522]

35. Browning, B.L.; Zhou, Y.; Browning, S.R. A one-penny imputed genome from next-generation reference panels. Am. J. Hum. Genet.; 2018; 103, pp. 338-348. [DOI: https://dx.doi.org/10.1016/j.ajhg.2018.07.015] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30100085]

36. Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M. et al. Twelve years of SAMtools and BCFtools. GigaScience; 2021; 10, giab008. [DOI: https://dx.doi.org/10.1093/gigascience/giab008] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33590861]

37. Li, H.; Li, X.; Zhang, P.; Feng, Y.; Mi, J.; Gao, S.; Sheng, L.; Ali, M.; Yang, Z.; Li, L. et al. Smart Breeding Platform: A web-based tool for high-throughput population genetics, phenomics, and genomic selection. Mol. Plant; 2024; 17, pp. 677-681. [DOI: https://dx.doi.org/10.1016/j.molp.2024.03.002]

38. Xu, Y.; Liu, X.; Fu, J.; Wang, H.; Wang, J.; Huang, C.; Prasanna, B.M.; Olsen, M.S.; Wang, G.; Zhang, A. Enhancing genetic gain through genomic selection: From livestock to plants. Plant Commun.; 2020; 1, 100005. [DOI: https://dx.doi.org/10.1016/j.xplc.2019.100005] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33404534]

39. Meuwissen, T.H.E.; Hayes, B.J.; Goddard, M.E. Prediction of total genetic value using genome-wide dense marker maps. Genetics; 2001; 157, pp. 1819-1829. [DOI: https://dx.doi.org/10.1093/genetics/157.4.1819]

40. Heslot, N.; Yang, H.P.; Sorrells, M.E.; Jannink, J.L. Genomic selection in plant breeding: A comparison of models. Crop Sci.; 2012; 52, pp. 146-160. [DOI: https://dx.doi.org/10.2135/cropsci2011.06.0297]

41. Wang, X.; Xu, Y.Y.; Xu, Y.; Xu, C.W. Research Progress in Genomic Selection Breeding Technology for Crops. Biotechnol. Bull.; 2024; 40, pp. 1-13.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Soybean (Glycine max (L.) Merr.) is a multifunctional crop that contributes significantly to global food security, economic development, and agricultural sustainability. Genomic selection (GS) is widely used in plant breeding and can reduce breeding costs and shorten the breeding cycle compared with traditional breeding methods. In this study, Hyper-seq technology was used to genotype 420 soybean samples from a natural population at 104,728 single nucleotide polymorphism (SNP) sites, and the number of main stem nodes in this population was phenotyped over three years. Eight GS models were compared for validity and accuracy: Ridge Regression Best Linear Unbiased Prediction (RRBLUP), Genomic Best Linear Unbiased Prediction (GBLUP), and six Bayesian methods (Bayesian_A, Bayesian_B, Bayesian_C, Bayesian_RR, Bayesian_LASSO, and Bayesian_RKHS). Model performance was compared using fivefold cross-validation. The results indicate that data obtained with Hyper-seq technology are well suited to breeding experiments, including genome-wide selection. Bayesian_A was the most accurate model, whereas GBLUP was the fastest to compute. With Hyper-seq data, at least 15,000 SNPs were needed for stable model performance. Notably, although 153 Hyper-seq datasets cost 50% less than 153 whole-genome resequencing datasets, the difference in prediction accuracy between the two datasets was less than 4%. This finding further supports the reliability and efficacy of Hyper-seq technology for genome-wide selection breeding.
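
As an illustration of the evaluation scheme described in the abstract, the sketch below runs fivefold cross-validation with a plain ridge regression on simulated marker data and reports the mean Pearson correlation between predicted and observed phenotypes. It is a minimal sketch under stated assumptions, not the authors' pipeline: ridge regression is used here as a simple stand-in for RRBLUP, the penalty `lam` is fixed rather than estimated from variance components, and the genotypes, effect sizes, and all function names are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Fit ridge regression on centered data; return (intercept, marker effects)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    # Dual form: solve an n x n system, which is cheaper when markers >> samples.
    K = Xc @ Xc.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), yc)
    beta = Xc.T @ alpha
    return y_mean - x_mean @ beta, beta


def kfold_accuracy(X, y, lam=1.0, k=5, seed=0):
    """Mean Pearson correlation between predictions and held-out phenotypes."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        intercept, beta = ridge_fit(X[train_idx], y[train_idx], lam)
        pred = intercept + X[test_idx] @ beta
        scores.append(np.corrcoef(pred, y[test_idx])[0, 1])
    return float(np.mean(scores))


# Simulated data (assumption): 200 samples, 50 markers coded 0/1/2.
rng = np.random.default_rng(1)
n_samples, n_markers = 200, 50
X_demo = rng.integers(0, 3, size=(n_samples, n_markers)).astype(float)
true_effects = rng.normal(0.0, 0.5, size=n_markers)
y_demo = X_demo @ true_effects + rng.normal(0.0, 1.0, size=n_samples)

demo_accuracy = kfold_accuracy(X_demo, y_demo, lam=1.0, k=5)
print(f"Mean fivefold prediction accuracy: {demo_accuracy:.3f}")
```

Reporting accuracy as a correlation matches the scale used in Tables 3 and 4; the simulated accuracy itself says nothing about the soybean data.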

Details

Title

Hyper-seq Technology and Genome-Wide Selection Breeding of Soybeans

Author

Wang, Qingyu; He, Miaohua; Zhou, Yonggang; Xu, Rui; Liang, Tiyun; Pei, Shuangkang; Chen, Jianyuan; Yang, Lin; Yu, Xia; Luo, Xuan; Li, Haiyan; Xia, Zhiqiang; Zou, Meiling

First page

264

Publication year

2025

Publication date

2025

Publisher

MDPI AG

e-ISSN

2073-4395

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/agronomy15020264

ProQuest document ID

3170852044