Optimization of high‐throughput marker systems

Full text

Translate

Turn on search term navigation

Abbreviations

BLUE
best linear unbiased estimate

BLUP
best linear unbiased prediction

CV
cross validation

DMY
dry matter yield

DP
mean read depth

GBS
genotyping by sequencing

GEBVs
genomic estimated breeding values

GP
genomic prediction

GWFP
genome-wide family prediction

H²
broad-sense heritability

LD
linkage disequilibrium

PA
predictive ability

PCA
principal component analysis

SNP
single nucleotide polymorphism

INTRODUCTION

Alfalfa (Medicago sativa L.) is a perennial forage crop widely grown for hay, grazing, and silage production. It is among the most important forage crops due to its capacity to fix nitrogen, high forage quality, broad adaptability, and high dry matter yield (DMY) (Fernandez et al., 2019; Small, 2011). Cultivated alfalfa is autotetraploid (2n = 4X = 32), cross-pollinated by insects, and suffers severe inbreeding depression (X. Li & Brummer, 2012). Alfalfa has a relatively large and complex genome (0.8 Gb haploid size) with a high degree of heterozygosity (Shen et al., 2020). These features, among other factors, have negatively impacted breeding efforts and the development of genomic tools for alfalfa (Annicchiarico et al., 2016; Chen et al., 2020).

Typically, alfalfa cultivars are developed through phenotypic recurrent selection as synthetic populations (Julier et al., 2000; X. Li & Brummer, 2012). Breeding goals are focused on improving DMY, persistence, tolerance to abiotic and biotic stresses, and enhancing nutritive value (Hawkins & Yu, 2018; F. He et al., 2020; X. Li & Brummer, 2012). Genetic improvement for DMY has remained stagnant in recent decades, increasing at a rate of 0.5% per year (Annicchiarico, Barrett, et al., 2015; Brummer & Casler, 2014). Better performance of modern cultivars has been shown to be more related to adaptability rather than genetic improvement for DMY per se (Lamb et al., 2006). Lack of genetic gain for DMY could be explained by the lack of a clear harvest index, long breeding cycles, high genotype by environment interaction, tetrasomic inheritance, and low resources combined with high phenotyping costs (Annicchiarico, Barrett, et al.,2015; Murad Leite Andrade et al., 2022; Filho et al., 2023).

Breeding values are typically estimated with best linear unbiased prediction (BLUP) for traits with complex inheritance (Piepho et al., 2008) and used to select superior performing individuals. Best linear unbiased predictions are based on the theory of resemblance (Lynch & Walsh, 1998). Pedigree records from breeding programs can be used to compute the expected additive relationship matrix of individuals or families of a breeding program and predict the performance of untested candidates (Mrode & Thompson, 2005). In the absence of pedigree records, or when there are errors in the pedigree, genomic data may be used to estimate the realized additive relationship matrix among genotypes and perform genomic prediction (GP) (Munoz et al., 2014).

GP allows the estimation of genetic merits of unphenotyped individuals through molecular marker data (Meuwissen et al., 2001). In GP, genomic estimated breeding values (GEBVs) are predicted by fitting statistical models with data collected from a population that has been genotyped and phenotyped (the so-called training population). The performance of unphenotyped individuals (in the so-called target population) is predicted through the GEBVs predicted from the training population, using information from a marker-based relationship matrix between both populations. In this context, a factor that can affect the predictive ability (PA) and the logistic for GP implementation is the marker density (de Bem Oliveira et al., 2020; Lorenzana & Bernardo, 2009). As marker density increases, leverage on PA and cost are commonly observed, although a plateau in PA has been reported in several crops such as blueberry (de Bem Oliveira et al., 2020), wheat (Norman et al., 2018), and perennial ryegrass (Cericola et al., 2018). As a main consequence, the use of mid-density marker panels would reduce costs per sample, allow more individuals to be genotyped, and ultimately reduce computing resources and time (Gorjanc et al., 2015).

Traditionally, GEBVs of untested individuals are estimated at the individual level. Models are fitted by genotyping individuals, while phenotypic records are obtained per genotype or progenies from specific families. An alternative approach, genome-wide family prediction (GWFP), has been developed in which GEBVs are predicted based on family-bulked phenotypic and genotypic data (Rios et al., 2021). In this scenario, phenotypic values collected to train prediction models resemble commercial stands, which is performed at the family level for some crops, while genotyping costs are minimized (Fé et al., 2016; Guo et al., 2018; Rios et al., 2021). GWFP has been applied in alfalfa, and similar predictive abilities were reported in comparison with prediction models that estimated breeding values at the individual level (Murad Leite Andrade et al., 2022; Annicchiarico, Nazzicari, et al., 2015). Although the effect of marker density in GP has been reported in alfalfa (X. He et al., 2022; Wang et al., 2024), no studies have addressed the optimal number of markers for GWFP.

Advances in sequencing technology have resulted in different strategies to retrieve genotypic data in crops. Genotyping by sequencing (GBS) based on reduced representation by restriction enzymes, although having relatively high missing data rates, has been successfully used to genotype alfalfa populations for GP of DMY (Annichiaricco, Nazzicari, et al., 2015; Biazzi et al., 2017; X. He et al., 2022; X. Li et al., 2015; Medina et al., 2020; Pegard et al., 2023). A mid-density, fixed array yielding a few 1000 markers has been developed for alfalfa (X. Li et al., 2014) and could be used for GP in alfalfa as they have been proven to be effective in other species (Heffner et al., 2011; Kriaridou et al., 2020; Lorenz et al., 2012). In maize, several fixed marker arrays have been reported in the literature, offering lower costs per sample compared to GBS. However, on average, lower PA compared to GBS have been described for GP (Yu et al., 2023). Targeted enrichment sequencing (Capture-seq) also allows to reduce the complexity of genomes and yields thousands to hundreds of thousands of markers. Murad Leite Andrade et al. (2022) described the implementation of Capture-seq in family bulks to develop GP models for canopy height and yield in nondormant alfalfa. The authors first designed 30,000 120-mer biotinylated probes based on the Cultivated Alfalfa at the Diploid Level genome, and probes were further searched against a tetraploid alfalfa assembly (Dr. Maria Monteros, Noble Research Institute personal communication, 2018), and only probes that hit four or fewer times in the genome were kept (as alfalfa is an autotetraploid species). Finally, the probes were screened against the alfalfa chloroplast genome (KU321683.1) to avoid probes associated with non-nuclear DNA. Finally, 17,707 target probes were used in the bulked family DNA samples, and sequencing was performed using the Illumina HiSeq instrument and 1 × 100 reads. Recently, a targeted sequencing mid-density panel of 3000 markers has been developed for alfalfa and could be implemented for GP (Zhao et al., 2023). This panel was developed using a panel of 40 diverse alfalfa clonal genotypes from North America (17 elite parents with various fall dormancy levels, six samples of diploid-cultivated alfalfa, 13 genotypes with abiotic stress resistance, one genotype with Aphanomyces root rot disease resistance, and three other genotypes). The detailed description for the development of the panel is described in Zhao et al. (2023).

Several factors weigh in the plant breeder's choice for genotyping platforms. Cost per data point, number of yielded DNA markers, turnaround time, bioinformatics services provided by vendor, and DNA sample requirements by the vendor will dictate the choice of technology. The decision to choose a specific technology depends on the objectives and is not always clear. In this context, GP optimization plays an important role for genomic selection implementation. Our central hypothesis in this study is that by reducing marker density, we can reallocate resources by decreasing the costs of genotyping and keeping the same PA. To this end, we used a large population of alfalfa, evaluated over multiple harvests, to (i) test two targeted sequencing genotyping platforms with different marker density for DMY prediction, (ii) explore optimal levels of marker density to predict DMY in alfalfa family bulks, and (iii) compare the performance of marker-based and pedigree-based prediction models for DMY in alfalfa family bulks. Altogether, we provide a new line of evidence that genomic selection applied to forage breeding can be optimized and implemented at the pace and scope of traditional breeding designs.

Core Ideas

Alfalfa family bulks were genotyped using two targeted sequencing technologies: DArTag and Capture-seq.
2190 DArTag and 114,945 Capture-seq markers resulted in similar population structure and predictive abilities.
Predictive ability for dry matter yield (DMY) increased with marker density until reaching a plateau at 500 markers in both platforms.
A set of 500 markers provided similar predictive abilities for cumulative DMY training a model with only five harvests.
Genomic prediction outperformed pedigree-based predictions for cumulative DMY per year.

MATERIALS AND METHODS

Reference population

The nondormant alfalfa reference population was created using a diverse set of parental lines coming from 33 populations from germplasm sources from Argentina and the United States. The selection of the parental lines was based on DMY and plant persistence in a 2-year study performed in Florida (Acharya et al., 2020). A factorial mating design was used for crosses using six male parents (four parents from the UF alfalfa breeding program and two commercial cultivars from the United States), and 27 parental lines used as females from the alfalfa breeding program at INTA–Manfredi Argentina and from other germplasm sources from the United States. Besides the four male parents from the UF breeding program and some female parents developed by INTA–Manfredi, most of the parental lines used in these crosses do not share common pedigrees as they come from multiple programs and companies. In most crosses, a single plant per population (cultivar or breeding line) was used as a parental line. Details about the crosses can be found in Acharya et al. (2020). The reference population consisted of 160 alfalfa families, 134 full-sibs from controlled crosses, and 26 half-sibs by harvesting seeds from open-pollinated parental lines.

Experimental design and cultural practices

A field trial was established at the UF/IFAS Plant Science Research and Education Unit, Citra, FL (29°24′16ʺ N and 82°10′17ʺ W) from November 2017 to August 2019. The experimental design was a row and column design with an augmented representation of three controls. The controls used in the trial were the cultivars Bulldog 805, and Florida 99, and the new cultivar from the UF breeding program, UF_AlfPers_2015. Seeds from cultivars and crosses were sown in trays and grown in a greenhouse for 10 weeks. Twenty seedlings were randomly selected from the trays and transplanted in each plot in two rows of 10 plants (1.82 m × 0.12 m). A detailed description of the arrangement of the plots and the cultural practices is reported in Acharya et al. (2020).

Phenotyping

For phenotyping, DMY was measured for 11 harvest time points from April 2018 to March 2019 after an initial 6-month establishment period as described in Acharya et al. (2020). Experimental units were manually harvested when the control UF_AlfPers_2015 reached 10% blooming using a sickle. Total fresh weight was recorded. From the whole plot sample, a subsample of approximately 500 g was dried to estimate the DMY (kg ha⁻¹) per plot. The breeding population used in this study represents the first breeding cycle for recent efforts at the UF alfalfa breeding program to improve the crop's adaptation and performance in subtropical conditions (Acharya et al., 2020). Phenotypic records were collected only for 11 harvests because several families showed significant stand persistence and yield decline. These limited records prevented us from exploring genotype by environment interactions and across-year yield fluctuations. However, the dataset collected in this study does not deviate significantly from current expectations for stand longevity of alfalfa grown in Florida (Dubeux & Munoz, 2014), and 187 breeding efforts (selection and crossing) continued after this study was terminated (Dr. Rios, personal communication, 2024).

DNA extraction

One leaf was removed from 30 individuals belonging to each family and bulked for DNA extractions at the family level. To ensure uniformity, one new leaf from each individual was removed, and the entire set of 30 leaves was ground in a 1.5-mL round-bottomed tube with two metal beads in a Qiagen TissueLyser II. DNA was extracted directly from the tube with the grounded leaves. Genomic DNA was isolated using the DNeasy Plant Mini Kit (Qiagen) and quantified with a Quant-iT PicoGreen dsDNA assay kit (Life Technologies, P7589). The same DNA samples were used for genotyping using two targeted enrichment sequencing platforms: Capture-seq and DArTag.

Capture-seq genotyping

The genotyping was performed at Rapid Genomics, LCC employing Capture-seq. For this, 17,707 probes of 120-mers were used for targeted enrichment in the bulked family DNA samples, and sequencing was performed using the Illumina HiSeq instrument with 1 × 100 cycles (Murad Leite Andrade et al., 2022).

Subsequently, the raw reads were filtered and trimmed using Trimmomatic v.0.39 (Bolger et al., 2014). Filtered reads were aligned against the largest homologous set of Medicago sativa ‘XinJiangDaYe’ reference genome (Chen et al., 2020) using the BWA v.0.7.17 software (H. Li & Durbin et al., 2009). The alignment files were converted and sorted using SAMtools v.1.10 (H. Li et al., 2009). Picard v.2.21.2 () was used to add groups and remove PCR duplicates. Single nucleotide polymorphisms (SNPs) were discovered de novo in this population using FREEBAYES v.1.3.2 (Garrison & Marth et al., 2012).

SNPs were further filtered considering the following criteria: minimum base quality of 20; minimum mapping quality of 30; only biallelic locus; no monomorphic locus; maximum missing data of 10%; mean read depth (DP) value over all samples ≥30; sequencing depth for each data point (gDP) ≥4 (data points with smaller DP were set as not assessed [NA]); the ratio between the mapping quality of the alternative and the reference allele between 0.95 and 1.05; minor allele frequency >0.05. A total of 114,945 SNPs were kept after filtering.

Alternative and reference read counts for each SNP and sample were retrieved using VCFtools v.0.1.16 (Danecek et al., 2011). Genotype parametrization was performed considering the ratio between read counts of the reference allele and alternative allele. Instead of discrete genotype dosage calling, as a family bulk, genotypic data varied continuously from 0 to 1 as described in de Bem Oliveira et al. (2019, 2020).

DArTag genotyping

The DArTag genotyping was performed at Diversity Array Technologies using a 3K panel developed by Breeding Insight. Details of the development of the SNP panel are described in Zhao et al. (2023). Briefly, the DArTag assay is based on a pool of 3000 oligo probes that were hybridized against the denatured family bulked DNA samples in a first step. After downstream processes, DArTag products were sequenced on the Illumina platform to produce reads of 54 bp with a depth of around 350x per marker per sample. Genetic variants were detected using DArT's proprietary analytical pipeline. SNPs were filtered for minor allele frequency > 0.05 and no missing data, yielding a total of 2190 markers. The total allele read counts were used to estimate the allele ratio, following the same pipeline performed with the Capture-seq data. Therefore, genotypic data also varied continuously from 0 to 1.

Linkage disequilibrium and population structure analysis

Pairwise Pearson correlations (r²) between pairs of SNPs in a window of 1 Mb were performed to assess the linkage disequilibrium (LD) across the genome. For Capture-Seq, 2190 SNPs were randomly sampled to match the number of SNPs detected with DArTag, and both datasets were used in the LD analysis. The LD decay over physical distance was determined as the mean distance under the LD threshold of r² = 0.2.

The population used in this study was the result of a mating design with six male parents in combination with 27 females from different cultivars (Acharya et al., 2020). To observe if the additive genetic relationship matrix would reflect the population structure of the breeding population, we performed a principal component analysis (PCA) of the additive marker-based relationship matrix estimated from SNP marker data of each genotyping platform based on the VanRaden method from the AGHmatrix R package (Amadeu et al., 2016). PCA was performed using the R package adegenet (Jombart & Ahmed, 2011, R Core Team, 2020).

To observe the effect of reduced marker density in capturing the population structure, we performed a PCA of additive marker-based relationship matrices estimated from 10, 50, 100, 500, and 1000 markers of each genotyping platform using AGHmatrix and adegenet R packages as described before.

Phenotypic data analyses

The empirical best linear unbiased estimates (BLUE's) for DMY per harvest were estimated using the following linear model: 1 $\begin{equation}{{y}_{ijk}} = \mu + {{r}_i} + {{c}_j} + {{f}_k} + {{e}_{ijk}}\end{equation}$ where ${{y}_{ijk}}$ is the phenotypic value of the $k{\mathrm{t}}{{{\mathrm{h}}}}$ family, in the $i{\mathrm{t}}{{{\mathrm{h}}}}$ row and the $j{\mathrm{t}}{{{\mathrm{h}}}}$ column. The term $\mu$ represents the population mean; $r$ is the random effect of rows, with $r$ ∼ N(0, $I\sigma _r^2$ ); $c$ is the random effect of columns, with $c$ ∼ N(0, $I\sigma _c^2$ ); ${{f}_k}$ is the fixed effect of the $k{\mathrm{t}}{{{\mathrm{h}}}}$ family; and ${{e}_{ijk}}$ is the random effect of residual, with $e$ ∼ N(0, $\sigma _e^2$ ). Estimated BLUE's for DMY for each harvest were later used to calculate the PA of the GP models. The phenotypic correlation between the BLUE's of each family across the harvests was calculated.

To obtain the accumulated performance of each family in all 11 harvests, the following linear model was used: 2 $\begin{equation}y = {\bm{X}}b + {{{\bm{Z}}}_r}r + {{{\bm{Z}}}_c}c + \xi \end{equation}$ where $y$ is the vector of phenotypic values of the trait; $b$ is the vector of fixed effects (overall mean and families); $r$ is the random effect of rows, with $r$ ∼ N(0, $I\sigma _r^2$ ); c is the random effect of columns with $c$ ∼ N(0, $I\sigma _c^2$ ); and $\xi$ is the residual random effect, with $\xi$ ∼ N(0, ${{R}_e}\sigma _e^2$ ). ${{R}_{e\ }}$ is the first-order autoregressive covariance matrix of $\xi$ , it is defined as $\sigma _\xi ^2\sum c( {{{\rho }_c}} ) \otimes \sum r( {{{\rho }_r}} ).$ The terms ${{\rho }_c}\ and\ {{\rho }_r}$ are the autocorrelation parameters for the coordinates of columns and rows; $\sum r( {{{\rho }_c}} )$ and $\sum r( {{{\rho }_r}} )$ represent autoregressive correlation matrices while $\otimes$ is the Kronecker product (Murad Leite Andrade et al. 2022). ${\bm{X}},\ {{{\bm{Z}}}_r},\ {\mathrm{and}}\ {{{\bm{Z}}}_c}\ a$ re incidence matrices associated with the vectors $b,\ r,{\mathrm{\ and}}\ c$ . The eBLUE (empirical BLUE) of each family for DMY was estimated.

Broad-sense heritability

Broad-sense heritability (H²) was calculated per harvest following variance component estimates derived from a model similar to Equation (1) in which ${{f}_k}$ is the random effect of the $k{\mathrm{t}}{{{\mathrm{h}}}}$ family with $f$ ∼ N(0, $I\sigma _f^2$ ).

GP models

The GP analyses were conducted considering two statistical models to represent scenarios using data from a single harvest and multiple harvests.

The first scenario was a model in which GP was implemented per harvest following the linear model: 3 $\begin{equation}y = {\bm{X}}b + {{{\bm{Z}}}_r}r + {{{\bm{Z}}}_c}c + {{{\bm{Z}}}_f}f + \varepsilon \end{equation}$ in which $y$ is the vector of phenotypic values of the trait; $b$ is the vector of fixed effects, which in this case only includes the overall mean; $r$ is the random effect of rows, with $r$ ∼ N(0, $I\sigma _r^2$ ); $c$ is the random effect of columns, with $c$ ∼ N(0, $I\sigma _c^2$ ); $f$ is the random effect of families, with $f$ ∼ N(0, G $\sigma _f^2$ ) in which G is the genomic relationship matrix (described below); and $\varepsilon$ is the residual random effect, with $\varepsilon$ ∼ N(0, ${\bm{I}}\sigma _\varepsilon ^2$ ). ${\bm{X}},\ {{{\bm{Z}}}_r},\ {{{\bm{Z}}}_c},$ and ${{{\bm{Z}}}_f}$ are the incidence matrices associated with the vectors $b,\ r,\ c,$ and $f$ .

A second model accounted for the harvest effect and the interaction between families and harvest, and the following model was considered for the harvest × family scenario: 4 $\begin{equation}y = {\bm{X}}b + {{{\bm{Z}}}_r}r + {{{\bm{Z}}}_c}c + {{{\bm{Z}}}_f}f + \varepsilon \end{equation}$ in which $y$ is the vector of phenotypic values of the trait; $b$ is the vector of fixed effects (overall mean and harvests); $r$ is the random effect of rows within harvests, with $r$ ∼ N(0, $I\sigma _r^2$ ); $c$ is the random effect of columns within harvests, with $c$ ∼ N(0, $I\sigma _c^2$ ); $f$ is the random effect of families, with $f$ ∼ N(0, ${\bm{I}}_H^{J^{\prime}J}$ ⊗ G $\sigma _f^2$ ) in which G is the genomic relationship matrix and ${\bm{I}}_H^{J^{\prime}J}$ is the identity matrix of variance among harvests; and $\varepsilon$ is the residual random effect, with $\varepsilon$ ∼ N(0, ${\bm{I}}\sigma _\varepsilon ^2$ ). ${\bm{X}},\ {{{\bm{Z}}}_r},\ {{{\bm{Z}}}_c},\ {{{\bm{Z}}}_f},$ and ${{{\bm{Z}}}_{fh}}$ were incidence matrices associated with the vectors $b,\ r,\ c,\ {\mathrm{and}}\ f$ . For both scenarios, single and multi-harvests, the models were fitted using the GBLUP approach (VanRaden, 2008). The genetic covariance, G matrix, was estimated using the AGHmatrix R package (Amadeu et al., 2023), using the ratio option: ${\bm{G\ }} = \frac{{ZZ^{\prime}}}{k}$ , in which Z is the mean centered of matrix M of allele ratio values and $k$ is the scale vector, $k = \sum_i^m s_i^2,$ which is the sum of the sampling variance of each marker (de Bem Oliveira et al., 2020). The models were fitted using the package ASReml-R v.4.0 (Butler et al., 2017).

Predictions based on pedigree information

The additive relationship matrix (A) was computed from the pedigree information of the nondormant alfalfa reference population using the AGHmatrix R package (Amadeu et al., 2023), following the option considering autotetraploidy (Kerr et al., 2012). Pedigree-based prediction models (ABLUP) were fitted replacing G by A in Equations (3) and (4).

Cross-validation and PA

The cross-validation (CV) scheme randomly assigned 90% of the families to the training set and 10% to the validation set. As the presence of a population structure can affect the prediction ability (Isidro et al., 2015), the sampling process for the CV was stratified, that is, the same number of families were sampled within each subpopulation. The CV was repeated a 100 times, using different training and validation sets. The PA was obtained in the CV scheme by calculating the Pearson correlation between predicted (GEBVs) and empirical adjusted phenotypes (eBLUEs for single harvest scenarios and eBLUPs for multiple harvest scenarios).

Marker density optimization

Different scenarios of marker density were created by randomly selecting a subset of markers from the full set to create the G matrix. Fourteen scenarios were explored for Capture-seq for the following number of markers: 10; 50; 100; 500; 1000; 1500; 2190; 5000; 10,000; 20,000; 40,000; 70,000; and 114,945. For DArTag seven scenarios were created: 10, 50, 100, 500, 1000, 1500, and 2190 markers. For the first scenario, 10 markers were randomly sampled to create the G matrix. A cumulative approach was applied in the following scenarios with increased marker density to avoid bias due to the informative nature of the markers. For example, the first subset of 10 markers was included in the first subset of 50 markers, which in turn was also included in the first subset of 100 markers, and so on. In each scenario, new markers were added by random sampling. Random sampling of the subset of markers was performed 10 times for each scenario, with the exception of the full set of markers.

Prediction schemes

Two prediction schemes were explored to mimic potential scenarios that alfalfa breeders might face while predicting family performance within and across harvests. Harvests were considered observed and unobserved to indicate whether phenotypic data of the target harvest was included in the calibration set. Two different prediction schemes were evaluated considering the different harvests:

Prediction scheme I: The training set was composed of a set of families within one harvest following a 10-fold CV scheme, and the validation sets were composed of the remaining families. PA was assessed between predicted and adjusted phenotypes (BLUEs) within the same harvest.
Prediction scheme II: The training set was composed of data from multiple harvests. Starting with harvest 1, the GP model was built to predict the accumulated performance across all harvests for untested families. Then, the model was updated sequentially, by adding data to include more harvests in the training set. The validation set was composed of untested families in the harvests used as training set. PA was assessed between predicted and the BLUPs across all harvests (mean performance for the 11 harvests).

Effect of relatedness in marker density optimization

Due to the presence of population structure in which the families are related either by sharing a common male or female progenitor, a scenario was explored to observe the effect of population structure in marker density optimization for which training and validation sets were unrelated. To accomplish this, all families derived from a cross involving ${\mathrm{mal}}{{{\mathrm{e}}}_i}$ or ${\mathrm{femal}}{{{\mathrm{e}}}_j}$ were eliminated from the training set, and a model was fitted following Equation (3) to predict the performance of the family derived from the cross ${\mathrm{mal}}{{{\mathrm{e}}}_i}\ \times \ {\mathrm{femal}}{{{\mathrm{e}}}_j}$ . Different scenarios of marker density were created as described in Section 2.13 using SNP data from DArTag 3K. PA was estimated as the Pearson correlation between predicted (GEBVs) and empirical adjusted phenotypes (eBLUEs) of the 134 full-sib families within each harvest.

Data availability

Phenotypic data, genomic kernels used to run the models, and R scripts for the analysis performed in this study are available at .

RESULTS

Genotyping platforms and population genetics

The DArTag platform targeted 3000 markers and 2190 (73%) were kept after filtering, while the Capture-seq platform targeted 17,707 regions, yielding 114,945 filtered markers. Both genotypic datasets were then used to build a genomic relationship matrix (G matrix).

LD decay was estimated separately with genotypic data from both platforms within each chromosome (Table S1). For the analysis performed with Capture-seq data, the LD decay, in which the LD was 0.2 or lower ranged, on average, between 461 kb (chromosome 1) and 482 kb (chromosome 6). The average distance between markers ranged between 221 kb (chromosome 1) and 405 kb (chromosome 6) (Figure S1). The analysis with DArTag showed that LD decay ranged between 530 kb (chromosome 1) and 547 kb (chromosome 5) with an average distance between markers ranging from 265 kb (chromosome 1) to 605 kb (chromosome 6) (Figure S2).

Population structure

The population structure analysis was performed using principal component analyses (PCA) on the G matrix computed using the complete set of markers from DArTag and Capture-seq platforms. The first two principal components (PC1 and PC2) explained the observed variation in 15.8% and 12% for DArTag, and 13.4% and 12.5% for Capture-seq, respectively. The clustering of families resembled the mating design used to create the population, with six subpopulations grouped by their male parent. Similar clustering and population structure were observed between genotyping technologies (Figure 1).

[IMAGE OMITTED. SEE PDF]

Population structure analysis with reduced marker density showed that population structure was not captured when using less than 500 markers (Figures S3 and S4). Similar results were obtained for both genotyping platforms.

PA of DMY by harvest using the complete set of markers

Mean DMY ranged from 698.78 to 3128.72 kg ha⁻¹ per harvest. Harvests during spring (H1–H3) had the highest DMY, while lower DMY was recorded during the harvests performed in fall (H6–H8). Phenotypic variation for DMY is summarized in Figure 2a. H² fluctuated from low to relatively moderate estimates. H1 and H3 had the highest H², while H11 had the lowest estimate (Figure 2b).

[IMAGE OMITTED. SEE PDF]

GP models were developed by harvest with a G-BLUP model using the entire set of markers available for each platform. Overall, there were no differences between the performance of both platforms across harvests for PA (Figure 2c). The PA fluctuated throughout the year and it did not resemble the fluctuation in H² estimates (Figure 2b,c). In fact, H1 had the highest H² estimate and one of the lowest PA. The PA for DArTag 3K ranged between 0.13 and 0.41, while PA using Capture-seq data ranged between 0.08 and 0.42 (Figure 2c). For both genotyping platform data, GP models for harvests 1, 2, and 6 displayed lower PA, while harvests 4, 8, 9, and 10 had higher PA.

Effect of marker density in GP of DMY by harvest

Prediction models were fitted by harvest using the GBLUP models with different marker densities for each platform. For both genotyping technologies, PA increased with an increase in marker density until a plateau was reached (Figure 3). For most harvests, there were no differences in PA between GP performed using 500 markers and the full set of markers from Capture-seq. PA for harvests 1 and 6 was low regardless of the marker density, and in harvests 8 and 10, the plateau was reached at 1000 markers (Figure S5).

[IMAGE OMITTED. SEE PDF]

Similar results were obtained when using different marker densities from the DArTag panel to build the G matrix in a G-BLUP model. For most harvests, the plateau in PA was reached when using 500 markers to estimate G matrix. Harvests 1 and 6 displayed a lower response to an increase in marker densities (Figure S5).

For both genotyping platforms, among the harvests that showed an increase in PA with increasing marker density, harvest 2 and harvest 8 were the harvests with lowest and highest PA, respectively (Figure 3).

Effects of relatedness in marker density optimization within harvests

GBLUP models were fitted within harvests accounting for population structure by creating training and testing sets with and without relatedness. For both scenarios, PA increased while marker density increased, and a plateau in PA was reached after 500 markers in most harvests (Figure 4). Predictions with related training and validation sets yielded higher PA compared to unrelated sets in most of the harvests, except for harvests 1 and 7 (Figure 4). The difference, in PA, between scenarios was higher with lower marker density but remained stable after 500 markers.

[IMAGE OMITTED. SEE PDF]

Pedigree-based predictions resulted in higher PA when training and validation sets were related compared to unrelated sets. In general, ABLUP resulted in lower PA compared to GBLUP with the full set of markers in both scenarios (Table S2). GP with 500 markers resulted in higher PA in 8 and 7 harvests compared to ABLUP in the related and unrelated sets, respectively. The gain in PA using marker data compared to ABLUP was higher in the unrelated scenario.

Effects of marker density, genotyping platform, and phenotypic data in GP of annual accumulated DMY

GP was performed to predict annual accumulated DMY with a subset of 500 markers (Figure S6) and the full set of markers for both genotyping platforms. GP was performed by training the model with increasing phenotypic information, with scenarios created with yield data from harvests 1 and 2 (H1:H2) and progressively adding more records from subsequent harvests until the model was built with yield data from all harvests (H1:H11). In all scenarios, maximum PA was achieved when models were trained with phenotypic data from all harvests (Figure 5). GP models using the full genotypic data from DArTag 3K had PA that ranged between 0.18 and 0.33 (Figure 5b). GP models developed with a subset of 500 markers yielded similar results with PA ranging between 0.2 and 0.34 (Figure 5a). Similar results were obtained with Capture-seq where GP with the full genotypic data had a PA between 0.15 and 0.31 (Figure 5d), while GP with a subset of 500 markers yielded PA between 0.23 and 0.35 (Figure 5c).

[IMAGE OMITTED. SEE PDF]

Pedigree-based predictions followed a similar pattern compared to GBLUP but yielded lower PA, ranging from 0.16 to 0.29 (Figure 5). On average, the gain in PA when using 500 markers for GBLUP was 17.2% and 25% for DArTag 3K and Capture-seq, respectively.

DISCUSSION

Genotyping platforms, LD, and population structure

The benefit for targeted amplicon-based genotyping technologies such as DArTag and Capture-Seq, as opposed to GBS, is the relatively low missing data rates and the query of the same exact loci in all samples across projects. The medium-density panel DArTag 3K and the Capture-seq approach are very promising platforms for alfalfa breeding programs. Capture-seq genotyping yielded more markers than the DArTag 3K panel, since it originally targeted more genomic regions and multiple variants were called within the same probe region. Nonetheless, both platforms provided similar results in terms of population structure, LD decay, and GP in our breeding population.

LD is a factor affecting PA in GP models (Liu et al., 2018; Xu et al., 2020). Outcrossing species, such as alfalfa, are highly heterozygous and experience rapid LD decay. When LD decays in a short distance, more markers are required to cover the genome and establish associations between marker data and traits (Flint-Garcia et al., 2003). Nevertheless, plant breeding populations are highly structured and have extensive LD requiring less markers to capture the genetic variability among genotypes (X. Li & Brummer, 2012).

In our population, LD decay displayed similar results between genotyping platforms when using the same number of markers. Similarly, the LD decay analysis with 2190 markers was comparable to the values reported by Murad Leite Andrade et al. (2022). The slightly inflated LD estimated by the DArTag 3K panel could be related to the markers being more evenly spaced and having on average a larger distance between them compared to Capture-seq data (Figures S7 and S8). Different studies of LD decay have reported contrasting results for alfalfa (Annicchiarico et al., 2022; X. Li et al., 2011; Sakiroglu et al., 2012). Annicchiarico et al. (2022) reported that LD decayed in short distances (100 of bases), while X. Li et al. (2011) reported LD decayed in 1 Mbp. Results from different studies could be related to the diversity of the populations under study, the number of markers used to analyze LD decay, distance between markers, and threshold and window chosen for the analysis. The findings in the present study suggest the presence of extensive linkage blocks in this breeding population, which could be the result of the mating design used to develop 134 full-sib families and 26 half-sib families.

PA of DMY by harvest using complete set of markers

The phenotypic records for DMY and heritability varied across the 11 harvests, indicating significant genotype-by-harvest interactions as reported previously (Acharya et al., 2020). The fluctuations in DMY and heritability could explain the variation in PA observed through the harvests, since PA is affected by the heritability of the trait and the precision of phenotypic measurements. Interestingly, higher PA was recorded in late spring and early summer, coinciding with the peak of alfalfa production in Florida. High PA was also recorded in winter following the lowest period of production in Florida. Lowest PA values were observed in harvest 6 at the end of summer when alfalfa experiences a severe decline in growth known as the summer slump (Ottman & Mostafa, 2014). Lower PA values could be explained by genotype × harvest interactions as plants transition from active vegetative growth states to dormant states and vice versa.

Similar results in PA obtained with both genotyping platforms agree with the population structure and LD analysis reported in the present study. Both genotyping platforms likely captured the genotypic variance in a similar fashion to reconstruct the relationship matrices used for GP. Elbasyoni et al. (2018) and Moraes et al. (2018) compared prediction ability between GBS and array-scored SNPs in winter wheat and pine trees, respectively, with comparable results to our study, where no differences were observed across platforms.

Effect of marker density in GP of DMY by harvest

The PA of GP models is affected by marker density (Liu et al., 2018; Xu et al., 2020). We found that in most harvests, sets of 500–1000 randomly selected markers from both genotyping platforms yielded predictive abilities comparable to the full set of markers (Figure 3). Previous studies have explored the effects of marker density on PA from training populations phenotyped and genotyped at the individual level in alfalfa. For the fall dormancy trait, Zhang et al. (2023) observed a slightly higher PA for 100 selected markers (PA = 0.611) compared with a full set of 875,023 GBS markers (PA = 0.596). For DMY, 4000 SNP markers selected by mean-squared-estimated-marker effects provided the highest PA (0.36) compared to a full set of 4932 SNPs (Wang et al., 2024). The differences observed with our results could be due to the population structure of the populations analyzed, the extent of LD, and the different genotyping approach (individual vs. family-bulked). Although Wang et al. (2024) report population to be structured, there is no report on LD decay. The relatively low number of markers needed to reach the plateau in PA in our study is consistent with the extensive LD decay observed in our population. Further studies comparing population structure analysis, LD decay, and PA between individually estimated GEBV and GWFP could assess if the genotyping approach has effects on the number of markers required to achieve maximum PA.

A main factor affecting PA of GP models is the relationship between training and validation sets (Wientjes et al., 2013). Our results agree with previous studies observing a loss in PA for most harvests when training and validation sets were unrelated. Interestingly, we did not observe differences in the number of markers needed to achieve maximum PA when comparing related and unrelated scenarios. Hickey et al. (2014) demonstrated that fewer phenotypes and markers (200–500) were needed to reach an effective prediction accuracy when predicting candidates that were closely related to the training population. When candidates for selection were unrelated to the training set, up to 10,000 markers and 1000 of phenotypes were required to effectively train prediction models. Norman et al. (2018) reported more response in prediction accuracy to increased marker density when there was greater diversity in the training set. In this context, it is expected that for genetically broader alfalfa populations, a larger number of markers will be needed to achieve maximum PA. Further studies with more diverse and less structured populations could elucidate the optimal marker density to predict key traits in this species.

Our results agree with previous studies in other outcrossing species that also have shown a plateau in PA while increasing marker density. In blueberries, 5000 markers resulted in similar PA than 86,000 markers using Capture-seq data (de Bem Oliveira et al., 2020). In self-pollinating species, similar results have been observed, that is, 1829 GBS markers produced PA not significantly different than 35,000 markers in wheat (Poland et al., 2012).

In most harvests, ABLUP prediction models resulted in lower PA when compared with predictions using 500 (or more) markers from DArTag 3K with related training and validation sets. When comparing ABLUP and GBLUP with 500 markers with unrelated training and validation sets, the gain in PA when performing GBLUP was, on average, 15% (Table S2). Similar results have been reported in wheat where GBLUP models developed from 1447 markers outperformed pedigree models in three out of four environments (Crossa et al., 2010) and in sorghum where GBLUP with 4781 evenly spaced SNPs outperformed ABLUP for several traits within and among family predictions (Velazco et al., 2019).

The findings in this study are promising as they enable alfalfa breeders to optimize a GP pipeline, lowering genotyping costs and computational resources by using a lower density of markers combined with the GWFP scheme. Simulations in maize with different marker densities demonstrated less decrease in PA over generations when increasing marker density (DoVale et al., 2022). Future studies should address the use of mid-density genotyping platforms in GWFP over several generations in a breeding program or using less-structured and more diverse breeding populations as training and validation sets.

Effects of marker density, genotyping platform, and phenotypic data in GP of annual accumulated DMY

In subtropical conditions, nondormant alfalfa can be harvested more than 10 times in a calendar year, and for this reason the stand longevity for nondormant alfalfa is lesser than for dormant cultivars grown in more temperate regions (Ventroni et al., 2010). Reduction of the number of harvests to train a GP model would lower costs and allow for reallocation of resources to increase the size of the testing population and/or training population. Murad Leite Andrade et al. (2022) and Filho et al. (2023) demonstrated that five consecutive harvests are enough to achieve the same PA as models with all harvests in a given year. In our study, lowering the number of markers to 500 had no effect in the PA for accumulated DMY, reaching the same conclusion about the optimization of number of harvests. Pedigree-based prediction models would allow breeders to avoid genotyping costs and perform selection of untested genotypes. Our results indicate that predictions with low marker density considerably outperformed pedigree-based models for cumulative DMY per year in both genotyping platforms.

CONCLUSIONS

Our study demonstrates the reliability of GWFP to predict DMY in a nondormant alfalfa breeding population. We showed the feasibility of two genotyping platforms, DArTag and Capture-seq, to generate genomic data for alfalfa family bulks. The two platforms yielded similar results for population structure, LD, and PA. The optimal marker density for GWFP in the UF breeding program population was 500 when predicting DMY per harvest. The prediction of the total accumulated DMY per year can be achieved with data collected from just five harvests and using 500 markers, emphasizing the efficiency of the model without compromising its predictive performance in a highly structured alfalfa breeding population. Marker-based predictions outperformed pedigree-based predictions for total accumulated DMY per year showing the added value of incorporating GWFP to alfalfa breeding programs. Both platforms are available for the alfalfa community to genotype populations with repeatable and reliable markers and have the flexibility to be adapted to a reduced set of markers.

AUTHOR CONTRIBUTIONS

Pablo Sipowicz: Data curation; formal analysis; investigation; validation; visualization; writing—original draft. Mario Henrique Murad Leite Andrade: Data curation; formal analysis; supervision; writing—review and editing. Claudio Carlos Fernandes Filho: Data curation; formal analysis; investigation; supervision; writing—review and editing. Juliana Benevenuto: Data curation; formal analysis; supervision; writing—review and editing. Patricio Muñoz: Conceptualization; funding acquisition; methodology; project administration; resources; supervision; writing—review and editing. L. Felipe V. Ferrão: Conceptualization; methodology; software; writing—review and editing. Marcio F. R. Resende Jr.: Conceptualization; methodology; supervision; writing—review and editing. C. Messina: Investigation; methodology; supervision; writing—review and editing. Esteban F. Rios: Conceptualization; funding acquisition; investigation; methodology; project administration; resources; supervision; writing—review and editing.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT

Phenotypic and genotypic data, R scripts developed for the analysis and genomic kernels can be accessed by request to the corresponding author.

References

Acharya, J. P., Lopez, Y., Gouveia, B. T., de Bem Oliveira, I., Resende, M. F., Muñoz, P. R., & Rios, E. F. (2020). Breeding alfalfa (Medicago sativa L.) adapted to subtropical agroecosystems. Agronomy, 10(5), 742. [DOI: https://dx.doi.org/10.3390/agronomy10050742]

Word count: 7083

Show less

© 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

Alfalfa (Medicago sativa L.) is a perennial forage legume esteemed for its exceptional quality and dry matter yield (DMY); however, alfalfa has historically exhibited low genetic gain for DMY. Advances in genotyping platforms paved the way for a cost‐effective application of genomic prediction in alfalfa family bulks. In this context, the optimization of marker density holds potential to reallocate resources within genomic prediction pipelines. This study aimed to (i) test two genotyping platforms for population structure discrimination and predictive ability (PA) of genomic prediction models (G‐BLUP) for DMY, and (ii) explore optimal levels of marker density to predict DMY in family bulks. For this, 160 nondormant alfalfa families were phenotyped for DMY across 11 harvests and genotyped via targeted sequencing using Capture‐seq with 17K probes and the DArTag 3K panel. Both platforms discriminated similarly against the population structure and resulted in comparable PA for DMY. For genotyping optimization, different levels of marker density were randomly extracted from each platform. In both cases, a plateau was achieved around 500 markers, yielding similar PA as the full set of markers. For phenotyping optimization, models with 500 markers built with data from five harvests resulted in similar PA compared to the full set of 11 harvests and full set of markers. Altogether, genotyping and phenotyping efforts were optimized in terms of number of markers and harvests. Capture‐seq and DArTag yielded similar results and have the flexibility to adjust their panels to meet breeders’ needs in terms of marker density.

Details

Title

Optimization of high‐throughput marker systems for genomic prediction in alfalfa family bulks

Author

Sipowicz, Pablo¹; Murad Leite Andrade, Mario Henrique²; Fernandes Filho, Claudio Carlos³; Benevenuto, Juliana⁴

; Muñoz, Patricio⁴

; Ferrão, L. Felipe V.⁴

; Resende, Marcio F. R.⁴

; Messina, C.⁴; Rios, Esteban F.⁵

¹ Plant Breeding Graduate Program, University of Florida, Gainesville, Florida, USA, Instituto Nacional de Tecnologia Agropecuaria, Manfredi, Argentina
² School of Food and Agriculture, University of Maine, Orono, Maine, USA
³ Centro de Tecnologia Canavieira (CTC), Piracicaba, São Paulo, Brazil
⁴ Horticultural Sciences Department, University of Florida, Gainesville, Florida, USA
⁵ Agronomy Department, University of Florida, Gainesville, Florida, USA

Section

SPECIAL SECTION: TRIBUTE TO RON PHILLIPS: CROP GENETICS, GENOMICS AND BIOTECHNOLOGY

Publication year

2025

Publication date

Mar 1, 2025

Publisher

John Wiley & Sons, Inc.

ISSN

19403372

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/tpg2.20526

ProQuest document ID

3181493296

Optimization of high‐throughput marker systems for genomic prediction in alfalfa family bulks

Jump to:

Full text

Abstract

Details

Suggested sources