Background
Accuracy of genomic predictions can be improved by using more variants, including variants that are preselected for their effect, located near or within genes, predicted to affect gene function, or known to be causal. Past analyses often gave equal weight to evenly spaced markers, whereas new analyses can focus on potential quantitative trait loci (QTL) or preselected variants that are more closely linked to QTL. Nearly 40 million variants have been identified from whole-genome sequence (WGS) data for over 1500 bulls, and several strategies to impute these variants to additional animals and use them in genetic evaluation for economic traits show potential [1-8]. For example, candidate variants can be targeted to specific traits, as with variants in genes related to fertility; adding 39 single nucleotide polymorphisms (SNPs) to the marker set used for genomic prediction improved reliability for daughter pregnancy rate by 0.2 percentage points [9]. The number of sequenced animals should continue to increase as researchers examine more families and the cost of generating data continues to decrease.
Imputing, selecting, and predicting effects for millions of variants and many thousands of individuals require efficient computation. Computational costs, which are proportional to the number of variants multiplied by the number of individuals, could exceed the marginal benefits from adding more variants. Variants within or near genes should improve the reliability of predictions, and direct use of causal variants is preferred to using linked markers. Strategies to choose variants for inclusion on genotyping arrays of different densities or in routine predictions were developed and compared using simulated data for Holstein bulls. Here, we first examined simulated data and then real sequence genotypes from the 1000 Bull Genomes Project [10].
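The proportionality stated above (cost scaling with the number of variants multiplied by the number of individuals) can be illustrated with a minimal sketch. The function name `relative_cost`, the array density of 60,000 markers, and the population size of 100,000 animals are assumptions chosen for illustration only; the figure of 40 million variants is the approximate count given in the Background. The sketch models only the stated proportionality, not the actual cost of any particular imputation or prediction program.

```python
def relative_cost(n_variants, n_individuals, base_variants, base_individuals):
    """Return the cost ratio of one scenario to a baseline scenario,
    assuming cost is proportional to variants x individuals."""
    return (n_variants * n_individuals) / (base_variants * base_individuals)


# Illustrative inputs (assumptions, except where noted)
ARRAY_MARKERS = 60_000          # assumed array density, not taken from the text
SEQUENCE_VARIANTS = 40_000_000  # "nearly 40 million variants" from the Background
ANIMALS = 100_000               # assumed number of genotyped individuals

# Same animals, sequence variants instead of array markers
ratio = relative_cost(SEQUENCE_VARIANTS, ANIMALS, ARRAY_MARKERS, ANIMALS)
print(f"Relative cost of sequence vs. array data: about {ratio:.0f}x")
```

Under this simple model the ratio depends only on the two variant counts when the population is held fixed, which is why preselecting a subset of variants can reduce cost roughly in proportion to the number of variants dropped.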
The goals of this study were to (1) compare the reliability of prediction from sequence, array, and combined data as well as different types of variants, (2) test the methods first on simulated data before applying them to real sequence data imputed for a large reference population, and (3) investigate editing, imputation, and computing strategies that are efficient for even larger genotyped populations.
Methods
Simulated sequence data
Our simulation was designed to closely mimic an actual large-scale sequencing project for cattle, in which a subset of ancestor bulls had WGS data, another subset...