Content area

Abstract

Haplotype-based statistics are widely used for finding genomic regions under positive selection. At the heart of many such statistics is the computation of extended haplotype homozygosity (EHH), which captures the decay of homozygosity away from a focal site. This computation, repeated for potentially millions of sites, is computationally demanding, as it involves tracking counts of unique haplotypes iteratively over long genomic distances and across many individuals. Because of these computational challenges, existing tools do not scale well when applied to large-scale population datasets, such as the 1,000 Genomes Project, or the UK Biobank with 500,000 individuals. Optimizing computation becomes crucial when data sets grow large, especially when handling large sample sizes or generating training data for machine learning algorithms. Here, we propose a dynamic programming algorithm that substantially improves runtime and memory usage over existing tools on both real and simulated data. On real phased data, we achieve 5–50x speedup with minimal memory footprint. Our simulations show an even more pronounced performance gap with large populations (up to 15x speedup and 46x memory reduction). EHH-based statistics designed for unphased genotypes run an order of magnitude faster, and multi-parameter support results in 20x runtime improvement. Source code and binaries are available at https://github.com/szpiech/selscan as selscan v2.1.

Details

Title
Fast and Memory-Efficient Dynamic Programming Approach for Large-Scale EHH-Based Selection Scans
Author
Rahman, Amatur 1   VIAFID ORCID Logo  ; Smith, T Quinn 1   VIAFID ORCID Logo  ; Szpiech, Zachary A 1   VIAFID ORCID Logo 

 Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA  [email protected]; [email protected]
Author e-mail address
Publication title
Volume
42
Issue
11
Number of pages
13
Publication year
2025
Publication date
Nov 2025
Section
Resources
Publisher
Oxford University Press
Place of publication
Oxford
Country of publication
United Kingdom
Publication subject
ISSN
07374038
e-ISSN
15371719
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
 
 
Online publication date
2025-10-27
Milestone dates
2025-04-10 (Received); 2025-09-26 (Rev-Recd); 2025-10-10 (Accepted); 2025-11-27 (Corrected-Typeset)
Publication history
 
 
   First posting date
27 Oct 2025
ProQuest document ID
3276084471
Document URL
https://www.proquest.com/scholarly-journals/fast-memory-efficient-dynamic-programming/docview/3276084471/se-2?accountid=208611
Copyright
© 2025 The Author(s) 2025. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution. This work is published under https://creativecommons.org/licenses/by-nc/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-28
Database
ProQuest One Academic