Abstract

Because of the rapidly growing amount of sequencing data, computing sketches of large textual datasets has become an essential preprocessing task. These sketches are typically much smaller than the input sequences, but preserve sufficient information for downstream analysis. Minimizers are an especially popular sketching technique and are used in a wide variety of applications. They sample at least one out of every w consecutive k-mers. As DNA sequencers are getting more accurate, some applications can afford to use a larger w and hence sparser and smaller sketches. And as sketches get smaller, their analysis becomes faster, so the time spent sketching the full-sized input becomes more of a bottleneck. Our library simd-minimizers implements a random minimizer algorithm using SIMD instructions. It supports both AVX2 and NEON architectures. Its main novelty is two-fold. First, it splits the input into 8 chunks that are streamed over in parallel through all steps of the algorithm. This is enabled by using the completely deterministic two-stacks sliding window minimum algorithm, which seems not to have been used before for finding minimizers. Our library is up to 9.5x faster than a scalar implementation of the rescan method for small w=5, and 4.5x faster for larger w=19. Computing canonical minimizers is only around 50% slower than computing forward minimizers, and around 16x faster than the existing implementation in the minimizer-iter crate. Our library finds all forward (resp. canonical) minimizers of a 3.2Gbp human genome in 4.1 (resp. 6.0) seconds. Availability: simd-minimizers is available at https://github.com/rust-seq/simd-minimizers
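
To make the two ideas in the abstract concrete (sampling the smallest k-mer in every window of w consecutive k-mers, and finding that minimum with a two-stacks sliding window), here is a minimal scalar sketch in Rust. It is not the simd-minimizers API and uses no SIMD. The names `kmer_hash` and `minimizer_positions` are hypothetical, the multiplicative hash is only a stand-in for whatever hash the library uses, and the sketch assumes k <= 32 and fewer than 2^32 k-mers so that a hash and a position fit in one u64.

```rust
/// Hypothetical stand-in hash (an assumption, not the library's hash):
/// pack the k-mer into 2 bits per base, then mix with a multiplicative hash.
fn kmer_hash(kmer: &[u8]) -> u32 {
    let mut x: u64 = 0;
    for &c in kmer {
        let code: u64 = match c {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            _ => 3, // T and anything else
        };
        x = (x << 2) | code;
    }
    (x.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as u32
}

/// Returns the deduplicated start positions of the minimizers of `seq`:
/// for every window of `w` consecutive k-mers, the leftmost k-mer with the
/// smallest hash. Uses a two-stacks sliding window minimum over a ring buffer.
fn minimizer_positions(seq: &[u8], k: usize, w: usize) -> Vec<usize> {
    assert!(k >= 1 && k <= 32 && w >= 1);
    if seq.len() < k + w - 1 {
        return Vec::new(); // not even one full window
    }
    let num_kmers = seq.len() - k + 1;
    // ring[j] holds either a value of the current block or, after a block
    // completes, the minimum of that block from position j to its end.
    let mut ring = vec![u64::MAX; w];
    let mut prefix_min = u64::MAX; // minimum of the current, partial block
    let mut idx = 0;
    let mut out: Vec<usize> = Vec::new();
    for i in 0..num_kmers {
        // Hash in the high bits, position in the low bits: the minimum of
        // packed values is the smallest hash, with ties broken leftmost.
        let v = ((kmer_hash(&seq[i..i + k]) as u64) << 32) | i as u64;
        ring[idx] = v;
        prefix_min = prefix_min.min(v);
        idx += 1;
        if idx == w {
            // Block complete: turn the ring into suffix minima, reset prefix.
            idx = 0;
            for j in (0..w - 1).rev() {
                ring[j] = ring[j].min(ring[j + 1]);
            }
            prefix_min = u64::MAX;
        }
        if i + 1 >= w {
            // Window = tail of the previous block + head of the current one.
            let pos = (prefix_min.min(ring[idx]) & 0xFFFF_FFFF) as usize;
            if out.last() != Some(&pos) {
                out.push(pos);
            }
        }
    }
    out
}
```

Calling `minimizer_positions(b"ACGTTGCATGCCGATAGCTAGGCTTACGATCG", 5, 4)` returns a strictly increasing list of k-mer start positions in which consecutive entries are at most w apart. The loop body has no data-dependent branches other than the fixed-period block rebuild, which is the property that plausibly makes this scheme suitable for running several chunks in lockstep.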

Competing Interest Statement

The authors have declared no competing interest.

Details

Title
SimdMinimizers: Computing random minimizers, fast
Publication title
bioRxiv; Cold Spring Harbor
Publication year
2025
Publication date
Jan 27, 2025
Section
New Results
Publisher
Cold Spring Harbor Laboratory Press
Source
BioRxiv
Place of publication
Cold Spring Harbor
Country of publication
United States
University/institution
Cold Spring Harbor Laboratory Press
Publication subject
ISSN
2692-8205
Source type
Working Paper
Language of publication
English
Document type
Working Paper
ProQuest document ID
3160207776
Document URL
https://www.proquest.com/working-papers/simdminimizers-computing-random-minimizers-fast/docview/3160207776/se-2?accountid=208611
Copyright
© 2025. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-01-28
Database
ProQuest One Academic