Abstract

Non-coding single nucleotide polymorphisms (SNPs) are critical drivers of gene regulation and disease susceptibility, yet predicting their functional impact remains challenging. Several methods exist for encoding non-coding SNPs, such as direct base encoding or embeddings obtained from pre-trained models. However, comprehensive evaluation and guidance on choosing an encoding strategy for downstream prediction tasks involving non-coding SNPs are lacking. To address this gap, we present a benchmark study that compares six distinct encoding strategies for non-coding SNPs, assessing them across six dimensions, including interpretability, encoding abundance, and computational efficiency. Using three quantitative trait loci (QTL)-related downstream tasks involving non-coding SNPs, we test these encoding strategies in combination with nine machine learning and deep learning models. Our findings demonstrate that semantic embeddings show strong robustness, and that both the choice of encoding strategy and the model used for downstream prediction are key variables influencing task performance. This benchmark provides actionable insights into the interplay between encoding strategies, models, and data properties, offering a framework for optimizing QTL prediction tasks and advancing the analysis of non-coding SNPs in genomic regulation.

Competing Interest Statement

The authors have declared no competing interest.

Details

Title: Benchmarking the coding strategies of non-coding mutations on sequence-based downstream tasks with machine learning
Publication title: bioRxiv; Cold Spring Harbor
Publication year: 2025
Publication date: Jan 2, 2025
Section: New Results
Publisher: Cold Spring Harbor Laboratory Press
Source: BioRxiv
Place of publication: Cold Spring Harbor
Country of publication: United States
University/institution: Cold Spring Harbor Laboratory Press
Publication subject:
ISSN: 2692-8205
Source type: Working Paper
Language of publication: English
Document type: Working Paper
ProQuest document ID: 3150948807
Document URL: https://www.proquest.com/working-papers/benchmarking-coding-strategies-non-mutations-on/docview/3150948807/se-2?accountid=208611
Copyright: © 2025. This article is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated: 2025-01-03
Database: ProQuest One Academic