
Abstract

Machine learning (ML) is revolutionizing field and laboratory studies of animals. However, a challenge when deploying ML for classification tasks is ensuring that the models are reliable. Currently, we evaluate models using performance metrics (e.g., precision, recall, F1), but these can overlook the ultimate aim, which is not the outputs themselves (e.g., detected species, individual identities, or behaviours) but their incorporation into hypothesis testing. As improving performance metrics has diminishing returns, particularly when data are inherently noisy (as human-labelled, animal-based data often are), researchers face a conundrum: invest more time in maximising metrics, or do the actual research. This raises the question: how much noise can we accept in ML models? Here, we start by describing an under-reported factor that can cause metrics to underestimate model performance. Specifically, ambiguity between categories, or mistakes in labelling validation data, produces hard ceilings that limit performance metrics. This likely widespread issue means that many models could be performing better than their metrics suggest. Next, we argue and show that imperfect models (e.g., those with low F1 scores) can still be usable. Using a case study on ML-identified behaviour from vulturine guineafowl accelerometer data, we first propose a simulation framework to evaluate the robustness of hypothesis testing when models make classification errors. Second, we show how to determine the utility of a model by supplementing existing performance metrics with 'biological validations'. This involves applying ML models to unlabelled data and using the models' outputs to test hypotheses for which we can anticipate the outcome. Together, we show that effect sizes and expected biological patterns can be detected even when performance metrics are relatively low (e.g., F1: 60-70%).
In doing so, we provide a roadmap for validation approaches to ML classification models, tailored to research in animal behaviour and other fields with noisy, biological data.
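The simulation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' actual framework; the behaviour classes, group compositions, and accuracy levels below are hypothetical, chosen only to show how one might corrupt "true" labels with classification errors and check whether an effect (here, a between-group difference in time spent foraging) remains detectable.

```python
import random

def corrupt(labels, accuracy, classes, rng):
    """Simulate classifier error: keep each label with probability
    `accuracy`, otherwise flip it to a uniformly chosen wrong class."""
    out = []
    for label in labels:
        if rng.random() < accuracy:
            out.append(label)
        else:
            out.append(rng.choice([c for c in classes if c != label]))
    return out

def proportion(labels, target):
    return sum(l == target for l in labels) / len(labels)

rng = random.Random(42)
classes = ["forage", "walk", "rest"]

# Hypothetical groups: group A truly forages 60% of the time, group B 40%,
# so the true effect size (difference in foraging proportion) is 0.20.
group_a = ["forage"] * 600 + ["walk"] * 250 + ["rest"] * 150
group_b = ["forage"] * 400 + ["walk"] * 350 + ["rest"] * 250

diffs = {}
for acc in (0.9, 0.7):
    a = corrupt(group_a, acc, classes, rng)
    b = corrupt(group_b, acc, classes, rng)
    diffs[acc] = proportion(a, "forage") - proportion(b, "forage")
    print(f"classifier accuracy {acc:.0%}: "
          f"estimated foraging difference = {diffs[acc]:.3f}")
```

Symmetric classification noise attenuates the estimated effect toward zero but, at moderate error rates, does not erase it: even at 70% accuracy the group difference remains clearly positive. A fuller simulation would use the model's actual confusion matrix rather than uniform errors, and repeat the corruption many times to quantify how often the hypothesis test reaches the correct conclusion.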

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

* The revision defines the scope of the paper more clearly (using machine learning to classify raw data for subsequent hypothesis testing). It also adds methodological details (an ethical note, and Figure S2 showing the alignment of accelerometer data with labels).

Details

Title
Moving towards more holistic machine learning-based approaches for classification problems in animal studies
Publication title
bioRxiv; Cold Spring Harbor
Publication year
2025
Publication date
Jan 27, 2025
Section
New Results
Publisher
Cold Spring Harbor Laboratory Press
Source
BioRxiv
Place of publication
Cold Spring Harbor
Country of publication
United States
University/institution
Cold Spring Harbor Laboratory Press
ISSN
2692-8205
Source type
Working Paper
Language of publication
English
Document type
Working Paper
Publication history
Milestone dates
2024-10-21 (Version 1)
ProQuest document ID
3160209667
Document URL
https://www.proquest.com/working-papers/moving-towards-more-holistic-machine-learning/docview/3160209667/se-2?accountid=208611
Copyright
© 2025. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-01-28
Database
ProQuest One Academic