Abstract

Ontologies are widely used in biology and medicine and are continually evolving. Annotating biological text, such as descriptions of observed phenotypes, with ontology terms is a challenging and tedious task, because it requires a contextual understanding of both the input text and the available ontology terms. Text-mining tools can assist, but they rely largely on direct matching of words and phrases and so do not capture the meaning of the query item or of the ontology term labels. Large Language Models (LLMs), by contrast, excel at tasks that require semantic understanding of input text and may therefore improve the auto-annotation of text with ontology terms. Here we describe a series of workflows that incorporate OpenAI GPT to annotate Arabidopsis thaliana and forest tree phenotypic observations with ontology terms, aiming for results that resemble manually curated annotations. The workflows use an LLM to parse phenotype descriptions into short concepts, then find appropriate ontology terms via embedding vector similarity or via Retrieval-Augmented Generation (RAG). RAG is a state-of-the-art approach that augments conversational prompts to the LLM with context-specific data, extending it beyond what is encoded in its pre-trained parameters. We show that the RAG workflow produces the most accurate automated annotations, which are often highly similar or identical to expert-curated annotations.
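
The two term-finding steps named in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the ontology IDs and labels are made up, the prompt wording is hypothetical, and the `embed` function is a bag-of-words stand-in for a real embedding model. Because it only counts shared words, it reproduces the very word-matching limitation the abstract describes; it is used here solely to show the mechanics of ranking by cosine similarity and of placing the retrieved candidates into a prompt.

```python
import math
from collections import Counter

# Hypothetical mini-ontology: IDs and labels are invented for illustration
# and do not come from any real ontology.
ONTOLOGY_TERMS = {
    "EX:0001": "leaf shape",
    "EX:0002": "root length",
    "EX:0003": "flowering time",
}


def embed(text):
    """Stand-in embedding: a word-count vector.

    A real workflow would call an embedding model here instead.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_terms(concept, terms=ONTOLOGY_TERMS, top_k=2):
    """Return the top_k (term_id, score) pairs most similar to the concept."""
    query = embed(concept)
    scored = [(tid, cosine(query, embed(label))) for tid, label in terms.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def build_rag_prompt(concept, terms=ONTOLOGY_TERMS, top_k=2):
    """Assemble a prompt that supplies retrieved candidate terms as context.

    The wording is hypothetical; the string would be sent to an LLM,
    which is not called in this sketch.
    """
    lines = [f"- {tid}: {terms[tid]}" for tid, _ in rank_terms(concept, terms, top_k)]
    return (
        f"Phenotype concept: {concept}\n"
        "Candidate ontology terms:\n"
        + "\n".join(lines)
        + "\nChoose the best matching term ID, or answer 'none'."
    )


if __name__ == "__main__":
    print(rank_terms("late flowering time"))
    print(build_rag_prompt("late flowering time"))
```

In the embedding-similarity route, the top-ranked term would be taken as the annotation directly. In the RAG route, the retrieved candidates are passed to the LLM as context and the model makes the final choice.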

Details

Title
The effectiveness of large language models with RAG for auto-annotating trait and phenotype descriptions
Author
Kainer, David 1

Faculty of Science, School of Agriculture and Food Sustainability, The University of Queensland, St Lucia, QLD 4072, Australia; ARC Centre of Excellence for Plant Success in Nature and Agriculture, St Lucia, QLD 4072, Australia
Section
Artificial Intelligence In Biology And Bioinformatics
Publication year
2025
Publication date
2025
Publisher
Oxford University Press
e-ISSN
2396-8923
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3260067545
Copyright
© The Author(s) 2025. Published by Oxford University Press. This work is published under https://creativecommons.org/licenses/by-nc/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.