Abstract

Ontologies are highly prevalent in biology and medicine and are constantly evolving. Annotating biological text, such as observed phenotype descriptions, with ontology terms is a challenging and tedious task. Annotation requires a contextual understanding of both the input text and the available ontology terms. Text-mining tools are available to assist, but they largely rely on directly matching words and phrases and so lack understanding of the meaning of the query item and of the ontology term labels. Large Language Models (LLMs), however, excel at tasks that require semantic understanding of input text and may therefore improve the auto-annotation of text with ontology terms. Here we describe a series of workflows that use OpenAI GPT’s capabilities to annotate Arabidopsis thaliana and forest tree phenotypic observations with ontology terms, aiming for results that resemble manually curated annotations. These workflows use an LLM to parse phenotypes into short concepts and then find appropriate ontology terms via embedding vector similarity or via Retrieval-Augmented Generation (RAG). RAG is a state-of-the-art approach that augments conversational prompts to the LLM with context-specific data, extending it beyond its pre-trained parameter space. We show that the RAG workflow produces the most accurate automated annotations, which are often highly similar or identical to expert-curated annotations.
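
The two retrieval ideas named in the abstract, ranking ontology terms by embedding similarity and augmenting an LLM prompt with the retrieved candidates, can be illustrated with a minimal sketch. This is not the paper's implementation. The ontology IDs and labels are made up, the function names (`embed`, `top_terms`, `build_rag_prompt`) are hypothetical, and the bag-of-words vector is only a toy stand-in for a real LLM embedding. No LLM is called; the sketch stops at building the augmented prompt.

```python
import math
from collections import Counter

# Hypothetical ontology terms: IDs and labels are made up for illustration.
ONTOLOGY = {
    "EX:0001": "leaf color",
    "EX:0002": "plant height",
    "EX:0003": "flowering time",
    "EX:0004": "root length",
}

def embed(text):
    """Toy stand-in for an LLM embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_terms(concept, k=2):
    """Rank ontology terms by similarity to a short phenotype concept."""
    query = embed(concept)
    scored = [(cosine(query, embed(label)), term_id, label)
              for term_id, label in ONTOLOGY.items()]
    scored.sort(reverse=True)
    return [(term_id, label) for _, term_id, label in scored[:k]]

def build_rag_prompt(concept, k=2):
    """Augment the prompt with retrieved candidates (the retrieval step of RAG)."""
    candidates = "\n".join(f"{tid}: {label}" for tid, label in top_terms(concept, k))
    return (f"Candidate ontology terms:\n{candidates}\n\n"
            f"Choose the best term for the phenotype concept: {concept}")

print(build_rag_prompt("reduced plant height"))
```

In a real workflow, `embed` would be replaced by an embedding model and the prompt would be sent to the LLM, which selects among the retrieved candidates rather than relying only on what it learned in pre-training.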

Details

1009240
Title
The effectiveness of large language models with RAG for auto-annotating trait and phenotype descriptions
Author
Kainer, David 1

1 Faculty of Science, School of Agriculture and Food Sustainability, The University of Queensland, St Lucia, QLD 4072, Australia
Publication title
Volume
10
Issue
1
Publication year
2025
Publication date
2025
Publisher
Oxford University Press
Place of publication
Oxford
Country of publication
United Kingdom
Publication subject
e-ISSN
23968923
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-02-26
Milestone dates
2025-01-21 (Received); 2025-02-14 (Revision received); 2025-02-24 (Accepted); 2025-03-04 (Corrected)
First posting date
2025-02-26
ProQuest document ID
3238701827
Document URL
https://www.proquest.com/scholarly-journals/effectiveness-large-language-models-with-rag-auto/docview/3238701827/se-2?accountid=208611
Copyright
© The Author(s) 2025. Published by Oxford University Press. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-09-27
Database
ProQuest One Academic