Abstract

Milk oligosaccharides are bioactive components that regulate the composition of the neonatal microbiota and exert immunomodulatory functions. Their beneficial effects depend on their structure. Numerous studies have shown intra- and inter-species variation in the structural composition and concentration of these compounds in mammalian milk, yet the biological significance of such variation remains poorly understood. Automated natural language processing methods are promising tools for extracting and gathering structured data from unstructured texts to gain insight into the biological significance of milk oligosaccharide variation across mammals. These methods require training and evaluation on manually annotated text corpora. While annotated corpora exist for chemical substances, none are specifically designed for training natural language processing models to extract information on milk oligosaccharides. To fill this gap, we propose MilkOligoCorpus, a new gold-standard corpus on milk oligosaccharide composition in mammalian species. The MilkOligoCorpus annotation scheme is a rich entity/relation model designed to describe the diversity pattern of milk oligosaccharides according to female factor variability and to help better understand the structure-related function of milk oligosaccharides. MilkOligoCorpus consists of 30 PubMed texts fully annotated with entities related to individuals, samples, oligosaccharides and oligosaccharide quantification, linked by binary and n-ary relationships. To address data interoperability across disparate publications and databases, four terminological resources, supported by external ontologies, were also developed to assign unique identifiers to the entities. This paper presents the creation of MilkOligoCorpus and its associated schema, along with the development of annotation guidelines and terminological resources. We also present experimental results obtained by baseline information extraction models on the corpus.

Competing Interest Statement

The authors have declared no competing interest.

Details

Title
MilkOligoCorpus: a semantically annotated resource for knowledge extraction on mammalian milk oligosaccharides
Publication title
bioRxiv; Cold Spring Harbor
Publication year
2025
Publication date
Feb 12, 2025
Section
New Results
Publisher
Cold Spring Harbor Laboratory Press
Source
bioRxiv
Place of publication
Cold Spring Harbor
Country of publication
United States
University/institution
Cold Spring Harbor Laboratory Press
Publication subject
ISSN
2692-8205
Source type
Working Paper
Language of publication
English
Document type
Working Paper
ProQuest document ID
3165948716
Document URL
https://www.proquest.com/working-papers/milkoligocorpus-semantically-annotated-resource/docview/3165948716/se-2?accountid=208611
Copyright
© 2025. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-02-13
Database
ProQuest One Academic