© 2025. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Background: Operative notes are frequently mined for surgical concepts in clinical care, research, quality improvement, and billing, often requiring hours of manual extraction. These notes are typically analyzed at the document level to determine the presence or absence of specific procedures or findings (eg, whether a hand-sewn anastomosis was performed or contamination occurred). Extracting several binary classification labels simultaneously is a multilabel classification problem. Traditional natural language processing approaches, namely bag-of-words (BoW) and term frequency-inverse document frequency (tf-idf) with linear classifiers, have been used previously for this task but are now being augmented or replaced by large language models (LLMs). However, few studies have examined the utility of LLMs in surgery.

Objective: We developed and evaluated LLMs to expedite data extraction from surgical notes.

Methods: A total of 388 exploratory laparotomy notes from a single institution were annotated for 21 concepts related to intraoperative findings, intraoperative techniques, and closure techniques. Annotation consistency was measured using the Cohen κ statistic. Data were preprocessed to include only the description of the procedure. We compared successive generations of document classification technologies, from BoW and tf-idf to encoder-only (Clinical-Longformer) and decoder-only (Llama 3) transformer models. Multilabel classification performance was evaluated using 5-fold cross-validation with F1-score and Hamming loss (HL). We experimented with and without context. Errors were assessed by manual review. Code and implementation instructions may be found on GitHub.
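As an illustration of the two evaluation metrics named above, the following is a minimal sketch of micro-averaged F1-score and Hamming loss for multilabel predictions. It is not the authors' implementation (their code is on GitHub); the function names `micro_f1` and `hamming_loss` and the toy label matrices are assumptions made for this example. Each row is one note and each column is one binary concept label.

```python
from typing import List

Matrix = List[List[int]]  # rows = documents, columns = binary labels


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all labels and documents."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total if total else 0.0


# Hypothetical toy data: 3 notes, 3 concept labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

print(round(micro_f1(y_true, y_pred), 3))      # 0.8
print(round(hamming_loss(y_true, y_pred), 3))  # 0.222
```

In this toy case there are 4 true positives, 1 false positive, and 1 false negative, so micro F1 is 8/10 = 0.8, and 2 of the 9 label decisions are wrong, so HL is about 0.22. Because HL counts correct negatives as successes, rare labels can yield a low HL alongside a modest F1-score.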

Results: The prevalence of labels ranged from 0.05 (colostomy, ileostomy, active bleed from named vessel) to 0.50 (running fascial closure). Llama 3.3 was the overall best-performing model (micro F1-score 0.88, 5-fold range: 0.88-0.89; HL 0.11, 5-fold range: 0.11-0.12). The BoW model (micro F1-score 0.68, 5-fold range: 0.64-0.71; HL 0.14, 5-fold range: 0.13-0.16) and Clinical-Longformer (micro F1-score 0.73, 5-fold range: 0.70-0.74; HL 0.11, 5-fold range: 0.10-0.12) had overall similar performance, with tf-idf models trailing (micro F1-score 0.57, 5-fold range: 0.55-0.59; HL 0.27, 5-fold range: 0.25-0.29). F1-scores varied across concepts in the Llama model, ranging from 0.30 (5-fold range: 0.23-0.39) for class III contamination to 0.92 (5-fold range: 0.84-0.98) for bowel resection. Context enhanced Llama’s performance, improving F1-scores by an average of 0.16. Error analysis revealed semantic nuances and edge cases within operative notes, particularly when notes referenced prior operations or described simultaneous operations with other surgical services.

Conclusions: Off-the-shelf autoregressive LLMs outperformed fine-tuned, encoder-only transformers and traditional natural language processing techniques in classifying operative notes. Multilabel classification with LLMs may streamline retrospective reviews in surgery, though further refinements are required prior to reliable use in research and quality improvement.

Details

Title
Language Models for Multilabel Document Classification of Surgical Concepts in Exploratory Laparotomy Operative Notes: Algorithm Development Study
Author
Balch, Jeremy A; Desaraju, Sasank S; Nolan, Victoria J; Vellanki, Divya; Buchanan, Timothy R; Brinkley, Lindsey M; Penev, Yordan; Bilgili, Ahmet; Patel, Aashay; Chatham, Corinne E; Vanderbilt, David M; Uddin, Rayon; Bihorac, Azra; Efron, Philip; Loftus, Tyler J; Rahman, Protiva; Shickel, Benjamin
First page
e71176
Section
AI Language Models in Health Care
Publication year
2025
Publication date
2025
Publisher
JMIR Publications
e-ISSN
22919694
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3232146741