Abstract

Background/Objectives: This study examines the effects of train-time computation, test-time computation, and their combination on the performance of large language models applied to an electronic medical record quality management system. It identifies the most effective combination of models for enhancing clinical documentation performance and efficiency. Methods: A total of 597 clinical medical records were selected from the MEDEC-MS dataset, 10 of which were used for prompt engineering to guide model training. Eight large language models were employed, covering approaches that emphasize train-time computation and test-time computation. Model performance on specific error types was assessed using precision, recall, F1 score, and error correction accuracy. The dataset was divided into training and testing sets in a 7:3 ratio. An assembly model was built by applying binary logistic regression to combine the top-performing models; its performance was evaluated using area under the curve values and model weights. Results: GPT-4 and DeepSeek R1 demonstrated higher overall accuracy in detecting errors. Models that focus on train-time computation exhibited shorter reasoning times and stricter error detection, while models emphasizing test-time computation achieved higher error correction accuracy. GPT-4 was particularly effective in addressing issues related to causal organisms, management, and pharmacotherapy, whereas models focusing on test-time computation performed better in tasks involving diagnosis and treatment. The assembly model, which combines train-time and test-time computation, outperformed any single large language model (assembly model accuracy: 0.690 vs. GPT-4 accuracy: 0.477). Conclusions: Models focusing on train-time computation demonstrated greater efficiency in processing speed, while models focusing on test-time computation showed higher accuracy and interpretability in identifying quality issues in electronic medical records. Assembling the train-time and test-time computation strategies may strike a balance between high accuracy and model efficiency, thereby enhancing the development of electronic medical records and improving medical care.
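
To make the assembly step concrete, the sketch below illustrates how binary logistic regression could combine the error-detection outputs of top-performing models and be evaluated with a 7:3 split, AUC, and the classification metrics named above. It is a minimal illustration using synthetic data and hypothetical column names (gpt4_flag, deepseek_flag, label), not the authors' actual pipeline.

    # Minimal sketch of the assembly step described in the abstract:
    # binary logistic regression combines the error-detection outputs of the
    # top-performing LLMs, with a 7:3 train/test split and AUC evaluation.
    # All data and column names here are hypothetical placeholders.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

    # Hypothetical per-record predictions from individual LLMs (1 = error flagged)
    # and the reference label (1 = record actually contains an error).
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "gpt4_flag":     rng.integers(0, 2, 597),
        "deepseek_flag": rng.integers(0, 2, 597),
        "label":         rng.integers(0, 2, 597),
    })

    X = df[["gpt4_flag", "deepseek_flag"]]
    y = df["label"]

    # 7:3 split, matching the ratio reported in the abstract.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    assembly = LogisticRegression().fit(X_train, y_train)

    # Model weights (coefficients) and AUC, the two quantities the abstract
    # says were used to evaluate the assembly model.
    print("weights:", dict(zip(X.columns, assembly.coef_[0])))
    print("AUC:", roc_auc_score(y_test, assembly.predict_proba(X_test)[:, 1]))

    # Precision, recall, and F1 for the assembled predictions.
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, assembly.predict(X_test), average="binary"
    )
    print(f"precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")

In this formulation the learned coefficients double as the model weights mentioned in the abstract, giving an interpretable measure of each model's contribution to the assembly.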

Details

Title
Train-Time and Test-Time Computation in Large Language Models for Error Detection and Correction in Electronic Medical Records: A Retrospective Study
Author
Cai, Qiong 1; Yang, Lanting 1; Xiao, Jiangping 1; Ma, Jiale 1; Liu, Molei 2; Pan, Xilong 1

1 Department of Social Medicine and Health Education, School of Public Health, Peking University, Beijing 100191, China; [email protected] (Q.C.)
2 Department of Biostatistics, Peking University Health Science Center, Beijing 100191, China
First page
1829
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2075-4418
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3233140605
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.