INTRODUCTION
Studies have increasingly reported on the accuracy and utility of large language models (LLMs) and large multimodal models across a wide range of medical contexts, underscoring their potential to transform the healthcare landscape [1, 2, 3]. Nevertheless, substantial variability in methodologies and reporting practices has been observed across these studies, making their findings difficult to assess and interpret [4, 5]. As this body of research grows, the lack of standardized reporting has led to inconsistent methodological practices, complicating the assessment and replication of findings [4, 5, 6, 7].
Several reporting checklists and guidelines have been proposed to address these issues. The Minimum Reporting Items for Clear Evaluation of Accuracy Reports of Large Language Models in Healthcare (MI-CLEAR-LLM) checklist was recently developed to provide a set of essential items uniquely tailored to the use of LLMs, distinguishing it from checklists for general artificial intelligence (AI) applications [5]. These items represent the minimum requirements for transparent reporting of clinical studies that assess the accuracy of LLM use in healthcare applications. The MI-CLEAR-LLM checklist identifies six essential reporting items that address critical areas affecting the reproducibility and interpretability of LLM performance: 1) identification and specifications of the LLM used, 2) how stochasticity was managed, 3) detailed reporting of prompt wording and syntax, 4) how prompts were structured, 5) details of any prompt testing and optimization conducted, and 6) clarification of the independence of the test and training datasets [5]. Other guidelines, such as the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD)-LLM (available as a preprint at the time of the current study), the CANGARU project (ChatGPT and Artificial Intelligence Natural Large Language Models for Accountable Reporting and Use), and the Chatbot Assessment Reporting Tool (CHART), were still under development and had not yet been formally published [4, 8, 9, 10].
While one prior brief study examined adherence to a single MI-CLEAR-LLM checklist item, stochasticity, no study has comprehensively evaluated adherence to the checklist as a whole [11]. Therefore, we used the MI-CLEAR-LLM framework to evaluate current reporting practices in LLM-based medical studies. This study aimed to systematically examine how well these essential items were addressed and to provide an objective assessment of adherence to the framework. By identifying gaps in current reporting, our study seeks to highlight practices that can be improved to support more reliable research and applications of LLMs in healthcare.
MATERIALS AND METHODS
Institutional Review Board approval was waived for this retrospective study because it did not involve human participants. This study was performed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [12, 13]. The study protocol was registered in PROSPERO (number: CRD42024622870).
Literature Search Strategy and Study Selection
This study adopted the same literature search methodology as a previous study [11] that focused on investigating the reporting practices of stochasticity in original research evaluating the performance of LLMs in medical applications.
A systematic literature search of the PubMed database was conducted to identify research articles on the performance of LLMs in medical applications. The search covered articles published from November 30, 2022 (the release date of ChatGPT by OpenAI), through June 25, 2024. The search query was as follows: “(large language model) OR (chatgpt) OR (gpt-3.5) OR (gpt-4) OR (bard) OR (gemini) OR (claude) OR (chatbot).” By including the term ‘large language model,’ we expected to capture the majority of studies related to LLMs. Given the large number of potential candidate articles, we decided to focus on high-quality publications; accordingly, only original research articles from journals ranked within the top decile based on the 2023 Journal Impact Factor and indexed in the Science Citation Index Expanded within the ‘Clinical Medicine’ category of the Journal Citation Reports were selected. An experienced medical librarian (J.Y.) conducted the initial search for candidate articles, after which study selection was performed independently by two reviewers (C.H.S. and J.S.K., with 11 and 3 years of experience in conducting systematic reviews, respectively).
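For illustration only, the sketch below shows how the stated query and date window could be executed programmatically against PubMed through NCBI E-utilities using Biopython's Entrez module; the actual search in this study was performed by a medical librarian, and the contact email and retmax value in the sketch are placeholders.

```python
# Minimal sketch (not the authors' workflow): running the stated PubMed query
# over the stated date window through NCBI E-utilities via Biopython's Entrez.
from Bio import Entrez

Entrez.email = "researcher@example.org"  # placeholder; NCBI asks for a contact email

QUERY = ("(large language model) OR (chatgpt) OR (gpt-3.5) OR (gpt-4) OR "
         "(bard) OR (gemini) OR (claude) OR (chatbot)")

handle = Entrez.esearch(
    db="pubmed",
    term=QUERY,
    datetype="pdat",       # filter on publication date
    mindate="2022/11/30",  # release date of ChatGPT
    maxdate="2024/06/25",  # end of the search window
    retmax=10000,          # page with retstart or usehistory for larger result sets
)
record = Entrez.read(handle)
handle.close()

print(f"Candidate articles: {record['Count']}")
pmids = record["IdList"]   # PMIDs for downstream deduplication and screening
```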
Data Extraction and Analysis
Data were independently extracted by two reviewers: J.S.K. (a neuroradiologist and second-year clinical fellow) and H.H. (a research faculty member with expertise in deep learning development). Data on all six key items of the MI-CLEAR-LLM framework were collected and evaluated for each study [5]. The title, abstract, introduction, methods, results, discussion, and supplementary files of all the articles were evaluated for this purpose. In case of disagreement, consensus was reached through consultation with C.H.S., a neuroradiologist with experience in LLM-based research, who supervised the data extraction process. The proportion of articles adhering to each item was calculated, and the following aspects were analyzed:
• Item 1: Identification and specifications of the LLM used, including 1) LLM name, 2) version, 3) manufacturer, 4) cutoff date of training data, 5) whether the LLM had access to web-based information (e.g., retrieval-augmented generation [RAG]), and 6) date of query attempts.
• Item 2: How stochasticity was handled. This analysis focused on reverifying whether stochasticity-related issues were clearly reported; stochasticity-related reporting details have been omitted because they were specifically addressed in a previous study [11]. An illustrative sketch of the kind of settings involved follows this list.
• Item 3: Reporting the full text of prompts, including precise spelling, symbols, punctuation, and spaces.
• Item 4: How prompts were employed, including 1) whether each query and corresponding prompt were treated as individual chat sessions or whether multiple queries were processed together in a single session, and 2) whether multiple queries were input simultaneously or sequentially across multiple chat rounds. To categorize LLM access methods, we classified the studies into three types: browser-based public interface access (e.g., ChatGPT), public application programming interface (API) access, and local LLMs or institutional API access.
• Item 5: Prompt testing and optimization details, including 1) steps taken to create the prompts, and 2) rationale behind selecting specific wording over alternatives.
• Item 6: Whether the test dataset was independent, including 1) whether any portion of the test data was used during model training or prompt optimization, and 2) whether data were sourced from the internet and the exact source URLs were provided.
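As a hedged illustration of item 2 in the list above, the sketch below shows the kind of stochasticity-related settings a study could fix and report when accessing an LLM through a public API; the OpenAI Python SDK, model name, and parameter values are examples chosen for illustration, not configurations used by the reviewed studies.

```python
# Illustrative only: fixing and reporting stochasticity-related settings when an
# LLM is accessed through a public API (OpenAI Python SDK shown as an example;
# the model name and parameter values are placeholders, not study configurations).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

N_ATTEMPTS = 3  # number of repeated query attempts, which item 2 asks studies to report

def ask_repeatedly(prompt: str) -> list[str]:
    """Submit the same prompt N_ATTEMPTS times, each as an independent request."""
    answers = []
    for _ in range(N_ATTEMPTS):
        response = client.chat.completions.create(
            model="gpt-4o",    # exact model/version should also be reported (item 1)
            temperature=0,     # reduces, but does not eliminate, output variability
            seed=42,           # best-effort determinism where the API supports it
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(response.choices[0].message.content)
    return answers

# Reporting all repeated responses, or an agreement statistic computed across them,
# makes the handling of stochasticity transparent and reproducible.
```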
Additionally, the included studies were categorized as either radiology-related or belonging to other fields.
Statistical Analysis
Agreement between the results extracted by the two reviewers was analyzed using Cohen’s kappa (κ) to assess interrater reliability. A κ value >0.80 indicated almost perfect agreement, whereas values of 0.61–0.80, 0.41–0.60, 0.21–0.40, and <0.20 indicated substantial, moderate, fair, and poor agreement, respectively [14]. Adherence percentages were calculated, and adherence rates were compared between radiology-related studies and those in other fields using chi-square and Fisher’s exact tests. Statistical significance was set at P < 0.05. All statistical analyses were conducted using SPSS software (version 27.0 for Windows; IBM Corp., Armonk, NY, USA).
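For readers who prefer open-source tooling, the following minimal sketch reproduces the same types of analyses (Cohen’s κ, chi-square, and Fisher’s exact tests) in Python; the study itself used SPSS, and the arrays below are illustrative placeholders rather than study data.

```python
# Minimal sketch of the analyses described above using open-source Python tools
# instead of SPSS; all numbers below are illustrative placeholders, not study data.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import chi2_contingency, fisher_exact

# Interrater reliability: per-item judgments (1 = reported, 0 = not reported) by each reviewer.
reviewer_1 = [1, 0, 1, 1, 0, 1, 1, 0]
reviewer_2 = [1, 0, 1, 0, 0, 1, 1, 0]
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.2f}")

# Adherence to one checklist item: radiology-related studies vs. other fields.
# Rows = field; columns = [adhered, did not adhere].
table = [[12, 6],    # radiology-related (hypothetical counts)
         [55, 86]]   # other fields (hypothetical counts)
chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)  # preferred when expected cell counts are small
print(f"Chi-square P = {p_chi2:.3f}; Fisher's exact P = {p_fisher:.3f}")
```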
RESULTS
Literature Search
The systematic search initially yielded 13,515 articles. A total of 1,149 duplicate studies and 10,478 articles in journals not ranked within the top decile of each of the 59 categories under the Journal Citation Reports ‘Clinical Medicine’ group were excluded. The abstracts of the 1,888 remaining articles were then screened, of which 1,636 were excluded (846 articles that were unrelated to the field of interest, 733 review articles, 42 editorials, 10 surveys, and 5 case series). Full-text reviews of the 252 remaining potentially eligible articles were performed, after which 93 articles were excluded (64 articles unrelated to the field of interest, 19 review articles, 9 editorials, and 1 case series). Finally, 159 original research articles were included in the analysis (Fig. 1). A list of the included articles is provided in Supplementary Table 1.
Fig. 1
Flowchart of the literature search process. LLM = large language model
To ensure transparency, the current study utilized the same dataset as that used in a previous study [11]. However, that study focused exclusively on reporting practices related to stochasticity. By contrast, the current study applied the full MI-CLEAR-LLM checklist, which was not available at the time of the earlier analysis, to provide a comprehensive evaluation of reporting practices across the six key items. This broader approach allowed for a more comprehensive assessment of the reporting quality in LLM-based medical studies.
Characteristics of Included Studies
The subject areas of the studies covered a broad range of medical applications (Table 1). General medicine was the most common field, represented by 54 studies (34.0%, 54/159), followed by radiology and nuclear medicine (11.3%, 18/159) and ophthalmology (8.2%, 13/159). Other subject areas such as oncology, neurosurgery, gastroenterology, urology, neurology, pediatrics, obstetrics and gynecology, emergency medicine, cardiology, orthopedic surgery, and psychiatry were also represented. Eighteen studies were radiology-related and 141 were in other fields.
Table 1
Subject fields of included articles (n = 159)
Adherence to the MI-CLEAR-LLM Checklist
Among the 159 studies analyzed, the number of MI-CLEAR-LLM checklist components that were satisfied by each study varied widely. Specifically, 1 study satisfied only 2 components, whereas the highest number of components satisfied was 11, which was achieved by 2 studies. The distribution of checklist adherence was as follows: 2 components (1 study), 3 components (6 studies), 4 components (9 studies), 5 components (39 studies), 6 components (34 studies), 7 components (30 studies), 8 components (24 studies), 9 components (9 studies), 10 components (5 studies), and 11 components (2 studies). A comparison between radiology-related studies (n = 18) and studies in other fields (n = 141) revealed significant differences for item 1-4 (cutoff date for data used to train the LLM, P = 0.017) and item 3-1 (full text of prompts with exact wording and syntax used, P = 0.01). No statistically significant differences were observed for the other checklist items. The interrater reliability between the two reviewers, as measured by Cohen’s κ, was 0.69, indicating substantial agreement. Details regarding the extent to which the checklist items were addressed are provided in Table 2 and illustrated in Figure 2.
Fig. 2
Proportions of articles that satisfied each item of the MI-CLEAR-LLM checklist (the results for item 2 were cited from Suh et al. [11]). 1-1 = LLM name, 1-2 = version, 1-3 = manufacturer, 1-4 = cutoff date for the data used to train the LLM, 1-5 = whether the LLM had access to web-based information (e.g., RAG), 1-6 = date of querying attempts, 2-1 = how stochasticity was handled, 3-1 = full text of prompts with exact wording and syntax used, 4-1 = whether each query and its corresponding prompts were treated as individual chat sessions or if multiple queries were processed together in a single session, 4-2 = whether the multiple queries were input simultaneously or sequentially across multiple chat rounds, 5-1 = steps taken to create the prompts, 5-2 = rationale behind selecting specific wording over alternatives, 6-1 = whether any portion of the test data was used in the model training or prompt testing and optimization, and 6-2 = if sourced from the internet, the exact URLs where they can be found. The percentage for 6-2 was calculated based on 76 studies that sourced test data from the internet. Otherwise, the denominator was 159 studies. MI-CLEAR-LLM = Minimum Reporting Items for Clear Evaluation of Accuracy Reports of Large Language Models in Healthcare, RAG = retrieval-augmented generation
Table 2
Adherence to the MI-CLEAR-LLM checklist
Item 1: Identification and Specification of the LLM Used
All 159 studies (100%, 159/159) reported the name of the LLM used, ensuring clarity regarding which model was evaluated. The version of the LLM was mentioned in 154 studies (96.9%, 154/159). Additionally, 146 studies (91.8%, 146/159) provided the name of the manufacturer, further enhancing transparency. However, the cutoff date for the training data, an important factor in understanding the knowledge scope and limitations of LLMs, was reported in only 86 studies (54.1%, 86/159). Furthermore, only 10 studies (6.3%, 10/159) mentioned whether the LLM had access to web-based information such as RAG. The dates of the query attempts were reported in 81 studies (50.9%, 81/159).
Item 2: How Stochasticity Was Handled
Only 15.1% of the studies (24/159) provided clear documentation of stochasticity-related factors, whereas 84.3% (134/159) did not report the number of query attempts. Furthermore, only 12.7% (20/158, excluding one study that specified a single attempt) included a reliability analysis of the results of repeated queries [11].
Item 3: Full Text of Prompts with Exact Wording and Syntax Used
Of the 159 studies, 78 (49.1%) provided detailed information on at least one element related to the exact wording and syntax of the prompts employed. These elements typically included the following.
• Precise spelling: 78 studies (49.1%) ensured that the spelling of each term in the prompts was consistent across multiple query attempts.
• Symbols used: 28 studies (17.6%) described the use of special characters or symbols within prompts, such as curly braces for JavaScript Object Notation (JSON) structures and square brackets as placeholders.
• Punctuation: 30 studies (18.9%) described the strategic use of quotation marks, commas, and colons to structure prompt templates.
• Spaces: 78 studies (49.1%) documented the use of spaces, including line breaks.
• Other relevant syntax: 30 studies (18.9%) mentioned additional syntactic elements such as capitalization, indentation, or formatting cues (e.g., the use of [ANSWER] placeholders).
However, a large proportion of the studies (50.9%, 81/159) did not provide sufficient details regarding the exact syntax or structure of the prompts utilized, including information on special characters or spacing.
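As a purely hypothetical example of the level of detail item 3 calls for, the snippet below shows a prompt template reported verbatim, with its symbols, punctuation, bracketed placeholders, and line breaks preserved; the template and placeholder names are invented for illustration and do not come from any reviewed study.

```python
# Hypothetical example (not drawn from any reviewed study) of a prompt template
# reported verbatim, including symbols, punctuation, placeholders, and line breaks.
PROMPT_TEMPLATE = """You are a board-certified radiologist.
Read the clinical vignette below and answer the question.

Vignette: [VIGNETTE]
Question: [QUESTION]

Respond ONLY in JSON with exactly these keys:
{"diagnosis": "...", "confidence": "high|moderate|low"}"""

def fill_prompt(vignette: str, question: str) -> str:
    """Insert case-specific text into the bracketed placeholders."""
    return (PROMPT_TEMPLATE
            .replace("[VIGNETTE]", vignette)
            .replace("[QUESTION]", question))
```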
Item 4: Detailed Explanation of How the Prompts Were Specifically Employed
Of the 159 studies, only 54 (34.0%, 54/159) offered explicit descriptions of their query-handling approach. These studies used clear terminology such as “individual sessions,” “separate conversations,” or “continuous dialogue” to describe their LLM interaction structure, indicating whether queries were processed individually or in conjunction with other prompts. Among the studies, 84 (52.8%) specified their LLM access methods: 55 employed public interface access (e.g., ChatGPT), 29 utilized public API access, and 17 used local LLMs or institutional API access.
In addition, 55 studies (34.6%, 55/159) documented their query input methodologies such as simultaneous input (batch processing) or sequential input across multiple chat rounds. In the context of our analysis, “multiple chat rounds” refers to back-and-forth interactions occurring either within a single session or across different dates. When queries were introduced sequentially, they were typically independent of each other, unlike chain prompting, in which subsequent queries build upon previous responses. For example, a researcher may first inquire about disease symptoms, followed by an independent question about treatment options, with each query standing alone rather than building upon previous responses.
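To make this distinction concrete, the hedged sketch below contrasts the two interaction structures described above: querying each prompt in an independent chat session versus appending queries sequentially to a single continuous session. The OpenAI chat format, model name, and questions are illustrative assumptions, not code from the reviewed studies.

```python
# Illustrative sketch of the two interaction structures described above, using the
# OpenAI chat format as one example; the model name and questions are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"
questions = ["What are the typical symptoms of disease X?",
             "What are the treatment options for disease X?"]

# (A) Each query in an independent chat session: no context is shared between questions.
independent_answers = []
for q in questions:
    r = client.chat.completions.create(model=MODEL,
                                       messages=[{"role": "user", "content": q}])
    independent_answers.append(r.choices[0].message.content)

# (B) Sequential queries within one continuous session: each question is appended to
# the running conversation, so earlier answers can influence later ones.
history = []
session_answers = []
for q in questions:
    history.append({"role": "user", "content": q})
    r = client.chat.completions.create(model=MODEL, messages=history)
    answer = r.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    session_answers.append(answer)
```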
Item 5: Whether Prompt Testing and Optimization Were Used and, If so, Their Details
Among the 159 studies examined, 74 (46.5%, 74/159) provided insights into the steps taken to create and refine their prompts. These studies often discussed methodologies such as “prompt engineering,” “iterative development,” or “prompt design processes,” detailing how prompts were tested and optimized for improved results.
In contrast, only 25 studies (15.7%, 25/159) addressed the rationale behind selecting specific wording over alternatives. These studies explained why particular words, phrases, or syntax were chosen, often through comparative analyses of different prompt versions.
Item 6: Whether the Test Data Were Independent of the Model’s Training Data
Only 21 studies (13.2%, 21/159) clearly reported whether any portion of the test data had been used during model training or prompt optimization, information that is needed to confirm that evaluation results were not biased by prior exposure to the test data. Additionally, 43 (56.6%) of the 76 studies that sourced test data from the internet provided the exact URLs from which the data were obtained.
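One simple, hypothetical way to document the first part of item 6 is to fingerprint test items and check them against any set used for prompt testing or optimization, as sketched below; the datasets and helper function are invented for illustration.

```python
# Hypothetical sketch of how a study could verify and report item 6-1: that no test
# item was reused during prompt testing or optimization. Datasets and the helper
# function are invented for illustration.
import hashlib

def fingerprint(text: str) -> str:
    """Normalize whitespace/case and hash a question so overlap can be checked."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

prompt_dev_set = ["What is the first-line treatment for condition Y?"]
test_set = ["What are the typical symptoms of disease X?",
            "What is the first-line treatment for condition Y?"]

overlap = {fingerprint(q) for q in prompt_dev_set} & {fingerprint(q) for q in test_set}
print(f"Items shared between prompt-development and test sets: {len(overlap)}")
```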
DISCUSSION
This study assessed reporting practices in clinical research utilizing LLMs, focusing on key elements such as model identification, prompt usage, query handling, and dataset independence. As highlighted in a previous review, reproducibility issues in AI research often stem from incomplete documentation and variations in model performance across different runs, even when using the same code, owing to factors such as random initialization and specific training parameters [15]. Our analysis revealed considerable variation in reporting practices across studies, with some aspects being consistently reported, while others were characterized by notable gaps in documentation, reflecting the broader challenges in AI reproducibility [16].
All the studies analyzed (159/159) clearly identified the LLM used by name, and a high percentage (96.9%) reported the version number, allowing readers to accurately trace the specific models used. In addition, the manufacturer was identified in 91.8% of the studies, reflecting a strong trend in reporting basic technical details. However, only 54.1% of the studies included the training data cutoff date, which is a critical element for understanding the relevance of the results obtained using a model. Mitchell et al. [17] emphasized that such temporal boundaries are essential components of model documentation and significantly influence the interpretation of the results. Furthermore, only 6.3% of the studies reported whether the LLM had access to web-based information (i.e., RAG), and 50.9% mentioned the dates on which the model was queried. These specifications are crucial for ensuring the reproducibility and validation of machine learning research results [18].
Only 15.1% of the studies provided clear documentation of stochasticity-related factors, as previously reported [11]. This finding indicates a significant deficiency in how studies assessing LLM performance in medical applications document these factors and underscores the urgent need for greater transparency and more detailed reporting of stochasticity, ideally through stricter adherence to established reporting checklists.
Only 49.1% of the studies provided detailed information regarding the specific wording and syntax of the prompts used. While traditional journal space limitations may constrain the inclusion of comprehensive prompt details in the main text, Belz et al. [19] demonstrated that supplementary materials provide an ideal platform for documenting critical information in natural language processing research. The importance of complete prompt documentation aligns with the reproducibility standards proposed by Heil et al. [18], which emphasize the need for comprehensive methodological documentation in machine learning studies.
Only about one-third of the studies provided sufficient information on how prompts were employed: whether queries were processed as individual chat sessions or as part of a continuous session (34.0%), and whether multiple queries were input simultaneously or sequentially (34.6%). As identified in a previous review, these technical details significantly influence how AI systems interact with inputs and consequently affect study outcomes [16]. Underreporting of session structures and query input methods creates significant challenges for the reproducibility of results. Andaur Navarro et al. [20] found systematic issues in reporting standards across machine learning-based prediction studies, highlighting the need for more rigorous documentation of methodological details, including protocols for interaction with AI systems.
While 46.5% of the studies detailed the steps taken to create and refine prompts, only 15.7% explained the rationale behind specific word choices. This finding aligns with broader concerns about the reproducibility of AI research [15]. Our analysis indicates that researchers should provide comprehensive documentation of their prompt testing processes as supplementary material, following the documentation standards proposed by Mitchell et al. [17].
Only 13.2% of the studies explicitly reported whether the test data used were independent of the training data, and 56.6% of those using internet-derived test data provided the source URLs. This lack of transparency regarding test data independence significantly complicates the evaluation of LLM performance [21]. Paullada et al. [22] further emphasized that clear documentation of data sources and their independence is crucial for establishing the reliability of reported results. The absence of clear documentation regarding potential overlap between the training and test data makes it difficult to assess whether a model truly demonstrates generalization capabilities. This concern echoes the findings of Andaur Navarro et al. [20] regarding systematic issues in reporting standards and potential biases in machine learning studies. A recent critical assessment of 23 state-of-the-art LLM benchmarks highlighted key limitations, including biases, difficulties in measuring genuine reasoning, and risks of contamination between the training and evaluation processes [23, 24]. Such issues emphasize the need for standardized methodologies and dynamic evaluation frameworks that can better capture the complex behavior of LLMs while mitigating potential biases.
This study aimed to analyze adherence to the MI-CLEAR-LLM checklist, focusing specifically on methodological elements unique to LLM research. Although LLM studies potentially encompass a broader range of evaluation dimensions, the MI-CLEAR-LLM checklist emphasizes the essential items necessary for assessing the clinical accuracy and reliability of LLM-based evaluations. A recent systematic review [25] highlighted that most LLM evaluations focus on assessing knowledge using sources such as United States Medical Licensing Examination (USMLE) test questions, with limited attention paid to real clinical data and issues such as fairness, bias, toxicity, and deployment considerations. Although a more extensive evaluation of LLM research is possible, the MI-CLEAR-LLM checklist concentrates on specific methodological reporting elements to ensure reproducibility and transparency. Thus, although this study's narrow focus on LLM-specific reporting elements could be considered a limitation, a broader evaluation was beyond our scope, which was to address the unique features of LLM methodologies in healthcare applications.
In summary, this analysis revealed significant gaps in how key aspects of LLM-related research are currently reported, particularly in areas such as stochasticity handling, prompt usage, and test data independence. While the technical specifications of LLMs, such as the model name and version, are generally well documented, other methodological details affecting LLM performance are often overlooked. This hinders the reproducibility of studies and limits our understanding of LLMs. By improving the transparency of LLM research, the broader scientific community can foster a deeper understanding of these powerful models and ensure that future studies are conducted with greater credibility and reliability.
Supplement
The Supplement is available with this article at https://doi.org/10.3348/kjr.2024.1161.
REFERENCES
1. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States medical licensing examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023;9:e45312.
2. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198.
3. Suh PS, Shim WH, Suh CH, Heo H, Park CR, Eom HJ, et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from diagnosis please cases. Radiology 2024;312:e240273.
4. CHART Collaborative. Protocol for the development of the Chatbot Assessment Reporting Tool (CHART) for clinical advice. BMJ Open 2024;14:e081155.
5. Park SH, Suh CH, Lee JH, Kahn CE, Moy L. Minimum reporting items for clear evaluation of accuracy reports of large language models in healthcare (MI-CLEAR-LLM). Korean J Radiol 2024;25:865–868.
6. Park SH, Suh CH. Reporting guidelines for artificial intelligence studies in healthcare (for both conventional and large language models): what's new in 2024. Korean J Radiol 2024;25:687–690.
7. Huo B, Cacciamani GE, Collins GS, McKechnie T, Lee Y, Guyatt G. Reporting standards for the use of large language model-linked chatbots for health advice. Nat Med 2023;29:2988.
8. Tejani AS, Yi P, Bluethgen C, D'Antonoli TA, Huisman M. Best practices in large language model reporting [accessed on November 5, 2024]. Available at: https://pubs.rsna.org/page/ai/blog/2024/9/ryai_editorsblog093024?doi=10.1148%2Fryai&publicationCode=ai.
9. Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, et al. The TRIPOD-LLM statement: a targeted guideline for reporting large language models use. medRxiv [Preprint]. 2024 [accessed on November 10, 2024]. Available at: https://doi.org/10.1101/2024.07.24.24310930.
10. Cacciamani GE, Collins GS, Gill IS. ChatGPT: standard reporting guidelines for responsible use. Nature 2023;618:238.
11. Suh CH, Yi J, Shim WH, Heo H. Insufficient transparency in stochasticity reporting in large language model studies for medical applications in leading medical journals. Korean J Radiol 2024;25:1029–1031.
12. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71.
13. Park HY, Suh CH, Woo S, Kim PH, Kim KW. Quality reporting of systematic review and meta-analysis according to PRISMA 2020 guidelines: results from recently published papers in the Korean Journal of Radiology. Korean J Radiol 2022;23:355–369.
14. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.
15. Hutson M. Artificial intelligence faces reproducibility crisis. Science 2018;359:725–726.
16. Gundersen OE, Kjensmo S. State of the art: reproducibility in artificial intelligence [accessed on November 10, 2024]. Available at: https://doi.org/10.1609/aaai.v32i1.11503.
17. Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, et al. Model cards for model reporting [accessed on November 10, 2024]. Available at: https://doi.org/10.1145/3287560.3287596.
18. Heil BJ, Hoffman MM, Markowetz F, Lee SI, Greene CS, Hicks SC. Reproducibility standards for machine learning in the life sciences. Nat Methods 2021;18:1132–1135.
19. Belz A, Agarwal S, Shimorina A, Reiter E. A systematic review of reproducibility research in natural language processing. arXiv [Preprint]. 2021 [accessed on November 10, 2024]. Available at: https://doi.org/10.48550/arXiv.2103.07929.
20. Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Systematic review finds "spin" practices and poor reporting standards in studies on machine learning-based prediction models. J Clin Epidemiol 2023;158:99–110.
21. Kapoor S, Narayanan A. Leakage and the reproducibility crisis in ML-based science. arXiv [Preprint]. 2022 [accessed on November 10, 2024]. Available at: https://doi.org/10.48550/arXiv.2207.07048.
22. Paullada A, Raji ID, Bender EM, Denton E, Hanna A. Data and its (dis)contents: a survey of dataset development and use in machine learning research. Patterns (N Y) 2021;2:100336.
23. McIntosh TR, Susnjak T, Arachchilage N, Liu T, Watters P, Halgamuge MN. Inadequacies of large language model benchmarks in the era of generative artificial intelligence. arXiv [Preprint]. 2024 [accessed on December 15, 2024]. Available at: https://doi.org/10.48550/arXiv.2402.09880.
24. Jegorova M, Kaul C, Mayor C, O'Neil AQ, Weir A, Murray-Smith R, et al. Survey: leakage and privacy at inference time. IEEE Trans Pattern Anal Mach Intell 2023;45:9090–9108.
25. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 2024 Oct 15 [Epub ahead of print]. doi: 10.1001/jama.2024.21700.
Ji Su Ko
Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul
Hwon Heo
Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, University of Ulsan College of Medicine, Asan Medical Center, Seoul
Chong Hyun Suh
Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul
Jeho Yi
Asan Medical Library, University of Ulsan College of Medicine, Seoul
Woo Hyun Shim
Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul