Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers

Abstract

Large language models such as ChatGPT can produce increasingly realistic text, but the accuracy and integrity of using these models in scientific writing are unknown. We gathered fifty research abstracts from five high-impact-factor medical journals and asked ChatGPT to generate research abstracts based on their titles and journals. Most generated abstracts were detected using an AI output detector, ‘GPT-2 Output Detector’, with a median [interquartile range] ‘% fake’ score (higher meaning more likely to be generated) of 99.98% [12.73%, 99.98%], compared with a median of 0.02% [0.02%, 0.09%] for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored lower than original abstracts when run through a plagiarism detector website and iThenticate (higher scores meaning more matching text found). When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, though the abstracts they suspected were generated were vaguer and more formulaic. ChatGPT writes believable scientific abstracts, though with completely generated data. Depending on publisher-specific guidelines, AI output detectors may serve as an editorial tool to help maintain scientific standards. The boundaries of ethical and acceptable use of large language models to assist scientific writing are still being discussed, and different journals and conferences are adopting varying policies.
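
The summary statistics quoted in the abstract (median [interquartile range] and AUROC) can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names `auroc` and `median_iqr` are hypothetical, the scores below are made-up toy values rather than study data, and the quartile convention (`method="inclusive"`) is an assumption, since the abstract does not say which one was used. AUROC is computed here as the probability that a randomly chosen generated abstract receives a higher detector score than a randomly chosen original abstract, with ties counted as one half.

```python
from statistics import quantiles


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive
    scores higher; tied pairs count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def median_iqr(scores):
    """Return (median, (first quartile, third quartile))."""
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    return q2, (q1, q3)


# Toy, made-up '% fake' detector scores; NOT the study's data.
generated = [99.9, 99.5, 12.0, 98.0, 45.0]
original = [0.02, 0.05, 0.10, 30.0, 0.03]

print("AUROC:", auroc(generated, original))
print("Generated median, IQR:", median_iqr(generated))
print("Original median, IQR:", median_iqr(original))
```

An AUROC of 1.0 would mean the detector ranks every generated abstract above every original one, while 0.5 would be no better than chance. In this toy data a single original abstract with a high score keeps the value below 1.0.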

Details

Title

Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers

Author

Gao, Catherine A.¹; Howard, Frederick M.²; Markov, Nikolay S.¹; Dyer, Emma C.²; Ramesh, Siddhi²; Luo, Yuan³; Pearson, Alexander T.²

¹ Northwestern University Feinberg School of Medicine, Division of Pulmonary and Critical Care, Department of Medicine, Chicago, USA (GRID:grid.16753.36) (ISNI:0000 0001 2299 3507)
² University of Chicago, Section of Hematology/Oncology, Department of Medicine, Chicago, USA (GRID:grid.170205.1) (ISNI:0000 0004 1936 7822)
³ Northwestern University Feinberg School of Medicine, Division of Health and Biomedical Informatics, Department of Preventive Medicine, Chicago, USA (GRID:grid.16753.36) (ISNI:0000 0001 2299 3507)

Publication year

2023

Publication date

Dec 2023

Publisher

Nature Publishing Group

e-ISSN

2398-6352

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/s41746-023-00819-6

ProQuest document ID

2806315997

© The Author(s) 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
