Introduction
Chat Generative Pre-trained Transformer (ChatGPT) is an artificial intelligence (AI) program, released in November 2022, that generates text from written prompts. The program has gained immense popularity owing to its web-based accessibility via OpenAI (OpenAI OpCo, LLC, San Francisco, CA) [1]. ChatGPT is a natural language processing model with 175 billion parameters that uses deep learning algorithms to generate human-like responses. As a versatile conversational agent, it can handle a variety of topics, making it useful for customer service, chatbots, and other applications [2]. ChatGPT offers next-generation models that better combine clinical knowledge with conversational interaction, and its narrative interface enables innovative applications such as patient simulation, brainstorming, or student companionship [2]. Despite not having been trained specifically for medical tasks, ChatGPT passes, or comes close to passing, the United States Medical Licensing Examination (USMLE), which demonstrates its potential usefulness in healthcare. ChatGPT also has a wide variety of uses in dentistry, including dental telemedicine services, supporting dentists' clinical decisions, contributing to the education of dental students, and assisting in writing scientific articles, scientific evaluations, and patient information materials [3]. Tailored and refined applications based on large language models have the potential to enhance dental telemedicine services [4]. In the near future, many language model programs are expected to be able to efficiently collect patient data, analyze symptoms, and generate preliminary diagnoses that can subsequently be reviewed by a human dentist. Large language models may prove especially helpful for dental telemedicine in underserved areas with limited access to dental care.
The advent of the internet and the subsequent development of search engines equipped with large language models have significantly influenced the accessibility of health-related patient information. Using ChatGPT, patients can find answers to their medical queries without browsing numerous websites. Alongside these advantages, however, ChatGPT has serious disadvantages. Because ChatGPT delivers its answers with high confidence, it is difficult for users to verify the accuracy and completeness of the information it provides [3]. For users to trust this AI program, ChatGPT must perform comparably to humans in assessing medical knowledge and exercising judgment [5]. ChatGPT is not suitable for supporting clinical decisions because of the potential for misinformation or insufficient information and because it was not originally designed to provide medical guidance [6]. Therefore, the aim of this study is to evaluate the accuracy and completeness of the answers given by ChatGPT to the most frequently asked questions on different topics in the field of periodontology.
Materials and methods
Study design and data source
The study group consisted of dental professionals specializing in periodontology in Turkey. Participation was voluntary, and consent was obtained from all individuals who agreed to take part. In addition, all periodontology specialists who performed the evaluation were required to have four to six years of professional experience in periodontology. This standardized the perspectives of the experts included in the study, as they had similarly up-to-date knowledge of the field. A systematic process was used to calibrate the 20 selected periodontists for evaluating the responses. Before the evaluation began, it was verified that all 20 periodontists were familiar with the evaluation criteria and instructions, and training was provided to those who were not. A subject unrelated to the study topics, gingival enlargements, was selected for a training phase, during which participants assessed five different responses to questions on that subject. The findings were then discussed together to detect any inconsistencies or disparities in the evaluations. The evaluators were subsequently instructed to assess the responses to the study questions independently, re-reading them as often as they wished over a period of one month.
The study was performed at Necmettin Erbakan University, Faculty of Dentistry, Konya, Turkey, and was approved by the Necmettin Erbakan University Faculty of Dentistry Scientific Research Ethics Committee (ref. 2023/294). The study was carried out in June 2023 using ChatGPT and was completed once evaluation forms had been filled out by 20 periodontists. No actual patient information was used in this study.
Seven main topics on which patients may ask basic questions about periodontology were determined by periodontists. The ChatGPT program was then asked to create the 10 questions most frequently asked by patients about each of these seven main topics: periodontal diseases, peri-implant diseases, tooth sensitivity, gingival recessions, halitosis, dental implants, and periodontal surgery (Figure 1).
Figure 1
The questions asked on seven different topics within the field of periodontology.
Each of the 70 questions created by ChatGPT was then posed back to ChatGPT in a newly opened chat window, using the same current version of the program. All generated questions and their answers were recorded (Appendix A). Only the English language was used throughout. Experienced periodontists were contacted via email to evaluate the answers given by ChatGPT, and their covariates were recorded, including gender, age, professional experience, and previous experience with artificial intelligence.
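The collection step described above was carried out manually through the public web interface. As a purely illustrative aid, the sketch below shows how a comparable workflow could be scripted in Python with the OpenAI API; the model name, prompt wording, and list parsing are assumptions made for illustration and are not part of the study protocol.

```python
# Illustrative sketch only: the study used the public ChatGPT web interface,
# not the API. The model name, prompts, and parsing below are assumptions.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

TOPICS = [
    "periodontal diseases", "peri-implant diseases", "tooth sensitivity",
    "gingival recessions", "halitosis", "dental implants", "periodontal surgery",
]

def ask(prompt: str) -> str:
    """Send one prompt in a fresh, context-free exchange (like a new chat window)."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

records = []
for topic in TOPICS:
    # Step 1: ask for the 10 questions patients most frequently ask about the topic.
    listing = ask(f"List the 10 questions patients most frequently ask about {topic}.")
    questions = [line.strip() for line in listing.splitlines() if line.strip()][:10]
    # Step 2: answer each question in its own exchange, with no shared context.
    for question in questions:
        records.append({"topic": topic, "question": question, "answer": ask(question)})
```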
Evaluation of the responses to each question
Accuracy and completeness were assessed with separate Likert scales for each question [7]. Before the evaluation, the experts were instructed to rate the completeness of each response before rating its accuracy.
The Accuracy Likert Scale
1: completely incorrect; 2: more incorrect than correct; 3: approximately equal correct and incorrect; 4: more correct than incorrect; 5: nearly all correct; 6: correct.
The Completeness Likert Scale
1: incomplete; addresses some aspects of the question, but significant parts are missing or incomplete; 2: adequate, addresses all aspects of the question, and provides the minimum amount of information required to be considered complete; 3: comprehensive, addresses all aspects of the question, and provides additional information or context beyond what was expected.
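To make the two scales concrete, the short sketch below shows how the 20 evaluators' ratings for a single answer could be aggregated into the median, mode, mean, and standard deviation reported later in Table 3. The rating values are hypothetical and serve only to illustrate the calculation.

```python
# Hypothetical ratings from 20 evaluators for one ChatGPT answer:
# accuracy on the 1-6 scale, completeness on the 1-3 scale.
import statistics

accuracy = [6, 6, 5, 6, 5, 6, 6, 4, 6, 5, 6, 6, 5, 6, 6, 5, 6, 6, 6, 5]
completeness = [2, 3, 2, 2, 2, 3, 2, 2, 1, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2]

def summarize(scores):
    """Return the descriptive statistics used in Table 3 for one set of ratings."""
    return {
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "mean": round(statistics.mean(scores), 2),
        "sd": round(statistics.stdev(scores), 2),
    }

print("accuracy:", summarize(accuracy))
print("completeness:", summarize(completeness))
```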
Statistical analysis
The data were analyzed using IBM SPSS software version 23.0 (IBM Corp., Armonk, NY). The sample size was calculated with a 90% confidence level and a margin of error of 0.05, resulting in a total of 20 participants. Normality of the data was examined with the Kolmogorov-Smirnov test. The Mann-Whitney U test and the Kruskal-Wallis H test were used to analyze quantitative variables, and the chi-square test was used to analyze qualitative variables. The relationship between quantitative variables was evaluated by logistic regression analysis. Spearman's rho correlation coefficients were used to examine the relationship between accuracy and completeness scores. A P-value <0.05 was considered statistically significant.
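Although the analyses were performed in SPSS, an equivalent workflow can be expressed in Python with SciPy. The sketch below is a minimal illustration of the tests named above; the score arrays are simulated stand-ins for the real ratings (10 questions × 20 evaluators per topic), and the logistic regression step is omitted.

```python
# Minimal illustration of the reported tests using SciPy; the data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in accuracy scores (1-6) for three topics, 200 ratings each.
halitosis = rng.integers(4, 7, size=200)
recession = rng.integers(3, 7, size=200)
implants = rng.integers(3, 7, size=200)

# Normality check (Kolmogorov-Smirnov against a fitted normal distribution).
print(stats.kstest(halitosis, "norm", args=(halitosis.mean(), halitosis.std(ddof=1))))

# Pairwise comparison of two topics (Mann-Whitney U test).
print(stats.mannwhitneyu(halitosis, recession))

# Omnibus comparison across topics (Kruskal-Wallis H test).
print(stats.kruskal(halitosis, recession, implants))

# Association between qualitative variables (chi-square test),
# e.g. gender vs. prior AI experience as laid out in Table 1.
print(stats.chi2_contingency(np.array([[1, 11], [3, 5]])))

# Correlation between accuracy and completeness scores (Spearman's rho).
completeness = rng.integers(1, 4, size=200)
print(stats.spearmanr(halitosis, completeness))
```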
Results
A total of 70 questions were created by ChatGPT, 10 for each of the seven main periodontology subjects, and the answers to each are presented in Appendix A. The gender, age, and AI usage experience of the 20 periodontology specialists participating in the study are given in Table 1.
Table 1
Demographic data of experts in the field of periodontology evaluating the answers given by ChatGPT
SD: standard deviation
Gender | Age (years), mean | Age (years), SD | P-value | AI experience, yes | AI experience, no | P-value
Female (N=12) | 31.83 | 3.88 | 0.564 | 1 | 11 | 0.255
Male (N=8) | 32.75 | 3.05 | | 3 | 5 |
There was no statistically significant difference between genders in terms of age or artificial intelligence experience.
The distribution of accuracy and completeness scores on the Likert scale of the answers given to the questions according to topic headings is presented in Table 2.
Table 2
Distribution of accuracy and completeness scores for subjects
Data are represented as n (%).
Topic | Accuracy: completely incorrect (1) | Accuracy: more incorrect than correct (2) | Accuracy: approximately equal correct and incorrect (3) | Accuracy: more correct than incorrect (4) | Accuracy: nearly all correct (5) | Accuracy: correct (6) | Completeness: incomplete (1) | Completeness: adequate (2) | Completeness: comprehensive (3)
Periodontal disease | 0 (0) | 0 (0) | 0 (0) | 5 (2.5) | 74 (37) | 121 (60.5) | 13 (6.5) | 114 (57) | 73 (36.5)
Peri-implant disease | 0 (0) | 0 (0) | 0 (0) | 10 (5) | 48 (24) | 142 (71) | 23 (11.5) | 94 (47) | 83 (41.5)
Tooth sensitivity | 0 (0) | 8 (4) | 0 (0) | 12 (6) | 61 (30.5) | 119 (59.5) | 24 (12) | 80 (40) | 96 (48)
Gingival recession | 0 (0) | 7 (3.5) | 5 (2.5) | 18 (9) | 86 (43) | 84 (42) | 30 (15) | 96 (48) | 74 (37)
Halitosis | 0 (0) | 0 (0) | 4 (2) | 4 (2) | 47 (23.5) | 145 (72.5) | 16 (8) | 95 (47.5) | 89 (44.5)
Dental implants | 0 (0) | 4 (2) | 0 (0) | 14 (7) | 72 (36) | 110 (55) | 22 (11) | 114 (57) | 64 (32)
Periodontal surgery | 0 (0) | 0 (0) | 0 (0) | 8 (4) | 66 (33) | 126 (63) | 14 (7) | 114 (57) | 72 (36)
Each answer to the periodontology questions that patients might ask was evaluated for accuracy and completeness. Table 3 presents the statistical measures of the scores obtained for the seven subjects, including median, mode, and mean values.
Table 3
Accuracy and completeness scores for AI-generated answers to questions on key topics in periodontology
SD: standard deviation
Topic | Statistic | Accuracy | Completeness | P-value
Periodontal disease | Median (min.-max.) | 6 (4-6) | 2 (1-3) | <0.001
Periodontal disease | Mode | 6 | 2 |
Periodontal disease | Mean ± SD | 5.59 ± 0.19 | 2.31 ± 0.20 |
Peri-implant disease | Median (min.-max.) | 6 (4-6) | 2 (1-3) | <0.001
Peri-implant disease | Mode | 6 | 2 |
Peri-implant disease | Mean ± SD | 5.66 ± 0.23 | 2.29 ± 0.36 |
Tooth sensitivity | Median (min.-max.) | 6 (2-6) | 2 (1-3) | <0.001
Tooth sensitivity | Mode | 6 | 3 |
Tooth sensitivity | Mean ± SD | 5.41 ± 0.31 | 2.36 ± 0.23 |
Gingival recession | Median (min.-max.) | 5 (2-6) | 2 (1-3) | <0.001
Gingival recession | Mode | 5 | 2 |
Gingival recession | Mean ± SD | 5.21 ± 0.43 | 2.61 ± 0.25 |
Halitosis | Median (min.-max.) | 6 (3-6) | 2 (1-3) | <0.001
Halitosis | Mode | 6 | 2 |
Halitosis | Mean ± SD | 5.67 ± 0.31 | 2.37 ± 0.34 |
Dental implants | Median (min.-max.) | 6 (2-6) | 2 (1-3) | <0.001
Dental implants | Mode | 6 | 2 |
Dental implants | Mean ± SD | 5.42 ± 0.07 | 2.21 ± 0.23 |
Periodontal surgery | Median (min.-max.) | 6 (4-6) | 2 (1-3) | <0.001
Periodontal surgery | Mode | 6 | 2 |
Periodontal surgery | Mean ± SD | 5.59 ± 0.31 | 2.29 ± 0.24 |
Among the seven main topics we chose in periodontology, halitosis had the highest accuracy score (5.67±0.31), while the lowest accuracy was found for gingival recessions (5.21±0.43). When the completeness scores were evaluated among the subjects, the highest score was in gingival recession (2.61±0.25), while the lowest score was in dental implants (2.21±0.23).
Differences in accuracy and completeness across subjects were evaluated with the Kruskal-Wallis test, and pairwise differences between subjects were assessed separately with the Mann-Whitney U test.
A statistically significant difference was observed in both accuracy and completeness scores in the between-subject comparison (P<0.05). The accuracy of the answers on the topic of gingival recession was statistically lower than for all other topics except "tooth sensitivity" and "dental implants" (P<0.05); there was no statistically significant difference in accuracy between "gingival recession," "tooth sensitivity," and "dental implants" (P>0.05). The completeness scores of the responses on gingival recession were significantly higher than for all other subjects except "tooth sensitivity" and "halitosis" (P<0.05); there was no statistically significant difference in completeness between "gingival recession," "tooth sensitivity," and "halitosis" (P>0.05).
There was a moderate correlation between accuracy and completeness scores across all subjects (Spearman's r=0.46, 95% CI 0.3 to 0.5, P<0.01).
Discussion
Given the public accessibility of ChatGPT, patients may often turn first to large language models with questions about their medical condition. Our study focused on the questions most frequently asked by patients and the quality of the answers given by artificial intelligence. To the best of our knowledge, this is the first study in the field of periodontology to evaluate the information provided by artificial intelligence chatbots; consequently, no previous study is available with which to compare our numerical results.
With the development of technology, the internet has become the first place patients turn to when they are curious about a subject, and various studies have examined the accuracy and quality of information provided by internet search engines and YouTube videos [8]. A recent study by Beaumont et al. investigated framing in online patient information for people newly diagnosed with periodontitis [9]. Their work was a cross-sectional linguistic analysis of websites returned by a Google search for the term 'gum disease'. The results showed that patient fear induced by negative framing can increase the desire for treatment and self-care; however, the authors also emphasized the risks of such patient information, as negative framing could further frighten and demotivate patients. Bizzi et al. evaluated the quality of information available on the web about gum disease according to the Journal of the American Medical Association criteria [10]. They showed that Google searches can surface sites with high-quality information suitable for patient education [10].
Numerous studies have evaluated the performance of ChatGPT in the field of health, especially in medical education [11, 12]. Measuring AI medical knowledge against that of specialist clinicians is a critical step in assessing these capabilities. In a study by Kung et al., the performance of ChatGPT, a language-based AI, on the United States Medical Licensing Examination (USMLE) was evaluated [13]. ChatGPT was found to perform at or near the 60% accuracy threshold required for obtaining a medical license in the United States [13, 14]. The fact that an AI program reaches this criterion without any special training shows how advanced AI development has become. Huh compared the interpretation and knowledge of ChatGPT with those of medical students in Korea [5] and reported that ChatGPT's ability to understand and interpret parasitology exam questions is not yet at the level of Korean medical students. On the other hand, Gilson et al. investigated ChatGPT's performance on the USMLE [15] and reported that the model scored over 60% on the National Board of Medical Examiners (NBME)-Free-Step-1 dataset, equivalent to a passing score for a third-year medical student [15]. Sabry et al. analyzed ChatGPT using a clinical toxicology case of acute organophosphate poisoning and reported that ChatGPT was able to answer all questions regarding the case presented [16]. Yeo et al. examined the accuracy and reproducibility of ChatGPT when answering questions about information, disease management, and emotional support for cirrhosis and hepatocellular carcinoma [17]. ChatGPT's answers to 164 questions were evaluated by two transplant hepatologists and one reviewer, and the results suggested that ChatGPT may have a role as an additional information tool for patients and physicians to improve outcomes [17].
Although numerous studies have demonstrated that ChatGPT is less successful in providing technical responses, it is effective in answering questions intended to inform patients [7]. Very few published studies corroborate the findings of our research [18].
In a study by Johnson et al., ChatGPT's answers to 284 medical questions across 17 specialties were evaluated by experts using two separate Likert scales for accuracy and completeness [7]. The mean accuracy score for all questions (n=284) was 5.5, between nearly all correct and correct, and the median completeness score was three, corresponding to comprehensive [7].
In a study conducted by Balel, questions about oral and maxillofacial surgery, created by the author himself, were posed to ChatGPT, and the answers were evaluated by oral, dental, and maxillofacial surgery specialists [18]. The experts found ChatGPT's answers to patient-focused questions to be accurate and helpful [18], and no statistical difference was found between topics in oral and maxillofacial surgery. Similarly, in our study, although the answers given by ChatGPT were "nearly all correct" in terms of accuracy, statistical differences were found between subjects. The topic of gingival recession had the highest completeness score despite having the lowest accuracy score, which suggests that the more information ChatGPT provides, the greater the risk of it including false information.
It is also worth drawing attention to some points regarding the questions in our study. The answers to some questions on specific topics, such as dental implants and peri-implant diseases, are still controversial or unknown in the literature. Five examples of such questions are: 1. How long do dental implants last? 2. What is the success rate of dental implants? 3. What causes peri-implant diseases? 4. Can peri-implant diseases be prevented? 5. How are peri-implant diseases treated?
These questions are still being investigated, and definitive answers are not yet available. However, they were included in our study because patients could plausibly ask them. When ChatGPT's responses to these questions were examined, ChatGPT itself noted that there were no definitive answers: it stated that how long dental implants can remain in the mouth and their success rate depend on many factors, and it did not provide clear-cut information. Although debate continues in the literature, ChatGPT gave more definitive answers to questions about the causes of peri-implant diseases, how they can be prevented, and how they are treated. Therefore, further research on large language models such as ChatGPT is essential to evaluate the reliability of the information they provide in healthcare.
The present study has several limitations. One is the small number of periodontology professionals who assessed the responses. Furthermore, the scope of the study was limited to queries related exclusively to periodontology, which restricts how broadly our evaluation of ChatGPT can be generalized.
Conclusions
Artificial intelligence technology is developing very rapidly. Even though ChatGPT cannot provide fully accurate and comprehensive findings without expert oversight, it is evident that patients in the field of periodontology can still use it for informational purposes, provided they accept some risk of error. More research is needed on the application of AI chatbots in periodontology, including studies with additional assessors and different question formats.
Copyright © 2023, Babayiğit et al. This work is published under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Abstract
Objectives
The aim of this study is to evaluate the accuracy and completeness of the answers given by Chat Generative Pre-trained Transformer (ChatGPT) (OpenAI OpCo, LLC, San Francisco, CA) to the most frequently asked questions on different topics in the field of periodontology.
Methods
ChatGPT was asked to create the 10 questions most frequently asked by patients about each of seven different topics in periodontology (periodontal diseases, peri-implant diseases, tooth sensitivity, gingival recessions, halitosis, dental implants, and periodontal surgery). To obtain responses, the resulting set of 70 questions, 10 per subject, was submitted to ChatGPT. The documented responses were assessed by professionals specializing in periodontology using two distinct Likert scales: accuracy was rated on a scale from one to six, and completeness on a scale from one to three.
Results
The median accuracy score for all responses was six, while the median completeness score was two. The mean scores for accuracy and completeness were 5.50 ± 0.23 and 2.34 ± 0.24, respectively. ChatGPT's responses to the questions most frequently asked by patients for information purposes in periodontology were at least "nearly all correct" in terms of accuracy and "adequate" in terms of completeness. There was a statistically significant difference between subjects in terms of accuracy and completeness (P<0.05). The highest and lowest accuracy scores were obtained for peri-implant diseases and gingival recession, respectively, while the highest and lowest completeness scores were obtained for gingival recession and dental implants, respectively.
Conclusions
The utilization of large language models has become increasingly prevalent, extending its applicability to patients within the healthcare domain. While ChatGPT may not offer absolute precision and comprehensive results without expert supervision, it is apparent that those within the field of periodontology can utilize it as an informational resource, albeit acknowledging the potential for inaccuracies.