Aims
This study aimed to evaluate the performance of publicly available large language models (LLMs), ChatGPT-4o, ChatGPT-4o Mini and Perplexity AI, in responding to research-related questions at the undergraduate nursing level. The evaluation was conducted across different platforms and prompt structures. The research questions were categorized according to Bloom’s taxonomy to compare the quality of AI-generated responses across cognitive levels. Additionally, the study explored the perspectives of research team members on using AI tools in teaching foundational research concepts to undergraduate nursing students.
Background
Large language models (LLMs) could help nursing students learn foundational research concepts, but their performance in answering research-related questions has not been explored.
Design
An exploratory case study was conducted to evaluate the performance of ChatGPT-4o, ChatGPT-4o Mini and Perplexity AI in answering 41 research-related questions.
Methods
Three different prompts (Prompt-1: Unstructured with no context; Prompt-2: Structured from professor’s perspective; Prompt-3: Structured from student’s perspective) were tested. A content-validated, author-developed 5-point Likert-type scale was used to assess all AI-generated responses across six domains: Accuracy, Relevance, Clarity & Structure, Examples Provided, Critical Thinking and Referencing.
Results
All three AI models generated higher-quality responses when structured prompts were used compared with unstructured prompts and responded well across the different Bloom’s taxonomy levels. ChatGPT-4o and ChatGPT-4o Mini performed better at answering research-related questions than Perplexity AI.
Conclusion
AI models hold promise as supplementary tools for enhancing undergraduate nursing students’ understanding of foundational research concepts. Further studies are warranted to evaluate their impact on specific research-related learning outcomes within nursing education.
1 Introduction
Research literacy is a core competency in nursing education, enabling nurses to effectively implement evidence-based practice (Brunt and Morris, 2025; Hines et al., 2016). Evidence-based practice can improve patient health outcomes (Melnyk et al., 2023), lower rates of medical errors (Sonğur et al., 2018) and is the gold standard for clinical protocols (Duff et al., 2020). Nurses with a strong foundation in research concepts can critically analyze scientific evidence, successfully translate theory into practice and apply evidence-based practices more effectively (Leach et al., 2016; Wakibi et al., 2021). Consequently, a strong research foundation is essential for undergraduate nursing students to deliver high-quality, evidence-based care (Wakibi et al., 2021). Nursing students are generally positive about learning research skills (Li et al., 2024; Ross and Burrell, 2019), but they struggle to acquire them due to challenges such as difficulty understanding abstract research methodologies (Alruwaili et al., 2025), inadequate exposure to research (Corbett et al., 2023; Ünver et al., 2018) and limited individualized mentorship (Alruwaili et al., 2025). Therefore, additional support is needed to enhance research literacy among undergraduate nursing students.
The current generation of nursing students prefers convenient and pragmatic learning approaches together with rapid feedback (Chunta et al., 2021; Williams, 2019). The emergence of artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, launched by OpenAI in 2022, aligns with these evolving learning preferences and presents new opportunities in nursing education. LLMs generate natural-language responses through intuitive conversational interfaces, enhancing human-computer interaction (Dwivedi et al., 2023; OpenAI, 2022). They can support undergraduate nursing education by providing rapid, detailed and personalized feedback that meets students’ learning preferences (J. Li et al., 2024; Montenegro-Rueda et al., 2023; Sallam, 2023). Hence, integrating LLMs as teaching tools could enhance nursing education by promoting efficient and adaptive learning experiences (Gunawan, Aungsuroch, and Montayre, 2024).
Previous research on LLMs in nursing education was heavily focused on examining ChatGPT’s role in supporting different aspects of clinical nursing education: enhancing empathetic communication ( Benfatah et al., 2024), clinical decision-making ( Gunawan, Aungsuroch, and Montayre, 2024; Liu et al., 2023), physical assessment skills ( Chang et al., 2022; Karaçay and Yaşar, 2024) and academic writing ( Gunawan, Aungsuroch, and Montayre, 2024). Nursing students appreciated ChatGPT’s efficiency, speed and ability to reinforce learning by clarifying key concepts ( Gunawan, Aungsuroch, Marzilli, et al., 2024; Karaçay and Yaşar, 2024); benefits that could also improve students’ research learning experiences. Given the lack of studies on LLMs in nursing research education, there is a need to explore the use of LLMs in supporting students’ learning of foundational research concepts.
Moreover, studies have shown that LLM response quality is highly dependent on the specificity and clarity of prompts ( Knoth et al., 2024; P. Liu et al., 2023). Structured prompts with clear context and objectives produce more relevant and accurate responses ( Eager and Brunton, 2023). Thus, evaluating LLM performance with and without structured prompts merits further exploration in nursing education.
2 Aims and hypotheses
This study aimed to evaluate the performance of publicly available LLMs, ChatGPT-4o, ChatGPT-4o Mini and Perplexity AI, in responding to research-related questions commonly encountered at the undergraduate nursing level. The questions were categorized according to Bloom’s taxonomy, a widely recognized framework encompassing six cognitive skill levels ( Momen et al., 2022), allowing for analysis of response quality across these levels. Additionally, the study explored research team members’ perspectives on the potential of AI to support the teaching of foundational research concepts in undergraduate nursing education. The study was guided by the following hypotheses: (1) the quality of AI-generated responses would vary across different LLM platforms; (2) structured question prompts would yield higher-quality responses compared with unstructured prompts; and (3) the quality of responses would differ according to the cognitive levels of Bloom’s taxonomy to which the questions correspond.
This study used three widely accessible, free-to-use LLMs: ChatGPT-4o, ChatGPT-4o Mini (OpenAI) and Perplexity AI. These platforms were selected based on their public availability, popularity among student users (as observed anecdotally) and demonstrated technical stability in producing consistent responses to structured academic prompts (Gunawan, Aungsuroch, and Montayre, 2024; Pan et al., 2023). Restricting the comparison to three platforms enabled a focused analysis of how prompt structure and question complexity—classified according to Bloom’s taxonomy—affected the quality of AI-generated responses, while still allowing for meaningful variation in model design and output characteristics. Although emerging LLMs such as Claude and Gemini were considered, they were excluded because their core functionalities closely mirrored those of ChatGPT at the time of planning. Both ChatGPT-4o and ChatGPT-4o Mini were included to allow direct comparison within the same GPT-4 architecture family; the Mini version is optimized for efficiency while maintaining comparable reasoning and linguistic capabilities (OpenAI, 2022, 2024).
A notable distinction between ChatGPT and Perplexity AI lies in their access to external information and their citation practices. ChatGPT operates without real-time internet access and does not provide citations unless explicitly prompted, whereas Perplexity AI integrates real-time web access and generates citations by default (OpenAI, 2022, 2024; Perplexity Team, 2025). Consequently, Perplexity AI offers higher default model transparency, defined as the extent to which users can discern how and from where information is generated (Fehr et al., 2022), compared with ChatGPT-4o and its Mini variant, which do not disclose their training data sources.
3 Methods
3.1 Study design
This study employed an exploratory case study design to assess the performance of three AI models (ChatGPT-4o, ChatGPT-4o Mini and Perplexity AI) in answering foundational research questions. The study was reported according to the transparent reporting of a multivariable model for individual prognosis or diagnosis–large language models (TRIPOD-LLM) guidelines (Supplementary File 4) (Gallifant et al., 2025).
3.2 Generating test research questions
A total of 41 research questions spanning seven key research topics were developed to evaluate the performance of ChatGPT-4o, ChatGPT-4o Mini and Perplexity AI in supporting undergraduate nursing students. These topics included introduction to quantitative nursing research, qualitative nursing research, primary versus secondary research, randomized controlled trials, quasi-experimental studies and observational studies. The questions were designed to assess AI models' ability to support student learning across all six levels of Bloom’s taxonomy ( Armstrong, 2010): remember (n = 6), understand (n = 8), apply (n = 7), analyze (n = 5), evaluate (n = 9) and create (n = 6) ( Table S1, Supplementary File 1). Bloom’s taxonomy, a widely adopted educational framework since 1956, provides a structured approach to developing students’ cognitive skills ( Armstrong, 2010; Momen et al., 2022). In this study, research questions were designed based on Bloom’s taxonomy to evaluate AI-generated responses in relation to their potential to support and enhance student learning.
The 41 final questions, selected from an original list of 46, covered essential research concepts for beginners ( Table S1, Supplementary File 1). They were developed by two experienced nursing educators—one with over 15 years of experience (S.S.) and the other with three years (J.Y.X.C.)—who independently generated the questions before holding multiple discussions among the research team to finalize the list.
3.3 Outcome measure
An author-developed scale was used by the research team to holistically evaluate the quality of responses generated by each AI model on a 5-point Likert-type scale (1–5), with 1 indicating low quality and 5 indicating high quality. The scale was developed based on previous literature (Teckwani et al., 2024) and content validated by nine experienced nursing and medicine educators, yielding a content validity index (CVI) of 0.8. The CVI was determined by obtaining expert ratings on the relevance of each item using a four-point scale (1 = Not relevant, 2 = Somewhat relevant, 3 = Quite relevant, 4 = Highly relevant). The Item-Level Content Validity Index (I-CVI) was calculated by dividing the number of experts who rated an item as 3 or 4 by the total number of experts. The Scale-Level Content Validity Index (S-CVI) was then derived by averaging the I-CVIs of all items. The scale included six evaluation domains: Accuracy, Relevance, Clarity & Structure, Examples Provided, Critical Thinking and Referencing.
The Accuracy domain evaluated the correctness and precision of responses, while the Relevance domain assessed contextual appropriateness and comprehensiveness. Clarity and Structure examined the organization and readability of responses and Examples Provided assessed the relevance and usefulness of any examples included. Critical Thinking evaluated the accurate interpretation of research concepts and Referencing involved manual verification of the specificity and quality of cited sources. Further scoring details are available in Table S2, Supplementary File 1.
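For illustration, the I-CVI and S-CVI calculation described above can be expressed as a short script; the expert ratings shown are hypothetical placeholders rather than the actual ratings collected for this scale.

```python
# Hypothetical expert relevance ratings (1-4) for three of the six scale items,
# used only to illustrate the I-CVI and S-CVI arithmetic described above.
ratings = {
    "Accuracy":            [4, 4, 3, 4, 3, 4, 4, 3, 4],
    "Relevance":           [3, 4, 4, 4, 3, 3, 4, 4, 2],
    "Clarity & Structure": [4, 3, 4, 3, 4, 4, 3, 2, 4],
}

def i_cvi(item_ratings):
    """I-CVI: proportion of experts rating the item 3 ('Quite relevant') or 4 ('Highly relevant')."""
    return sum(r >= 3 for r in item_ratings) / len(item_ratings)

i_cvis = {item: i_cvi(r) for item, r in ratings.items()}
s_cvi_ave = sum(i_cvis.values()) / len(i_cvis)  # S-CVI (average approach): mean of all I-CVIs

for item, value in i_cvis.items():
    print(f"I-CVI for {item}: {value:.2f}")
print(f"S-CVI/Ave: {s_cvi_ave:.2f}")
```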
3.4 Design of question prompts
To assess the impact of prompt structures on AI performance, three variations were tested across all 41 questions. The prompts, developed by three researchers experienced in prompt engineering for nursing research education, were finalized through discussions to ensure they were appropriate for undergraduate students while providing enough context for the AI models. The first prompt (Prompt-1) included only the research question, without additional context or guidance. In the second prompt (Prompt-2), AI models were instructed to answer as an esteemed nursing professor with over a decade of experience teaching undergraduate nursing students. In the third prompt (Prompt-3), the user was specified to be an undergraduate nursing student learning research for the first time, and the AI models were instructed to tailor their responses to this user. For Prompt-2 and Prompt-3, the phrase “Please provide references” was included only for ChatGPT-4o and ChatGPT-4o Mini to evaluate the citation quality of the references provided; Perplexity AI, which automatically includes references, did not require this instruction. The phrase was not added to Prompt-1 for ChatGPT-4o and ChatGPT-4o Mini, so that the models’ default responses (without any references) could be evaluated.
Prompt-2 and Prompt-3 included contextual information to evaluate whether AI models could adapt their responses based on the user’s role. These prompts evaluated how AI-generated responses varied according to the professor’s or student’s perspective, while targeting the same core learning outcomes. The exact phrasing of the prompts is presented in Table 1.
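To make the prompt logic above concrete, the sketch below assembles the three prompt variants for a given question. The prefix wording is abridged from Table 1; the helper function and its arguments are illustrative assumptions, not part of the study workflow.

```python
# Sketch of how the three prompt variants could be assembled for one research question.
# Prefix wording is abridged from Table 1; the builder itself is illustrative only.
PROMPT_2_PREFIX = (
    "You are an esteemed professor in nursing research with more than a decade of "
    "experience in teaching new undergraduate students about nursing research. [...] "
    "Answer the following question in a manner that these student nurses can comprehend: "
)
PROMPT_3_PREFIX = (
    "I am an undergraduate university nursing student who is learning about nursing "
    "research for the first time. Please answer the following question in a manner "
    "that I can comprehend: "
)
REFERENCE_REQUEST = "\n\nPlease provide references"  # added for ChatGPT-4o / 4o Mini only

def build_prompt(question: str, variant: int, model: str) -> str:
    """Return the Prompt-1, Prompt-2 or Prompt-3 text for a given question and model."""
    if variant == 1:                       # Prompt-1: the question alone, no context
        return question
    prefix = PROMPT_2_PREFIX if variant == 2 else PROMPT_3_PREFIX
    prompt = prefix + question
    if model.startswith("ChatGPT"):        # Perplexity AI cites by default, so no request needed
        prompt += REFERENCE_REQUEST
    return prompt

print(build_prompt("What is nursing research?", 2, "ChatGPT-4o"))
```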
3.5 AI models’ performance evaluation
Two researchers (J.Q.X.N. and J.Y.X.C.) independently generated the outputs of the three AI models for the finalized set of 41 questions, each of which was tested under the three prompt conditions. The LLMs were accessed from 10 January to 14 February 2025, yielding six sets of answers in total (two sets per prompt). The specific details regarding the LLMs’ interface and chat settings are reported in
Table S3, Supplementary File 1. The same researchers (J.Q.X.N. and J.Y.X.C.) then independently rated the quality of all six answer sets (two per prompt) based on the six domains of the author-developed evaluation scale. The two researchers then held discussions to resolve any scoring discrepancies across all six response sets. When disagreements could not be reconciled, a third researcher (S.S.) was consulted. Inter-rater reliability for the six answer sets showed near-perfect agreement, with percentage agreement and Cohen’s kappa between the two researchers ranging from 85 % to 90 % and from 0.85 to 0.91, respectively. The mean inter-rater reliability statistics for the responses generated by each model per prompt are presented in Table 2.
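As an illustration of the agreement statistics reported above, percentage agreement and Cohen’s kappa for one answer set could be computed as follows; the two score vectors are hypothetical, an unweighted kappa is assumed, and the study may have used a different statistical package.

```python
# Illustrative inter-rater reliability calculation for one set of domain ratings.
# The two raters' scores below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

rater_1 = [5, 4, 4, 3, 5, 2, 4, 4, 5, 3, 4, 5]   # e.g. scores assigned by J.Q.X.N.
rater_2 = [5, 4, 5, 3, 5, 2, 4, 4, 5, 3, 4, 5]   # e.g. scores assigned by J.Y.X.C.

pct_agreement = 100 * sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
kappa = cohen_kappa_score(rater_1, rater_2)      # unweighted Cohen's kappa
print(f"% agreement = {pct_agreement:.0f}, kappa = {kappa:.2f}")
```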
3.5.1 Consistency of AI models’ performance between different users
As each question under every prompt condition had two independently generated answers, each with corresponding domain scores finalized through extensive discussions, the consistency of AI models’ performance between different users was assessed by comparing the domain scores of the two answer sets for each question. Two levels of inconsistency were defined: minor inconsistency (scores differing by ±1) and major inconsistency (scores differing by ≥ ±2).
After assessing the consistency of answers, the mean scores of both rating sets for all 41 research questions were calculated and used in the subsequent analyses.
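The between-user consistency check and the averaging step described above can be sketched as follows; the data layout and example scores are assumptions made purely for illustration.

```python
# Sketch of the consistency classification between the two independently generated
# answer sets, followed by averaging of the two rating sets for later analyses.
# Layout and scores are illustrative assumptions.
import pandas as pd

scores = pd.DataFrame({
    "question": [1, 1, 2, 2],
    "domain":   ["Accuracy", "Critical Thinking", "Accuracy", "Critical Thinking"],
    "set_1":    [5, 3, 4, 2],   # finalized domain scores for answer set 1
    "set_2":    [5, 4, 4, 4],   # finalized domain scores for answer set 2
})

diff = (scores["set_1"] - scores["set_2"]).abs()
scores["consistency"] = pd.cut(
    diff, bins=[-1, 0, 1, 4],
    labels=["consistent", "minor inconsistency", "major inconsistency"],
)
scores["mean_score"] = scores[["set_1", "set_2"]].mean(axis=1)  # used in subsequent analyses
print(scores)
```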
3.5.2 AI models’ performance across domains
For each prompt, the mean domain scores for each AI model were calculated by summing the scores in each domain and dividing by the total number of questions (n = 41) using Microsoft Excel. This facilitated two comparisons via visual inspection: 1) performance of the three AI models across six evaluation domains within each prompt and 2) performance of each AI model across six domains across the three prompts. Clustered bar graphs were used to visualize domain score trends across the three AI models and prompt conditions.
3.5.3 AI models’ performance across domains according to Bloom’s taxonomy levels
For each prompt and AI model, the domain scores of all 41 research questions were first organized by the six levels of Bloom’s taxonomy: remember (n = 6), understand (n = 8), apply (n = 7), analyze (n = 5), evaluate (n = 9) and create (n = 6). Next, the mean domain scores for each AI model at each level were calculated by summing the scores for the assigned research questions and dividing by the number of questions at that level. This enabled three comparisons: 1) AI models’ performance across the six evaluation domains at each Bloom’s taxonomy level within each prompt, 2) AI models’ performance across the six evaluation domains at each Bloom’s taxonomy level across all prompts and 3) direct comparison of the three AI models’ performance across the six evaluation domains at each Bloom’s taxonomy level. Stacked bar graphs showing the cumulative scores of all AI models were used to illustrate trends in the six domain scores across Bloom’s taxonomy levels and prompt structures.
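In this study the means were computed in Microsoft Excel; the pandas sketch below expresses the same aggregation in scripted form for readers who prefer that workflow. The file name and column names are assumptions.

```python
# Equivalent scripted version of the Excel aggregation described above: mean domain
# scores per AI model and prompt, and per Bloom's taxonomy level.
# 'domain_scores.csv' and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("domain_scores.csv")  # long format: prompt, model, bloom_level, domain, score

# Mean score per domain for each AI model within each prompt (cf. Table 2)
per_prompt = (df.groupby(["prompt", "model", "domain"])["score"]
                .mean().unstack("domain").round(2))

# Mean score per domain for each AI model at each Bloom's taxonomy level (cf. Table 3)
per_level = (df.groupby(["prompt", "model", "bloom_level", "domain"])["score"]
               .mean().unstack("domain").round(2))

print(per_prompt)
print(per_level)
```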
3.6 Qualitative data from researchers
Qualitative insights on using AI models to teach nursing research were gathered from all researchers in this study who had experience in teaching nursing research to undergraduate students. They provided email-based responses to two open-ended questions: 1) What are the benefits and challenges of using AI models to teach undergraduate nursing students about research? and 2) What recommendations do you have for using AI models to support undergraduate nursing students in learning research?
4 Results
4.1 Consistency of AI models’ performance between different users
When Prompt-1 (question only) was used, the two answer sets showed the following minor inconsistency rates (scores differing by ±1): ChatGPT-4o (48.6 %), ChatGPT-4o Mini (34.3 %) and Perplexity AI (21.4 %). Major inconsistency rates (scores differing by ≥ ±2) for Prompt-1 were: ChatGPT-4o (5.7 %), ChatGPT-4o Mini (0.0 %) and Perplexity AI (2.4 %). When Prompt-2 (professor’s perspective) was used, minor inconsistency rates were: ChatGPT-4o (9.5 %), ChatGPT-4o Mini (11.9 %) and Perplexity AI (38.1 %). Major inconsistency rates for Prompt-2 were: ChatGPT-4o (2.4 %), ChatGPT-4o Mini (0.0 %) and Perplexity AI (0.0 %). When Prompt-3 (student’s perspective) was used, minor inconsistency rates were: ChatGPT-4o (26.2 %), ChatGPT-4o Mini (21.4 %) and Perplexity AI (28.6 %). Major inconsistency rates for Prompt-3 were: ChatGPT-4o (2.38 %), ChatGPT-4o Mini (0.0 %) and Perplexity AI (2.38 %). Overall, inconsistency rates were highest when Prompt-1 was used, compared with Prompt-2 and Prompt-3.
The highest minor inconsistency rate between the two researchers' answers across all prompts was observed in the Critical Thinking domain (38.2 %), followed by Clarity & Structure (19.1 %) and Examples Provided (19.1 %). The Relevance and Referencing domains accounted for 16.0 % and 13.8 % of the minor inconsistency rate respectively, while the Accuracy domain had the lowest minor inconsistency rate of 4.3 %. No major inconsistencies were observed in the Accuracy and Relevance domains across all three prompts.
4.2 AI models’ performance across domains within each prompt
The trends of AI models’ performance within each prompt are summarized in Table 2.
Under Prompt-1, Perplexity AI had lower mean scores across all domains—Accuracy (3.93), Relevance (3.58), Clarity & Structure (3.41), Examples Provided (1.73) and Critical Thinking (1.99)—than ChatGPT-4o. ChatGPT-4o Mini outperformed both Perplexity AI and ChatGPT-4o, with the highest mean scores across all domains: Accuracy (4.43), Relevance (4.64), Clarity & Structure (4.77), Examples Provided (4.36) and Critical Thinking (3.33). The mean Referencing score for Perplexity AI (2.43) could not be compared with those of ChatGPT-4o and ChatGPT-4o Mini because neither ChatGPT model generated citations under the first prompt.
Under Prompt-2, Perplexity AI had lower mean scores across all domains—Accuracy (4.25), Relevance (3.99), Clarity & Structure (3.99), Examples Provided (2.20), Critical Thinking (3.17) and Referencing (2.76)—than ChatGPT-4o. ChatGPT-4o Mini also outperformed Perplexity AI and had mean scores comparable to ChatGPT-4o across all domains.
Under Prompt-3, Perplexity AI had lower mean scores across all domains—Accuracy (4.24), Relevance (4.08), Clarity & Structure (4.21), Examples Provided (1.73), Critical Thinking (2.71) and Referencing (2.41)—than ChatGPT-4o. Similarly, ChatGPT-4o Mini outperformed Perplexity AI and had mean domain scores similar to those of ChatGPT-4o.
Appendix A (Supplementary File 2) shows all three AI models’ Prompt-1 responses to the question “What are the differences between primary and secondary research in nursing?” with the assigned domain scores and justifications for the assigned scores.
4.3 AI models’ performance across domains and different prompts
The trends for each model in all domains across Prompts 1–3 are shown in Table 2.
The mean scores for the Clarity & Structure domain either improved or remained constant from Prompt-1 to Prompt-2 across all AI models: ChatGPT-4o (3.46 to 4.92), ChatGPT-4o Mini (remained at 4.77) and Perplexity (3.41 to 3.99). However, the mean scores for the Clarity & Structure domain remained relatively similar between Prompt-2 and Prompt-3 for all models. Moreover, ChatGPT-4o showed notable improvement in the Examples Provided domain from Prompt-1 (3.92) to Prompt-2 (4.57) and Prompt-3 (4.54). Additionally, Perplexity AI and ChatGPT-4o generated more accurate and relevant responses in Prompt-2 and Prompt-3 than in Prompt-1. For Perplexity AI, the mean scores of Accuracy and Relevance rose from below 4 in Prompt-1 to around or above 4 in Prompt-2 and Prompt-3. For ChatGPT-4o, the mean Accuracy scores increased from 4.27 in Prompt-1 to > 4.5 in Prompt-2 and Prompt-3, while the mean Relevance score increased from 4.12 to 4.63 and 4.40 across Prompts 1–3.
Moreover, all three AI models performed poorly in the Referencing domain in all prompts, with mean scores consistently below 3.50. Issues identified included invalid website links, unreliable sources (e.g. forums and blogs) and non-specific citations (single references applied to large sections of text). Furthermore, all AI models performed poorly in the Critical Thinking domain across all prompts, with most mean scores below 3.50. This was because most AI-generated responses mainly reiterated facts instead of critically interpreting the information. Appendix B (Supplementary File 2) presents ChatGPT-4o’s responses to the question “What is nursing research?” across all prompts, with the assigned domain scores and justifications for the assigned scores.
4.4 AI models’ performance across domains according to Bloom’s taxonomy levels
All AI models generated adequate responses to research questions corresponding to all six Bloom’s taxonomy levels; the models generally achieved mean scores ≥ 3.5 in the Accuracy and Relevance domains across taxonomy levels in all prompts (Table 3).
4.5 Qualitative data from researchers
All researchers highlighted that AI models can provide convenient, timely and easy-to-understand information, making them “appealing” for nursing students to use while learning research. Researchers 2 and 4 felt that AI models can answer questions “much faster than manually searching through Google” and make learning “less tedious” compared with traditional resources (e.g. textbooks and journal articles). However, the researchers cautioned that students must verify AI-generated information against their official course materials. Researcher 3 warned that AI models can disrupt students’ learning by providing misinformation or lacking nuance in their responses. Researchers 1 and 4 added that students may struggle to formulate relevant questions, which could hinder their learning since AI responses depend on user input, creating a “big learning gap”.
All researchers recommended using AI models to “supplement” nursing lectures and research tutorials. Researcher 1 emphasized that nursing educators should guide students to maximize AI effectiveness, for example by providing a list of recommended questions and prompts (as suggested by Researchers 2 and 3). Researchers 1 and 4 proposed conducting prompt engineering lessons to help students refine AI prompts to meet their needs, while Researchers 3 and 4 recommended faculty training to effectively integrate AI into research education. The qualitative responses are presented in Table 4.
5 Discussion
This exploratory case study evaluated and compared the performance of ChatGPT-4o, ChatGPT-4o Mini and Perplexity AI in answering questions on foundational research concepts for undergraduate nursing students using three different prompts (Prompt-1: unstructured with no context; Prompt-2: structured from a professor’s perspective; Prompt-3: structured from a student’s perspective). Feedback was also gathered from the research team on the use of AI models to answer research questions for undergraduate nursing students.
This study found that AI models performed better with structured prompts, as responses to Prompt-2 and Prompt-3 were clearer, more accurate, more relevant and better structured than those to Prompt-1. Additionally, AI models showed greater consistency across users with structured prompts. The relatively lower between-user inconsistency rates in Accuracy and Relevance suggest that AI-generated responses can offer consistent research content and support undergraduate nursing students’ learning. Conversely, the inconsistencies in the other four non-content-specific domains (Clarity & Structure, Examples Provided, Critical Thinking and Referencing) could result from natural variation in AI-generated responses rather than inaccurate information (Zewe, 2023). To mitigate these inconsistencies, students could engage in iterative AI interactions, requesting paraphrased explanations for clarity and asking for more examples and critical interpretations, while verifying references independently without the use of AI. Since structured prompts improve AI-generated response quality, which could in turn improve students’ learning outcomes, future research in nursing education could explore prompt development and evaluation.
Moreover, prompt engineering is needed to optimize AI-generated responses, and structured frameworks such as the six-component framework by Eager and Brunton (2023) can help develop more effective prompts (P. Liu et al., 2023). Previous research also highlighted the importance of higher-education educators, including nursing educators in academic institutions, developing AI literacy and prompt engineering skills to align AI-generated content with course-specific learning objectives (Chiu et al., 2023; Kohnke et al., 2023; Simms, 2025). Mastering these skills could help nursing educators leverage AI to enhance students’ understanding, theory application and problem-solving skills (Eager and Brunton, 2023). Consistent with this study’s qualitative findings, prior research underscores the importance of embedding foundational training in AI literacy and prompt engineering in undergraduate nursing curricula (Knoth et al., 2024; Simms, 2025). Such training could help nursing students learn the principles of AI-generated content, including how to craft effective prompts and evaluate the relevance, accuracy and limitations of AI responses (Knoth et al., 2024; Simms, 2025).
Current findings indicate that AI models frequently generate non-specific or non-credible references irrespective of prompt structure, raising concerns about misinformation hindering nursing students’ learning, as noted in previous research (Colasacco and Born, 2024; Simms, 2025). Students’ ability to identify credible sources may be compromised if AI-generated responses lack high-quality references, limiting their exposure to models of reliable information. Given that literature searching is a critical component of the research process (Grewal et al., 2016), the limited referencing capabilities of some AI tools present a potential concern for their use in nursing research education. However, previous studies suggested that inaccuracies in AI-generated responses might paradoxically enhance students’ learning by encouraging them to verify the responses and seek clarification from educators (Gunawan, Aungsuroch, Marzilli, et al., 2024; Karaçay and Yaşar, 2024). Since counter-checking AI-generated content with credible sources promotes critical thinking, information retention and research literacy (Gunawan, Aungsuroch, Marzilli, et al., 2024; Karaçay and Yaşar, 2024; Simms, 2024), nursing educators could teach students to critically evaluate AI responses and hone their literature-searching abilities. An additional way to improve the accuracy and referencing of AI-generated responses could be retrieval-augmented generation (RAG), which grounds responses in key textbooks and articles on the fundamentals of nursing research (Elkin et al., 2025).
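To make the RAG suggestion concrete, a minimal sketch is shown below. It uses a simple TF-IDF retriever as a stand-in for a proper embedding index, and the passages, question and prompt wording are illustrative assumptions rather than materials from this study.

```python
# Conceptual sketch of retrieval-augmented generation (RAG): retrieve trusted passages
# and ground the LLM prompt in them. TF-IDF stands in for an embedding-based retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Primary research collects new data directly from participants.",
    "Secondary research analyses data collected in earlier studies.",
    "A randomized controlled trial allocates participants to groups by chance.",
]
question = "What are the differences between primary and secondary research in nursing?"

vectorizer = TfidfVectorizer().fit(passages + [question])
similarity = cosine_similarity(
    vectorizer.transform([question]), vectorizer.transform(passages)
)[0]
top_passages = [passages[i] for i in similarity.argsort()[::-1][:2]]  # keep the top-2 passages

grounded_prompt = (
    "Answer using only the sources below and cite them.\n\nSources:\n- "
    + "\n- ".join(top_passages)
    + f"\n\nQuestion: {question}"
)
print(grounded_prompt)  # this grounded prompt would then be sent to the LLM
```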
Despite their promising ability to answer foundational nursing research questions, all AI models in this study showed occasional inaccuracies or omissions and tended to generate responses with low levels of critical analysis. Previous studies similarly reported that LLMs could achieve around 60–80 % accuracy on knowledge-based questions and only 50–60 % accuracy on questions requiring critical thinking (Kung et al., 2023; Most et al., 2024; Simms, 2025). These limitations arise from an AI model’s dependence on the quality of its training data, which limits its grasp of complex concepts and can lead to biased and overly general responses (Antu et al., 2023). The inability of AI models to critically analyze data is particularly concerning in nursing research education, where students are expected to appraise evidence and synthesize insights. Over-reliance on AI may lead to superficial understanding, diminished critical thinking skills and ethical issues such as plagiarism and academic dishonesty (Lund et al., 2025; Zhang et al., 2024). Therefore, it is essential for nursing educators to address these concerns in AI literacy training. Given these risks, AI tools should be used only as supplementary aids that support learning and encourage deeper engagement through structured coursework, discussion and hands-on application (Funa and Gabay, 2025).
This study found patterns in the domain scores of AI-generated responses across Bloom’s taxonomy levels. AI models demonstrated higher accuracy, relevance and critical interpretation for the “Analyze” and “Evaluate” questions, likely due to these questions’ higher specificity and need for structured reasoning. For example, one “Evaluate” level question was: “How to decide whether to conduct primary or secondary research in nursing?” Conversely, responses to the “Remember” and “Understand” questions scored lowest in critical thinking, since these questions focus more on rote recall than interpretative reasoning (Armstrong, 2010). Interestingly, AI models generated more examples for the “Remember” and “Evaluate” questions, likely because these questions lend themselves to straightforward examples. Despite domain score variations across Bloom’s taxonomy levels, AI models performed adequately at all six levels. As Bloom’s taxonomy promotes progressively deeper cognitive engagement among learners (Armstrong, 2010; Momen et al., 2022), undergraduate nursing students could potentially deepen their research understanding by engaging with AI using questions from all Bloom’s taxonomy levels. Hence, Bloom’s taxonomy could be incorporated into students’ prompt engineering lessons to encourage deeper self-directed learning with AI.
This study found no distinct difference in AI models’ performance between Prompt-2 and Prompt-3, indicating that both prompts could potentially generate responses that support nursing students’ learning despite their structural differences (Prompt-2: professor’s viewpoint; Prompt-3: student’s viewpoint). With the ability to simplify research concepts, AI models can not only support undergraduate nursing students’ learning but also serve as valuable training tools for educators. A recent review highlighted AI’s potential in lesson planning and educator training (Tan et al., 2025). Building on this, future nursing educators could leverage AI tools to enhance their proficiency in teaching foundational research concepts.
This study showed that both ChatGPT-4o and ChatGPT-4o Mini outperformed Perplexity AI across all prompts, likely due to their conversational and explanatory style, which is closer to human interaction than Perplexity AI’s concise and factual approach (Jo and Park, 2024). Hence, both ChatGPT-4o and ChatGPT-4o Mini mimicked effective teaching techniques used by teachers, generating more comprehensive and clearer responses with examples, which may make them more proficient at explaining complex research concepts (Arora, 2024; Jo and Park, 2024). Since ChatGPT-4o and ChatGPT-4o Mini demonstrated comparable performance, they could be implemented as cost-effective tools for teaching and learning nursing research concepts, especially in resource-limited settings. As LLMs advance, future research could explore new AI models and features for teaching research concepts, such as the deep research features offered by ChatGPT models and Perplexity AI (OpenAI, 2025; Perplexity Team, 2025).
6 Strengths and limitations
As this study was conducted using English-language prompts by researchers fluent in English, the findings may not be generalizable to non-English speakers or users who interact with AI models in other languages. Since only three AI models were evaluated on 41 foundational research questions, the findings may not generalize to all AI models, emerging research topics or the process of conducting independent research. Because only two simple structured prompts were tested, this study does not reflect AI models’ performance with more complex or nuanced prompts. As this study did not involve data collection from nursing students, the findings reflect only educators’ perspectives on AI’s potential impact on student learning. The assumption that higher-quality AI-generated responses lead to improved learning outcomes may not hold in practice, given the complex interplay of academic, personal, social and demographic factors influencing student learning (Al-Tameemi et al., 2023). Therefore, no definitive conclusions can be drawn about AI’s effect on actual nursing students’ research learning outcomes based on this study. Advanced features such as deep research offered by ChatGPT models and Perplexity AI were not assessed because they were unavailable for public use during the study period (10 January to 14 February 2025). This study only assessed AI response inconsistencies between two researchers, limiting its ability to estimate variability across multiple users. Since the researchers generated and evaluated all AI responses, potential detection bias was introduced; however, efforts to minimize this bias included an independent rating process and extensive discussions to reach consensus on all responses’ domain scores. As Perplexity AI retrieves real-time information when answering questions, this study did not control for the possible effects of time on its responses. However, since all AI-generated responses were retrieved within a short timeframe of one month (10 January to 14 February 2025) and nursing research methodology is a relatively non-time-sensitive topic, the temporal effect on Perplexity AI’s responses is likely negligible. Lastly, this study’s findings reflect only the current performance of ChatGPT-4o, ChatGPT-4o Mini and Perplexity AI in answering foundational research-related questions, and the quality of AI-generated responses is likely to improve as the technology advances.
Despite these limitations, this study is the first to explore how AI models can potentially support undergraduate nursing students in learning foundational research concepts, paving the way for future research. Using Bloom’s taxonomy as a pedagogical framework to guide the question development process also enhanced this study’s rigor. Lastly, the combination of quantitative scoring with qualitative insights from nursing research educators also enhances the depth of this study’s findings.
7 Implications for future practice and research
Nursing educators can integrate AI models as supplementary tools to support students’ learning of foundational research concepts. To do so effectively, educators may benefit from training in AI literacy and prompt engineering aligned with course objectives. Using structured questions across all levels of Bloom’s taxonomy may further enhance students’ engagement and comprehension. Given both the potential benefits and risks of AI in nursing research education, integrating AI literacy and prompt engineering into undergraduate curricula can equip students to use AI tools responsibly and effectively. However, due to the potential for inaccuracies and limited critical analysis in AI-generated responses, these tools should not replace traditional coursework or educators’ guidance.
Future studies could examine the impact of AI models on nursing students’ research learning experiences and competencies by directly assessing student engagement and academic outcomes. Given their strong performance, ChatGPT-4o and ChatGPT-4o Mini may be used to explore the effects of more complex, structured prompts on response quality. Research could also investigate how AI supports nursing educators’ pedagogical approaches and how RAG techniques enhance response accuracy. Additionally, future work may explore other AI models and advanced features, such as the deep research functions available in ChatGPT and Perplexity AI.
8 Conclusion
This study’s mixed-method approach bridged the technical performance metrics of AI models with nursing educators’ perspectives on research education. Findings suggest that AI models, particularly ChatGPT-4o and ChatGPT-4o Mini, have the potential to support undergraduate nursing students’ learning of foundational research concepts across all levels of Bloom’s taxonomy, especially when structured prompts are used. However, as the results are based solely on educators’ insights, the impact of AI on actual student learning outcomes warrants further investigation. Nursing educators are encouraged to explore AI as a supplementary tool in research instruction. Both educators and students may benefit from targeted training in AI literacy and prompt engineering. Future research should examine the use of RAG to improve AI response accuracy and explore how AI can enhance teaching practices. Given the associated risks, such as misinformation and academic integrity breaches, AI should be used to complement, not replace, students’ active engagement with course content and educator guidance.
CRediT authorship contribution statement
Choolani Mahesh: Writing – review & editing, Visualization, Validation. Li Sarah WL: Writing – review & editing, Visualization, Validation. Foo Lin: Writing – review & editing, Visualization, Validation. Travis Lanz-Brian PEREIRA: Writing – review & editing, Writing – original draft, Visualization, Validation. Shorey Shefaly: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Ng Jamie Qiao Xin: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Formal analysis, Data curation. Chua Joelle Yan Xin: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Formal analysis, Data curation.
Declaration of Generative AI and AI-assisted technologies in the writing process
During the preparation of this work the authors used ChatGPT-4o Mini / OpenAI to do language editing of our work and ensure that the paper fits the journal’s word limit. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A Supporting information
Supplementary data associated with this article can be found in the online version.
Table 1. Exact phrasing of the three question prompts.

| Prompt number | Exact phrasing and example |
| --- | --- |
| Prompt-1 (only the question itself) | Example of one question: What are the differences between research aim, research question, and research hypothesis? |
| Prompt-2 (professor’s perspective) | You are an esteemed professor in nursing research with more than a decade of experience in teaching new undergraduate students about nursing research. These students are novice researchers learning about research for the first time. Bearing in mind your role, you will need to answer the following question in a manner that these students nurses can comprehend: What are the differences between research aim, research question, and research hypothesis? (Example of one question) Please provide references (this phrase is only added for ChatGPT-4o and ChatGPT-4o Mini) |
| Prompt-3 (student’s perspective) | I am an undergraduate university nursing student who is learning about nursing research for the first time. Please answer the following question in a manner that I can comprehend: What are the differences between research aim, research question, and research hypothesis? (Example of one question) Please provide references (this phrase is only added for ChatGPT-4o and ChatGPT-4o Mini) |
Table 2. Mean domain scores of the three AI models and inter-rater reliability (IRR) under each prompt.

Prompt 1

| Domain/IRR | ChatGPT-4o | ChatGPT-4o Mini | Perplexity AI |
| --- | --- | --- | --- |
| Accuracy | 4.27 | 4.43 | 3.93 |
| Relevance | 4.12 | 4.64 | 3.58 |
| Clarity and structure | 3.46 | 4.77 | 3.41 |
| Examples provided | 3.92 | 4.26 | 1.73 |
| Critical thinking | 3.31 | 3.33 | 1.99 |
| Referencing | - | - | 2.43 |
| % agreement, Kappa value | 86, 0.87 | 89, 0.90 | 87, 0.86 |

Prompt 2

| Domain/IRR | ChatGPT-4o | ChatGPT-4o Mini | Perplexity AI |
| --- | --- | --- | --- |
| Accuracy | 4.56 | 4.43 | 4.25 |
| Relevance | 4.63 | 4.64 | 3.99 |
| Clarity and structure | 4.92 | 4.77 | 3.99 |
| Examples provided | 4.57 | 4.26 | 2.20 |
| Critical thinking | 3.60 | 3.34 | 3.17 |
| Referencing | 2.91 | 3.20 | 2.76 |
| % agreement, Kappa value | 90, 0.91 | 87, 0.88 | 85, 0.86 |

Prompt 3

| Domain/IRR | ChatGPT-4o | ChatGPT-4o Mini | Perplexity AI |
| --- | --- | --- | --- |
| Accuracy | 4.52 | 4.41 | 4.24 |
| Relevance | 4.40 | 4.50 | 4.08 |
| Clarity and structure | 4.80 | 4.87 | 4.21 |
| Examples provided | 4.54 | 4.38 | 1.73 |
| Critical thinking | 3.35 | 3.12 | 2.71 |
| Referencing | 3.34 | 2.68 | 2.41 |
| % agreement, Kappa value | 89, 0.88 | 86, 0.85 | 88, 0.89 |
Table 3. Mean domain scores of the three AI models across Bloom’s taxonomy levels under each prompt (4o = ChatGPT-4o; 4o Mini = ChatGPT-4o Mini; Perplex = Perplexity AI).
| Prompt 1 | ||||||||||||||||||
| Level | Remember | Understand | Apply | Analyze | Evaluate | Create | ||||||||||||
| AI model | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex |
| Accuracy | 3.92 | 4.17 | 3.58 | 3.94 | 4.31 | 4.00 | 4.07 | 4.14 | 4.00 | 4.70 | 4.60 | 4.10 | 4.56 | 4.78 | 3.94 | 4.42 | 4.17 | 3.75 |
| Relevance | 4.00 | 3.50 | 3.25 | 3.81 | 3.56 | 3.50 | 3.93 | 3.79 | 3.71 | 4.30 | 4.10 | 3.90 | 4.28 | 4.44 | 3.67 | 4.42 | 4.25 | 3.25 |
| Clarity and structure | 3.25 | 2.92 | 3.25 | 3.63 | 3.19 | 3.69 | 3.50 | 3.29 | 3.21 | 3.70 | 3.50 | 3.20 | 3.44 | 3.44 | 3.39 | 3.17 | 3.08 | 2.58 |
| Examples provided | 4.50 | 4.50 | 1.58 | 2.94 | 3.94 | 1.94 | 4.21 | 4.71 | 1.50 | 3.80 | 4.10 | 1.40 | 4.00 | 4.00 | 2.78 | 3.58 | 4.00 | 2.42 |
| Critical thinking | 3.08 | 3.17 | 1.50 | 2.56 | 2.75 | 2.25 | 3.21 | 3.43 | 1.79 | 3.30 | 3.70 | 2.50 | 4.06 | 3.89 | 1.94 | 3.50 | 3.58 | 2.50 |
| Referencing | - | - | 1.83 | - | - | 3.06 | - | - | 2.71 | - | - | 2.20 | - | - | 2.61 | - | - | 3.33 |
| Prompt 2 | ||||||||||||||||||
| Level | Remember | Understand | Apply | Analyze | Evaluate | Create | ||||||||||||
| AI model | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex |
| Accuracy | 4.42 | 4.25 | 4.17 | 4.56 | 4.25 | 4.25 | 4.36 | 4.00 | 4.14 | 4.70 | 4.60 | 4.60 | 4.72 | 4.78 | 4.33 | 4.50 | 4.83 | 3.92 |
| Relevance | 4.75 | 4.58 | 3.92 | 4.50 | 4.44 | 4.06 | 4.36 | 4.50 | 3.86 | 4.80 | 4.70 | 4.20 | 4.83 | 4.78 | 4.00 | 4.50 | 4.92 | 3.83 |
| Clarity and structure | 5.00 | 4.58 | 4.17 | 4.88 | 4.69 | 4.06 | 4.86 | 4.79 | 4.00 | 5.00 | 4.90 | 4.20 | 4.39 | 4.44 | 3.61 | 4.83 | 4.42 | 2.83 |
| Examples provided | 4.75 | 5.00 | 3.25 | 4.50 | 3.75 | 2.13 | 4.93 | 4.43 | 2.43 | 4.00 | 3.70 | 2.30 | 4.78 | 4.67 | 2.89 | 4.50 | 3.67 | 2.42 |
| Critical thinking | 3.50 | 3.00 | 2.92 | 2.75 | 2.44 | 2.63 | 3.64 | 3.43 | 3.21 | 4.00 | 3.70 | 3.70 | 4.28 | 3.89 | 3.50 | 3.50 | 3.75 | 3.08 |
| Referencing | 3.08 | 3.33 | 2.25 | 3.19 | 3.13 | 2.88 | 2.71 | 3.21 | 3.21 | 2.80 | 3.10 | 2.90 | 3.11 | 3.39 | 2.50 | 2.67 | 3.00 | 3.33 |
| Prompt 3 | ||||||||||||||||||
| Level | Remember | Understand | Apply | Analyze | Evaluate | Create | ||||||||||||
| AI model | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex | 4o | 4o Mini | Perplex |
| Accuracy | 4.42 | 4.17 | 4.08 | 4.56 | 4.25 | 4.25 | 4.36 | 4.43 | 4.21 | 4.80 | 4.30 | 4.30 | 4.56 | 4.78 | 4.33 | 4.33 | 4.33 | 3.92 |
| Relevance | 4.33 | 4.42 | 4.08 | 4.25 | 4.25 | 4.19 | 4.36 | 4.64 | 4.21 | 4.60 | 4.70 | 4.30 | 4.44 | 4.67 | 3.89 | 4.50 | 4.50 | 3.83 |
| Clarity and structure | 4.92 | 4.75 | 4.08 | 4.88 | 4.63 | 4.13 | 4.71 | 4.86 | 4.36 | 4.80 | 4.90 | 4.10 | 4.33 | 4.44 | 3.72 | 4.17 | 4.67 | 3.17 |
| Examples provided | 4.92 | 4.75 | 2.50 | 4.50 | 3.50 | 1.31 | 4.86 | 4.43 | 1.86 | 4.20 | 4.10 | 2.00 | 4.72 | 4.83 | 3.33 | 3.50 | 4.00 | 2.17 |
| Critical thinking | 3.00 | 2.50 | 2.17 | 2.88 | 2.50 | 2.50 | 3.43 | 2.86 | 2.64 | 3.20 | 3.20 | 3.10 | 3.89 | 3.78 | 2.94 | 3.67 | 3.25 | 2.83 |
| Referencing | 3.58 | 2.67 | 2.33 | 3.44 | 2.69 | 2.38 | 3.43 | 2.71 | 3.14 | 3.10 | 2.70 | 2.10 | 3.33 | 2.72 | 2.06 | 3.17 | 2.50 | 2.33 |
Table 4. Qualitative responses from researchers.

1) What are the benefits and challenges of using AI models to teach undergraduate nursing students about research?

Benefits:

• “I think that students would enjoy using ChatGPT and Perplexity to answer research questions. The answers are generated very quickly and if they don’t understand anything, they can easily clarify without feeling judged.” (Researcher 1)
• “Using ChatGPT and Perplexity to answer research questions is very convenient. The answers are generated almost immediately, and I think that students will find this process much faster than manually searching through Google for answers or reading through textbooks, slides, notes, and journal articles. It is a very appealing method for student nurses to learn research.” (Researcher 2)
• “The integration of AI models into nursing education offers several advantages in teaching research. First, AI-driven tools provide immediate and personalized feedback, allowing students to clarify concepts and reinforce their understanding in real time. Second, AI models can enhance accessibility and flexibility, enabling students to engage with research-related content at their own pace. Third, AI-powered tutoring systems can support self-directed learning, guiding students through research methodologies, data analysis, and critical appraisal of evidence. Additionally, AI models can facilitate interactive learning by generating research scenarios, case studies, and quizzes tailored to individual learning needs.” (Researcher 3)
• “AI can be a huge help for students who struggle with research concepts. It acts like a tutor, guiding them through the basics before they dive into the more complicated stuff. Plus, if AI tools are made free or affordable, it could help level the playing field for students who don’t have access to expensive textbooks or other learning resources.” (Researcher 4)
• “AI tools like ChatGPT and Perplexity AI are a game changer for students because they provide instant responses. Instead of getting stuck on a concept and feeling frustrated, students can clear up their doubts in real time. This keeps them focused and helps them reinforce their understanding without distractions.” (Researcher 4)
• “AI platforms feel more interactive and personalized, which makes learning research more engaging. Traditional research teaching can sometimes feel dry or too dependent on how well the instructor teaches. AI gives students an alternative way to learn that’s more in tune with their pace and style. Instead of being bombarded with tons of information at once, AI breaks things down gradually in a conversational way, making it feel more approachable.” (Researcher 4)
• “One of the best things about AI is how much time it saves. Instead of spending hours searching through multiple sources, trying to piece together information, students can use AI to quickly summarize journal articles and get key insights. This means they can focus more on understanding and analysing the research rather than just collecting information.” (Researcher 4)

Challenges:

• “Students may not know what questions to ask if they have no foundational knowledge of research. This could affect their learning because the information provided by ChatGPT and Perplexity highly depends on the users’ input.” (Researcher 1)
• “Students would still need to be cautious when interpreting the answers provided by these platforms and check if the information provided aligns with what they are taught in school.” (Researcher 2)
• “Currently, ChatGPT and Perplexity ‘vomits’ out a lot of information in their answers. Students might struggle to organize and draw links between all the information provided. This organization of information might be challenging for students who are completely new to research as they may not know which points to prioritize.” (Researcher 2)
• “Despite these benefits, there are notable challenges. One concern is accuracy and reliability, as AI-generated responses may occasionally include misinformation or lack context-specific insights. Students must develop critical thinking skills to assess AI outputs critically. Additionally, there is a risk of over-reliance on AI, which may hinder the development of independent research and analytical skills. Another challenge is ethical considerations, particularly regarding data privacy and the responsible use of AI-generated content. Lastly, faculty readiness and institutional support are crucial for the successful integration of AI tools, as educators need training to effectively incorporate AI into teaching and assessment.” (Researcher 3)
• “AI often generates long, detailed responses, which can be too much for students who are just starting out. If students don’t let AI know they’re beginners, the system might assume they have more background knowledge than they actually do, making the response confusing rather than helpful.” (Researcher 4)
• “AI isn’t perfect. It can sometimes generate incorrect information, false citations, or rely on sources that aren’t the best quality—especially since it can’t access most subscription-based academic journals. Sometimes, it just skims abstracts rather than analysing full research papers, which means students might miss out on important details.” (Researcher 4)
• “A lot of students don’t know how to ask the right questions to get useful AI responses. This can create a big learning gap—if schools don’t teach students how to write good prompts, those who struggle with language or aren’t as tech-savvy might end up falling behind. AI could unintentionally favour students who already have strong digital literacy skills, making things even harder for those who don’t.” (Researcher 4)

2) What recommendations do you have for using AI models to support undergraduate nursing students in learning research?

• “Students could benefit from using ChatGPT and Perplexity to answer research questions and learn the foundations of research. This valuable resource can supplement their formal lectures and tutorials on research.” (Researcher 1)
• “Students could use ChatGPT and Perplexity in their own time to learn more about research and reinforce the knowledge learnt from their formal school lessons.” (Researcher 2)
• “In this study, we found that the answers provided by ChatGPT and Perplexity improved in clarity when the AI models were prompted to answer the questions in a manner suitable for student nurses to comprehend. Hence, faculty members could guide students on how to use ChatGPT and Perplexity so that they can maximize the benefits gained from these AI models. Students could also learn how to develop their own short prompts to take charge of their own learning.” (Researcher 1)
• “Since students might not know what questions to ask ChatGPT and Perplexity, faculty members could curate a list of suggested questions based on the students’ syllabus for students to use when interacting with the AI models. Faculty members could also provide some prompt suggestions for students to use so that the AI models can provide them with more relevant responses to their questions.” (Researcher 2)
• “Use AI as a Supplement, Not a Replacement – AI should be integrated as a supportive learning tool rather than a substitute for traditional research instruction. Faculty should guide students in using AI critically and ethically.” (Researcher 3)
• “Incorporate Structured Prompts – Structured and well-designed prompts can help ensure more accurate and contextually relevant AI-generated responses, improving learning outcomes.” (Researcher 3)
• “Encourage Critical Appraisal Skills – Students should be trained to critically evaluate AI-generated content, cross-referencing it with credible sources to develop their research literacy.” (Researcher 3)
• “Promote Ethical AI Use – Institutions should establish guidelines on responsible AI usage, addressing issues such as academic integrity, data privacy, and plagiarism prevention.” (Researcher 3)
• “Provide Faculty Training – Educators should receive training on AI tools to effectively integrate them into research education and guide students in using AI appropriately.” (Researcher 3)
• “Use AI for Interactive Learning – AI-driven simulations, chat-based research assistants, and automated assessments can enhance engagement and provide personalized learning experiences.” (Researcher 3)
• “Make AI Literacy Part of the Curriculum: AI isn’t going anywhere, so we might as well teach students how to use it properly. This means helping them understand how to evaluate AI-generated content, cross-check information with credible sources, and use AI responsibly. Schools should also set clear guidelines on plagiarism and ethical AI use so students don’t unintentionally misuse it.” (Researcher 4)
• “Train Educators on AI Too: It’s not just students who need AI training—faculty members should also get comfortable using these tools so they can integrate them into their teaching. Yes, there will be resistance, but this is just like how people once resisted computers or online learning (before Covid). The sooner we embrace AI, the better we can guide students on how to use it effectively.” (Researcher 4)
• “Encourage Student-Led AI Exploration: Instead of just giving students AI-generated answers, we should encourage them to experiment with AI themselves—learning how to refine their prompts, analyse responses, and think critically about what they read. This will make them more independent learners and better researchers in the long run.” (Researcher 4)
© 2025 Elsevier Ltd