Background
Examinations are used widely in nursing education to evaluate knowledge attainment. New item types were introduced in April 2023 by the National Council of State Boards of Nursing (NCSBN) for use on the Next Generation National Council Licensure Examination for Registered Nurses (NGN NCLEX-RN). Little evidence exists regarding how much time is needed for exams that use the new item types.
Method
Item analysis was conducted on 772 questions that were answered by 14,728 nursing students. Seven item types were analyzed, including calculation, cloze, matrix, ordered response, hot spot, multiple choice, and multiple response.
Results
Calculation questions required the most time. Multiple-choice items required approximately 1 minute per question. More difficult questions required more time, and exams administered early in the nursing curriculum need more time allocated.
Conclusion
Nursing educators should base examination time allotments on item type and item difficulty.
Examinations are used throughout the prelicensure nursing education curriculum to measure student competency and monitor performance throughout a given program. Various exam item types are used to assess student knowledge and have differing effects on student response time. Students' cumulative performance on these item types determines whether their test-taking abilities are satisfactory for the respective licensure examination.
The National Council Licensure Examination for Registered Nurses (NCLEX-RN®), administered by the National Council of State Boards of Nursing (NCSBN), uses computerized adaptive testing (CAT) to test RN competency in the four major Client Needs categories: (1) Safe and Effective Care Environment; (2) Health Promotion and Maintenance; (3) Psychosocial Integrity; and (4) Physiological Integrity. In April 2023, the latest iteration, the Next Generation NCLEX® (NGN), became effective. The new exam format includes case studies and bowtie item types to assess students' application of the Clinical Judgment Measurement Model (CJMM), and it employs several new exam item types to measure these core competencies, including extended multiple response, cloze, enhanced hot spot, and matrix question types (NCSBN, 2023). Thus, nursing candidates are now required to answer more item types and formats than in previous iterations. The new format raises the question of how the amount of time allotted per exam item is determined, which may affect how nursing programs design their own exams and prepare students to sit for the licensure examination.
Students differ in how quickly they answer questions during exams, and different item types require varying amounts of cognitive effort and time to answer (Klein Entink et al., 2009). Students do not divide their time evenly among exam items, as some items require multiple steps and greater working memory capacity to arrive at an answer (Klein Entink et al., 2009). This makes calculating how much time is required per question more complicated. Item complexity and difficulty, along with the examinee's academic ability and working speed, must be considered together with test format and question type when estimating response time.
Instructors often employ a “rule of thumb” to give students approximately 1 minute per multiple-choice question (Brothen, 2012). Several education instruction manuals reference this general rule, yet none cite any source from which this rule is derived (Bridgeman et al., 2007; Renner & Renner, 1999; Svinicki & McKeachie, 2014). Despite this, there is general agreement within these sources that 1 minute per question is a generous time estimate.
Many examinee characteristics can account for variable response times in exam taking. One such factor is whether students are classified as high or low achieving. High achievers spend less time on exam questions than low-achieving students (Brothen, 2012; Tsaousis et al., 2018). Because of this, longer response times often are associated with incorrect responses (Brothen, 2012; Tsaousis et al., 2018). This trend holds for nonadaptive exams but is not necessarily true for adaptive exams such as the NCLEX. In adaptive exams, higher achieving students receive more difficult exam items as the exam progresses, increasing candidate response times (Bridgeman et al., 2007).
Response times have been analyzed by standardized college entrance boards such as the College Board, which administers the Scholastic Aptitude Test (SAT). SAT CAT responses collected by the College Board pointed to differences in response time within several sections of the exam, including the verbal and math portions. On average, students took 30 seconds to complete verbal analogy items. In this same report, math computations took longer than verbal items, with a mean of 92 seconds and a standard deviation of 25 (Bridgeman et al., 2007). Another study investigated differences in physician response times between multiple choice format and complex graphic-intensive multiple response (CGIMR) item format questions on a medical certification exam. The CGIMR format tested physicians' abilities to interpret electrocardiograms. The results showed that multiple choice format questions on average took less time to complete than other types analyzed (Hess et al., 2013).
After a formative or summative nursing examination, faculty should evaluate item and exam performance prior to releasing exam results. The difficulty index, or p value, is calculated by dividing the number of test takers who answered the item correctly by the total number of students. If the difficulty index is greater than 0.75, the item is categorized as easy; if it is below 0.25, the item is categorized as difficult. Point biserial (PB) is a measure of item discrimination used to determine the correlation between students' scores on dichotomous items (i.e., either right or wrong) and their total exam scores. It indicates the strength and direction of the relationship between one dichotomous variable (the item) and one continuous variable (the total exam score). PB values between 0.2 and 0.29 are considered reasonably good, and a value of 0.3 or higher represents a particularly good item. As mentioned, a point biserial can be calculated only on dichotomously scored items (Lane et al., 2015). The discrimination index measures item quality by comparing the top-performing students on the overall exam with the lowest performers. A higher discrimination index indicates that an item appropriately differentiates between high and low performers and that correct responses are less likely to be due to chance. The goal is for each item's discrimination index to be greater than 0.15 (Bridge et al., 2003).
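To make these definitions concrete, the following is a minimal sketch of how the three indices could be computed for dichotomously scored items from a students-by-items matrix of 0/1 scores. The function name is ours, and the 27% upper/lower grouping used for the discrimination index is an assumed convention; this study does not specify a grouping rule.

```python
import numpy as np

def item_statistics(scores: np.ndarray):
    """Difficulty index, point biserial, and discrimination index for
    dichotomously scored (0/1) items.

    scores: 2-D array of shape (n_students, n_items) with entries 0 or 1.
    Returns one value per item for each statistic.
    """
    n_students, n_items = scores.shape
    totals = scores.sum(axis=1)  # each student's total exam score

    # Difficulty index (p value): proportion of test takers answering correctly.
    difficulty = scores.mean(axis=0)

    # Point biserial: correlation between the 0/1 item score and the total score.
    point_biserial = np.array(
        [np.corrcoef(scores[:, j], totals)[0, 1] for j in range(n_items)]
    )

    # Discrimination index: difficulty in the top-scoring group minus difficulty
    # in the bottom-scoring group (27% upper/lower split is an assumption here).
    k = max(1, int(round(0.27 * n_students)))
    order = np.argsort(totals)
    discrimination = scores[order[-k:]].mean(axis=0) - scores[order[:k]].mean(axis=0)

    return difficulty, point_biserial, discrimination
```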
Existing research on item response time is minimal, likely because proprietary testing companies conduct most item-timing research (Brothen, 2012). This study investigated item analysis and response times for various question types on nursing examinations.
Method
This project used a descriptive, comparative design with retrospective data. Institutional review board approval with exempt status was obtained. Standardized exams administered to nursing students during the fall of 2023 were analyzed.
Exam items were retrieved from two fundamentals exams, two medical-surgical exams, and two readiness exams. The fundamentals exams had 50 questions each, the medical-surgical exams had 75 questions each, and the readiness exams had 100 questions each. The exams were administered through an electronic testing platform; the data retrieved included question type, number of questions per exam, time per item, difficulty index, point biserial, and discrimination index.
There were seven types of questions on the exams: (1) calculation items require the student to perform a mathematical function to determine the correct dose of a medication; (2) drop-down cloze items include fill-in-the-blank sentences in which each blank provides the test taker with a drop-down list of options; (3) matrix items can be either multiple choice or multiple response and are set up using a grid or table; (4) ordered response items require the student to arrange the options in sequential order; (5) hot spot items require the student to click on an area of a graphic image or patient chart to answer the question correctly; (6) multiple choice items allow the student to choose one correct answer out of four options; and (7) multiple response items allow the student to choose more than one correct answer from up to 10 options. Several item types, including calculation, matrix, ordered response, hot spot, and multiple-choice items, were scored dichotomously, as either completely right or wrong. Cloze, matrix, and multiple response items were scored polytomously, as these items could have more than one correct answer. Data were analyzed using SPSS® version 28.0. Descriptives, t tests, and analysis of variance (ANOVA) were performed on the dataset.
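The analyses reported here were run in SPSS. Purely as an illustration of the class of tests used, the sketch below runs the same kinds of comparisons in Python on simulated item-level data; the column names, values, and random data are hypothetical and are not the study's dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Simulated item-level table standing in for the exported exam report;
# column names, values, and random data are illustrative only.
item_types = ["calculation", "cloze", "matrix", "ordered response",
              "hot spot", "multiple choice", "multiple response"]
exam_types = ["fundamentals", "medical-surgical", "readiness"]
items = pd.DataFrame({
    "item_type": rng.choice(item_types, size=772),
    "exam_type": rng.choice(exam_types, size=772),
    "response_time": rng.normal(65.5, 29.7, size=772).clip(5),  # seconds
})

# Descriptives: mean (SD) response time per item type.
print(items.groupby("item_type")["response_time"].agg(["count", "mean", "std"]))

# One-way ANOVA: does mean response time differ across the seven item types?
groups = [g.to_numpy() for _, g in items.groupby("item_type")["response_time"]]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.1f}, p = {p_value:.4f}")

# Independent-samples t test: medical-surgical vs. readiness response times.
ms = items.loc[items["exam_type"] == "medical-surgical", "response_time"]
rd = items.loc[items["exam_type"] == "readiness", "response_time"]
t_stat, p_value = stats.ttest_ind(ms, rd)
print(f"t test: t = {t_stat:.2f}, p = {p_value:.4f} (two-tailed)")
```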
Results
Our analytic plan evaluated multiple facets of the dataset. Initially, we examined response time by item type. We then examined response time by exam type (i.e., fundamentals, medical-surgical, and readiness). The next step was to examine response time in relation to item analysis: we compared response time across the difficulty index, divided into three levels of difficulty (low, medium, and high). Finally, the point biserial and discrimination index were evaluated by item type. This required multiple ANOVAs.
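As a rough companion to this plan, the sketch below shows one way the difficulty-index grouping and the Tukey post hoc comparisons could be carried out. The data are simulated, the column names are hypothetical, and the cut points follow the difficulty thresholds described earlier.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)

# Illustrative item-level data (not the study's dataset).
items = pd.DataFrame({
    "difficulty_index": rng.uniform(0.05, 0.95, size=772),
    "response_time": rng.normal(65.5, 29.7, size=772).clip(5),  # seconds
})

# Group items into the three difficulty levels described in the text:
# difficult (< 0.25), medium, and easy (> 0.75).
items["difficulty_level"] = pd.cut(
    items["difficulty_index"],
    bins=[0.0, 0.25, 0.75, 1.0],
    labels=["difficult", "medium", "easy"],
)

# Tukey HSD post hoc comparisons of response time across difficulty levels.
tukey = pairwise_tukeyhsd(
    endog=items["response_time"],
    groups=items["difficulty_level"].astype(str),
    alpha=0.05,
)
print(tukey.summary())
```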
The exam items were administered to 14,728 students from associate and baccalaureate degree nursing programs in the United States. The number of exam takers by exam type is listed in Table 1. A total of 772 items were analyzed, and the mean response time across all item types was 65.5 seconds (SD = 29.7). The mean point biserial for the 512 dichotomous items was 0.22 (SD = 0.09), as listed in Table 2. Response times by difficulty index for all items are listed in Table 3.
Table 1. Examination Takers, Questions, and Mean Response Time by Examination Type

| Examination Type | Examination Takers, n (%) | Questions, n (%) | Mean (SD) Response Time in Seconds for Each Question |
|---|---|---|---|
| Fundamentals | 5,697 (38.7) | 150 (19.4) | 74.8 (32.8) |
| Medical surgical | 3,679 (25.0) | 225 (29.1) | 60.4 (30.4) |
| Readiness | 5,352 (36.3) | 397 (51.4) | 64.9 (27.3) |
| Total | 14,728 | 772 | 65.5 (29.7) |
Table 2. Questions, Response Time, and Point Biserial by Item Type

| Item Type | Questions, n (%) | Mean (SD) Response Time in Seconds | Mean (SD) Point Biserial |
|---|---|---|---|
| Calculation | 35 (5) | 110.60 (49.6) | 0.31 (.08) |
| Cloze^a | 46 (6) | 97.80 (28.4) | – |
| Matrix^a | 54 (7) | 90.15 (31.9) | – |
| Ordered response | 28 (3) | 74.96 (22.1) | 0.20 (.07) |
| Hot spot | 40 (5) | 72.80 (29.7) | 0.23 (.11) |
| Multiple choice | 393 (51) | 52.73 (16.4) | 0.21 (.08) |
| Multiple response^a | 176 (23) | 65.91 (27.7) | – |
| Total | 772 | 65.51 (29.7) | 0.22 (.08) |

^a Point biserial was not calculated for these item types, which include polytomously scored items.
Table 3. Response Time by Difficulty Index

| Difficulty Index | Questions (n) | Mean (SD) Response Time in Seconds |
|---|---|---|
| <0.25 | 34 | 80.71 (41.7) |
| 0.26 to 0.74 | 464 | 70.05 (30.3) |
| >0.75 | 274 | 55.73 (23.8) |
| Total | 772 | 65.43 (29.7) |
An ANOVA was conducted to compare response time across the different item types. There was a significant difference in response time among the item types at the p < .05 level, F(6, 766) = 61.1, p < .001. Tukey post hoc test results showed that more time was spent answering calculation items (M = 110.6 [SD = 49.5]) than multiple-choice items (M = 52.7 [SD = 16.3]). When calculation items were removed, there remained a statistically significant difference in response time among the other question types, F(5, 731) = 56.6, p < .001.
An ANOVA was conducted to compare response time across the different exams: (1) fundamentals; (2) medical-surgical; and (3) readiness. There was a statistically significant difference in time spent per question among the three exam types, F(2, 759) = 10.9, p < .001. Tukey post hoc results revealed that the fundamentals exam had a longer answer time (M = 74.7 [SD = 32.8]) than the medical-surgical exam (M = 60.4 [SD = 30.3]). A t test comparing answer time on the medical-surgical (M = 60.4 [SD = 30.3]) and readiness (M = 64.9 [SD = 27.3]) exams was not statistically significant (p = .06, two-tailed).
Items were grouped according to the difficulty index levels listed in Table 3. An ANOVA comparing response time across the difficulty levels showed a statistically significant difference, F(2, 770) = 26.3, p < .001. Tukey post hoc results identified that questions with a difficulty index < 0.25 had a mean response time of 80.7 seconds (SD = 41.6), compared with questions with a difficulty index > 0.75 (M = 55.7 [SD = 23.7]). Subset analyses by difficulty index follow.
An ANOVA was conducted to compare response time across item types in the lowest difficulty index group (n = 34). There was a significant difference among the item types, F(5, 28) = 13.5, p < .001. Tukey post hoc results revealed that calculation items required significantly more time (M = 199.5 [SD = 41.7]) than multiple-choice items (M = 52.1 [SD = 8.7]). The ANOVA comparing response time across the three exams was not statistically significant (p = .31).
An ANOVA was conducted to compare response time across item types in the middle difficulty index group (n = 464). There was a significant difference among the item types, F(6, 457) = 46.1, p < .001. Tukey post hoc results identified that calculation items required significantly more time (M = 134.2 [SD = 41.7]) than multiple-choice items (M = 56.8 [SD = 16.8]). The ANOVA comparing response time across the three exams was statistically significant, F(2, 360) = 11.6, p < .001; the mean time on the fundamentals exam items was 82.3 seconds (SD = 34.0), compared with 69.4 seconds (SD = 27.6) for the readiness exam items.
An ANOVA was conducted to compare response time across item types in the highest difficulty index group (n = 274). There was a significant difference among the item types, F(6, 267) = 25.2, p < .001. Cloze items required significantly more time (M = 92.4 [SD = 37.2]) than multiple-choice items (M = 46.5 [SD = 13.9]). The ANOVA comparing response time across the three exams was statistically significant, F(2, 271) = 3.3, p = .03; the mean time on the fundamentals exam items was 59.9 seconds (SD = 25.2), compared with 48.9 seconds (SD = 20.4) for the medical-surgical exam items.
An ANOVA was conducted to compare the point biserial across item types. The point biserial was calculated on dichotomously scored questions: (1) calculation; (2) ordered response; (3) hot spot; and (4) multiple choice. There was a significant difference in the point biserial among item types, F(4, 507) = 10.4, p < .001. Calculation questions had a higher point biserial (M = 0.31 [SD = .08]) than ordered response questions (M = 0.20 [SD = .07]). The ANOVA comparing the point biserial across exams was statistically significant, F(2, 509) = 10.5, p < .001; the mean point biserial on the fundamentals exam was 0.26 (SD = .08), compared with 0.21 (SD = .09) on the medical-surgical exam.
Finally, an ANOVA was conducted to compare the discrimination index across item types. There was a statistically significant difference in the discrimination index among item types, F(6, 766) = 20.4, p < .001. Tukey post hoc analysis indicated that calculation items (M = 0.29 [SD = 0.13]) differed from matrix items (M = 0.15 [SD = 0.15]). When comparing the discrimination index across exams, the ANOVA showed a statistically significant difference, F(2, 770) = 12.1, p < .001; the discrimination index of the fundamentals exam (M = 0.23 [SD = .11]) was higher than that of the readiness exam (M = 0.18 [SD = 0.9]).
Discussion
Nursing examinations measure knowledge learned throughout a nursing curriculum, including didactic and clinical instruction. The NCSBN recently added alternate-format items, such as those studied in this research, to the 2023 NGN test blueprint. The NGN CAT exam presents a unique set of items to each candidate, and there is limited information about the amount of time necessary to complete the newer exam items.
Analysis of exam and item types is important when assessing an examination. In education, learning objectives often drive evaluation through examinations. We evaluated item types from exams commonly used to evaluate content learning in fundamentals and medical-surgical courses. Because the readiness exam is used as students prepare to sit for the licensure examination, its items also were included in the research. Item analysis focused on time, difficulty index, discrimination index, and point biserial, which are commonly used by educators to assess exam items.
This study examined seven different item types, including NGN items. Calculation questions required the most time to complete across all examinations, averaging 110 seconds (approximately 2 minutes), followed by cloze (98 seconds) and matrix (90 seconds) item types. Drug calculation questions typically are taught and practiced early in a nursing curriculum. Therefore, faculty should allot more time when this material is first learned and tested, and also allow more time for students to practice and prepare for the NCLEX-RN. Nursing faculty who are developing examinations and assigning these item types will now be aware of how much time is required.
Three types of exams were evaluated (Table 1). As identified, the fundamentals exam items took the most time, averaging 74.8 seconds. Faculty now know that early learners need additional time to process information as they become familiar with the various item types. Students who took the medical-surgical examination averaged 60 seconds per item, which supports previous guidance regarding the 1-minute-per-item rule (Bridgeman et al., 2007; Renner & Renner, 1999; Rudolph et al., 2019; Svinicki & McKeachie, 2014).
Item difficulty also was examined in this study. We found that the more difficult a question was, the more time students needed on the item (Table 3). Difficult questions, those with an index < 0.25, required more time, with calculation questions leading this group. Conversely, items considered easier, with a difficulty index above 0.75, required less time. Even among these easier items, the item types differed statistically, with cloze items taking more time than the other types.
The results of this research contribute to nursing education by providing a time analysis of item types used in nursing examinations. The results indicate that students require more time on examinations administered early in a curriculum and that time should be allocated for more difficult item types. Nursing faculty now have research to support 1 minute for multiple-choice items and additional time for other item types.
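As a practical illustration only, the sketch below uses the mean response times reported in Table 2 as per-item estimates to budget total exam time from an item mix. The function, the example mix, and the 10% buffer are our assumptions, not recommendations from the study.

```python
# Mean response time (seconds) by item type, taken from Table 2 of this study.
MEAN_SECONDS = {
    "calculation": 110.6,
    "cloze": 97.8,
    "matrix": 90.2,
    "ordered response": 75.0,
    "hot spot": 72.8,
    "multiple choice": 52.7,
    "multiple response": 65.9,
}

def suggested_minutes(item_counts: dict[str, int], buffer_fraction: float = 0.10) -> float:
    """Rough exam time budget (minutes) for a given item mix.

    item_counts maps item type -> number of items; buffer_fraction adds slack
    on top of the summed per-item means (the 10% buffer is an assumption,
    not a recommendation from the study).
    """
    total_seconds = sum(MEAN_SECONDS[t] * n for t, n in item_counts.items())
    return round(total_seconds * (1 + buffer_fraction) / 60, 1)

# Example: a hypothetical 50-item exam with a mix of item types.
print(suggested_minutes({
    "multiple choice": 35,
    "multiple response": 8,
    "calculation": 3,
    "matrix": 2,
    "cloze": 2,
}))  # ≈ 56.5 minutes
```

For comparison, treating the same 50 items uniformly at 1 minute each would budget 50 minutes, which suggests that the mix of item types, not just the item count, should drive the allotment.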
Limitations
Limitations of the study included the lack of analysis of bowtie item types. This dataset was a secondary analysis, and no demographic variables, such as gender, age, or geographic location, were collected from the nursing students who took the exams. In addition, these exams were given only to nursing students and not to other preprofessional students.
Conclusion
Nursing educators now have evidence to support the time necessary for various item types on nursing examinations. The analysis of time on multiple-choice items supports the 1-minute-per-question practice, whereas alternate-format item types used on nursing examinations require more than 1 minute. Nursing faculty can allot examination time based on the item types used and the exam's placement in the nursing curriculum.
References
Bridge, P. D., Musial, J., Frank, R., Roe, T., & Sawilowsky, S. (2003). Measurement practices: Methods for developing content-valid student examinations. Medical Teacher, 25(4), 414–421. 10.1080/0142159031000100337 PMID: 12893554
Bridgeman, B., Laitusis, C. C., & Cline, F. (2007). Time requirements for the different item types proposed for use in the revised SAT®. ETS Research Report Series, 2007(2), i–21. 10.1002/j.2333-8504.2007.tb02077.x
Brothen, T. (2012). Time limits on tests. Teaching of Psychology, 39(4), 288–292. 10.1177/0098628312456630
Hess, B. J., Johnston, M. M., & Lipner, R. S. (2013). The impact of item format and examinee characteristics on response times. International Journal of Testing, 13(4), 295–313. 10.1080/15305058.2012.760098
Klein Entink, R. H., Kuhn, J.-T., Hornke, L. F., & Fox, J.-P. (2009). Evaluating cognitive theory: A joint modeling approach using responses and response times. Psychological Methods, 14(1), 54–75. 10.1037/a0014877 PMID: 19271848
Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2015). Handbook of test development (2nd ed.). Routledge. 10.4324/9780203102961
National Council of State Boards of Nursing. (2023). Next Generation NCLEX-RN® test plan. https://www.ncsbn.org/public-files/2023_RN_Test%20Plan_English_FINAL.pdf
Renner, C. H., & Renner, M. H. (1999). How to create a good exam. In B. Perlman, L. I. McCann, & S. H. McFadden (Eds.), Lessons learned: Practical advice for the teaching of psychology (pp. 43–47). American Psychological Society.
Rudolph, M. J., Daugherty, K. K., Ray, M. E., Shuford, V. P., Lebovitz, L., & DiVall, M. V. (2019). Best practices related to examination item construction and post-hoc review. American Journal of Pharmaceutical Education, 83(7), 7204. 10.5688/ajpe7204 PMID: 31619832
Svinicki, M. D., & McKeachie, W. J. (2014). McKeachie's teaching tips: Strategies, research, and theory for college and university teachers (14th ed.). Wadsworth, Cengage Learning.
Tsaousis, I., Sideridis, G. D., & Al-Sadaawi, A. (2018). An IRT-multiple indicators multiple causes (MIMIC) approach as a method of examining item response latency. Frontiers in Psychology, 9, 2177. 10.3389/fpsyg.2018.02177 PMID: 30542303
From Wolters Kluwer Health, New York, New York (VM, SC, OM); and Saint Louis University, St. Louis, Missouri (HI).
Disclosure: The authors have disclosed no potential conflicts of interest, financial or otherwise.
Copyright 2025, SLACK Incorporated
