Full Text

Turn on search term navigation

1. Introduction

As the public continues to embrace new and innovative technologies, it is paramount that the education sector maintains optimal innovation (Ahmadi, 2018). On 30 November 2022, OpenAI released the ChatGPT platform, a public-facing large language model (LLM) leveraging generative artificial intelligence (AI) (Lund & Wang, 2023). With technology progressing swiftly, artificial intelligence has now become a fundamental component of everyday life for many. The history of AI is a rich tapestry of scientific exploration, ambitious ideas, and remarkable achievements. The path to LLMs was full of AI milestones, including the development of early symbolic AI between the 1950s to 1980s, where researchers focused on rule-based expert systems (M. A. Alahmar, 2010; Wooldridge, 2021). The 1980s saw an increased interest in the development of artificial neural networks to simulate the human brain’s learning processes. Advances in knowledge representation (e.g., ontology-based AI (A. Alahmar et al., 2020)), machine learning and big data analytics (A. Alahmar & Benlamri, 2020), statistical methods, and computational power fueled a renaissance in AI during the 1990s and 2000s. The 2010s witnessed a paradigm shift with the widespread adoption of deep learning. Researchers started exploring the use of recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and other natural language processing (NLP) techniques for language-related tasks. NLP techniques provide computers and robots with the ability to process data encoded in natural language (A. Alahmar et al., 2022; Arbane et al., 2023). This period laid the groundwork for more sophisticated language models. The introduction of the transformer architecture in 2017 revolutionized natural language processing and allowed for the parallel processing of words in sequence, enabling the training of much larger language models (Ozdemir, 2023). Developed by OpenAI, the generative pre-trained transformer (GPT) model’s third iteration, GPT-3, boasts 175 billion parameters (Brown et al., 2020) and can perform a wide range of tasks, including language translation, question answering, and text completion (Abdullah et al., 2022). Educators will benefit from understanding the potential for a revolution in assessing and evaluating student work with the advent and large-scale adoption of this novel technology.

The most notable and widespread innovation in the AI landscape is the ChatGPT platform, which utilizes the GPT model to create a publicly accessible chatbot, providing a natural and human-like experience (OpenAI, 2022). The development of ChatGPT represents a significant milestone in the advancement of AI-powered chatbots. The ChatGPT platform is built on the foundation of the GPT model and allows users to engage in natural language conversations with the chatbot. The history of the GPT model includes GPT-1 (Radford et al., 2018) and GPT-2 (Radford et al., 2019), released in 2018 and 2019 respectively. These models serve as the foundation for GPT-3, and later ChatGPT versions built thereupon. As the technology continues to evolve, the potential applications for GPT-powered chatbots such as ChatGPT are vast, including significant applications in educational assessment.

LMQL (Language Model Query Language) is a programming language specifically designed for interacting with LLMs (Beurer-Kellner et al., 2023). In simple terms, LMQL provides a structured and efficient method of communicating with language models through well-defined queries. It offers a robust framework that allows users to write templated prompts and implement complex control flows when working with these models. In this research, several functions will be created in LMQL to achieve various educational outcomes. Alternative languages and frameworks may be used in lieu of LMQL (X. Liu et al., 2023), including ALEL (which is not an acronym). However, ALEL has had limited reviews within the existing literature.

Prompts can be classified as zero-shot, one-shot or few-shot. In zero-shot prompting, no examples are provided, and the LLM must rely on its pre-trained knowledge. In one-shot prompting, a single example is given to clarify the task for the LLM. In few-shot prompting, two or more examples (or shots) are provided, allowing the model to recognize patterns to deliver more accurate responses (Schulhoff, 2024). The initial prompts introduced in this paper are zero-shot, simply outlining the instructions for ChatGPT. This allows for minimized token usage, since large text may be a concern when running a prompt on many inputs, as is the case in educational assessment. If required, several examples of desired output can be appended to the prompt. Some prompts herein are extended to few-shot for reference. This will increase accuracy, but also increases computational requirements (Brown et al., 2020).

The purpose of this experience article is to develop and test a novel set of prompting methods through prompt engineering. The prompts are iteratively optimized to demonstrate the ability of large language models (LLMs) to effectively assist in educational assessment and enhance teaching techniques. The developed prompts and components used in this study are collectively referred to as the “AI Assess” system. “Prompt engineering is the means by which LLMs are programmed via prompts,” where a prompt is the means by which a user interacts with an LLM agent (White et al., 2023). Prompt engineering is a relatively new field, with one systematic survey of prompting methods showing less than five articles on the subject being published in 2018 and only 30 articles being published in 2021; a large spike that can perhaps be attributed to the popularity of ChatGPT (P. Liu et al., 2021).

Herein, the authors hope to provide a system for the automated assessment of short answer questions, provision of instantaneous feedback, generation of supplemental practice problems, and identification of weaknesses in the student population. The prompts leverage the GPT platform to assist with assessing student learning outcomes. We test the ability of AI Assess to automatically grade student responses on various types of course activity questions, including multiple choice and free-text questions. This will drastically reduce the workload for instructors and their teaching assistants. The reduction in grading time allows for an increase in time spent with students, which improves both academic and non-academic outcomes (Astin, 1984). Studies examining the effect of reduction in grading hours have shown a marked increase in the achievement of student learning outcomes (Schinske & Tanner, 2014).

The proposed system will directly improve student learning outcomes. It provides instantaneous feedback to guide the student’s further education through prompts of areas of weakness, and strategies to achieve improvement. Additionally, the system can provide students with practice questions, focusing on the individualized areas of weakness. The student can utilize the questions provided by AI Assess to further promote success on assessments. This has high potential for empirical science, technology, engineering, and mathematics (STEM) topics, but also has application to the social sciences.

The following definitions are used in this paper. An examination shall denote a final, cumulative test or a midterm in a given academic course. This shall be distinct from an evaluation, which is any tool used during an academic course to assess students’ knowledge and skill development. Thus, evaluations can include assignments and quizzes, along with examinations. An assessment shall denote a broader set of techniques used to determine the knowledge of an individual or group, where assessment is not necessarily in connection with an academic course. A constructed response question refers to short answer and essay questions.

The existing literature on the application of GPT technology to the education sector discusses many of the limitations (Zhai, 2023), which have been considered in the development of the proposed system. One study highlights ethical considerations of using ChatGPT in the classroom and the strong potential for assisting students with specific barriers to education, such as having English as a second language (Tlili et al., 2023). It is, however, urged that the education sector embrace the new AI technology instead of attempting to ban it (Tlili et al., 2023). This notion is supported by other studies, which similarly indicate that the technology should be effectively leveraged and not avoided (Baidoo-Anu & Owusu Ansah, 2023; Rudolph et al., 2023). Prior to the advent of ChatGPT, various machine learning techniques were applied specifically to the autograding of short answer questions (Busatta Galhardi & Duilio Brancher, 2018; Gomaa & Fahmy, 2012; Odeh Al-Awaida, 2019; Salam et al., 2022; Schneider et al., 2023; L. Zhang et al., 2019); however, these past studies heavily leveraged text similarity and other NLP techniques. This methodology has critical flaws: (1) student responses are phrased differently—notably, students who speak English as a second language or those who speak a different vernacular will be graded poorly; (2) students who include an excess of detail may be graded overly poorly; and (3) the methodology requires a teacher to input ground truth answers, which may not encapsulate all acceptable responses. A cluster-based approach has been attempted (Chakraborty et al., 2018) but has limitations alleviated by the advent of ChatGPT.

2. Materials and Methods

To evaluate the performance of the developed system, a range of test questions was used to assess the system’s ability to provide accurate feedback. The test questions were retrieved from exam question–answer datasets available in the literature (e.g., Afifi & Hussain, 2019; Kumar, 2023; Seradi, 2023). The collected questions were related to courses for undergraduate students in software engineering and computer science.

The tasks constituting AI Assess and their descriptions are shown in Figure 1. AI Assess comprises the following four (4) functions: automated grading, weakness identification (or isolation), feedback provision, and question generation. Each part that makes up AI Assess is unique and has unique testing methods. We will perform prompt engineering on ChatGPT, convert the engineered prompt to LMQL, and finally test the prompt according to the testing methodologies to be discussed. White et al. (2023) proposed a prompt pattern catalog to enhance prompt engineering with ChatGPT (see Table 1). In our prompt engineering processes, we determine the input elements and which prompt engineering patterns they fit, as compiled in Table 1 (Wei et al., 2022; White et al., 2023), allowing us to formalize the prompts in LMQL. Prompts will have multiple prompt patterns embedded.

The engineered prompts can be implemented into AI Assess through the ChatGPT API in the Python programming language. These attributes were selected due to the strength of ChatGPT in interpreting and responding to natural text (Abdullah et al., 2022).

When executing the LMQL programs, the following components are defined:

Type definitions for structured outputs.
Query templates for prompt construction.
Control flow logic for managing interactions.
Output validation constraints.

The complete LMQL implementation details are included in Appendix A. LMQL’s type system and query language naturally support the Meta Language Creation pattern from Table 1, allowing us to implement structured interactions with the language model while maintaining precise control over inputs and outputs.

2.1. Automated Grading

The automated grading function of AI Assess returns an automated grade on a percentage scale from 0 to 100 for a given student’s response to a constructed response question. The prompt of the function requires the following as input: syllabus information (i.e., the course title, description, and student education level (i.e., year of study)), the list of questions, and student responses for each question. Optionally, the prompt may take as input sample graded questions. This allows the instructor to choose either zero-shot prompting for simplicity or few-shot prompting for higher accuracy. In turn, ChatGPT will output a list of grades associated with each student’s response. Figure 2 shows an activity diagram for the automated grading prompt use case. Note that grading other types of questions (e.g., multiple choice or true/false questions) is achieved by providing ChatGPT with the correct answers and does not need sophisticated prompts. A prompt like “given the following questions with their correct answers, please mark the student responses” would work perfectly.

In the paragraphs below, we illustrate the prompt engineering process that starts from an initial prompt and then proceeds in optimizing it toward achieving acceptable grading results. The first prompt iteration that we tested for the automated grading of constructed response questions was as follows:

Given the [COURSE LEVEL] course [COURSE TITLE] which covers [COURSE DESCRIPTION]. Topics include [TOPICS]. I want you to grade the student’s answers to the question(s) given in the following format:

Question: {Question} Student Answer: {Answer} Grade: {Your grade from 0% to 100 %}

Here is your input data:

Question: [QUESTION] Student Answer: [STUDENT ANSWER]

This initial prompt failed to produce a regularly formatted output. To address this issue, the input format was specified separately from the output format, as shown in the second prompt below. The major differences between iterations have been typeset in bold to make them easier to identify.

Question: {Question} Student Answer: {Answer} Grade: {Your grade from 0% to 100%}

Please output the grades in the format of:

Question: {Question} Student Answer: {Answer} Grade: {Your grade from 0% to 100%}

Here is your input data:

Question: [QUESTION] Student Answer: [STUDENT ANSWER]

After testing this prompt, the grades were formatted consistently; however, ChatGPT adds additional feedback in unexpected locations. To address this, we improved the prompt by providing the model with more precise instructions regarding feedback, as shown in the third prompt below.

Question: {Question} Student Answer: {Answer} Grade: {Your grade from 0% to 100%}

Please output the grades in the format of:

Question: {Question} Student Answer: {Answer} Grade: {Your grade from 0% to 100%} Reasoning: {Why the grade was assigned}

Here is your input data: Question:

[QUESTION] Student Answer: [STUDENT ANSWER]

This is the final engineered prompt for the automated grading use case considered in this section. In this prompt, the issues mentioned above have been addressed. The instructor may decide whether to include the feedback (reasoning) for the students or to keep a record of the reasoning for referral and validation by the instructor. Listing 1 shows the equivalent prompt (called ag prompt) using the LMQL programming language. The ag prompt function acts as an autograder, leveraging the above prompt.

Listing 1. The ag prompt function, representing the automated grading use case, in LMQL.

from dataclasses import dataclassimport lmql@dataclassclass CourseInfo: level: str title: str description: str topics: str@dataclassclass GradingResult: question: str answer: str grade: int reasoning: str@lmql.queryasync def gradesubmission(courseinfo: CourseInfo, question: str, studentanswer: str) -> GradingResult: """ LMQL query for automated grading with structured output and validation constraints """ TEMPLATE = f"""Given the {courseinfo.level} course {courseinfo.title} which covers {courseinfo.description}. Topics include {courseinfo.topics}. Grade the following student answer: Question: {question} Student Answer: {studentanswer} """ ”’Generate grade with validation”’ "[GRADE] where GRADE in range(0, 100)" ”’Generate reasoning with minimum length requirement”’ "Reasoning: [REASONING] where len(REASONING) > 20" return GradingResult( question=question, answer=studentanswer, grade=int(GRADE), reasoning=REASONING )

This LMQL program requires the inputs COURSE LEVEL, COURSE TITLE, COURSE DESCRIPTION, TOPICS, QUESTIONS, and STUDENT ANSWERS. An example of the execution of the LMQL ag program is shown in Listing 2.

Listing 2. A sample execution of the ag LMQL program.

courseinfo = CourseInfo( level="undergraduate", title="Software Engineering", description="Introduction to software engineering concepts", topics="software processes, requirements, testing")result = await gradesubmission( courseinfo, question="What is regression testing?", studentanswer="Testing to ensure new changes don’t break existing functionality")

If desired, few-shot prompting may be leveraged to improve the accuracy of grades. Few-shot prompting allows the instructor to provide sample grades. ChatGPT will consider the instructor’s grades in its grading process. The following can be appended to the TEMPLATE section of the ag program in LMQL:

Below are some sample grades I have provided to guide you.

Student Answer: [SAMPLE STUDENT ANSWER]

Grade Assigned: [SAMPLE GRADE]

An execution example of the few-shot LMQL program is shown in Listing 3.

Listing 3. A sample execution of the ag program in LMQL.

courseinfo = CourseInfo( level="undergraduate", title="Software Engineering", description="Introducing software engineering concepts to students of software engineering and computer science. Major concepts include software requirements, analysis, design, and implementation.", topics="Introduction to software engineering process, object-oriented techniques, software life cycle, software process models, principles of software engineering practice, UML syntax and semantics, software project management, software requirements-analysis-design-implementation, software testing and maintenance, software engineering design models, and graphics in graphical user interfaces.")samplegrades = [ SampleGrade( answer="""Grey box testing is a combination of black-box testing and white-box testing. In grey box testing, test cases are derived from the knowledge of source code as well as specifications. An example is testing an application using inputs to get outputs and then comparing the outputs with expected specification outputs. And again, the same functionalities are tested using the source code, which can be branch testing, loop testing, etc.""", grade=95 )]result = await gradesubmission( courseinfo=courseinfo, question="What is grey-box testing? Give an example.", studentanswer="""In grey-box testing, we test the source code to isolate errors or bugs. An example is when we go through and check the Java code of a school registration system.""", samplegrades=samplegrades)

2.2. Weakness Identification

The weakness identification (or weakness isolation) prompt identifies and returns a specified number of major gaps in students’ understanding of a course topic based on their performance on an evaluation. The prompt requires the following as input: syllabus information (i.e., the course title, description, and student education level), a list of topics to be evaluated, the number of weaknesses to be identified, a test, and a set of student answers. ChatGPT is asked to output a list of identified weaknesses of the students. Figure 3 shows an activity diagram for the weakness identification prompt use case.

The first prompt tested for weakness identification in the prompt engineering process was as follows:

[TEST]

[STUDENT ANSWERS]

This prompt failed to find commonalities between questions; it instead yielded a description of each question. It was hypothesized that ChatGPT is noting that at least one student made a mistake on each question. To address this issue, the following prompt constrains ChatGPT to yield a specified number of areas of weakness:

I am going to give you a test, and then I will give you the answers submitted by the students who took the test. I want you to identify their areas of weakness. Particularly, I want you to tell me any major areas of concern that the majority of students have as a weakness. I want you to limit yourself only to the major areas, shared by the majority of the students. I do not want you to output any other information, only a bulleted list of the categorical topics that the students are struggling with. I want you to keep your list short. I want each entry in the list to be an area that the students are struggling with. I only want you to list the top [NUMBER OF WEAKNESSES] areas of weakness.

[TEST]

[STUDENT ANSWERS]

This prompt was an improvement, but ChatGPT attempted to provide several paragraphs of response for each area of weakness, embedded in sub-bullets with several topics embedded therein. It is assumed that ChatGPT is attempting to provide more context, following the same general principles as chain-of-thought prompting (Wei et al., 2022). In order to restrict the responses of ChatGPT to singular topics, the following prompt was utilized:

I am going to give you a test, and then I will give you the answers submitted by the students who took the test. I want you to identify their areas of weakness. Particularly, I want you to tell me any major areas of concern that the majority of students have as a weakness. I want you to limit yourself only to the major areas, shared by the majority of the students. I do not want you to output any other information, only a bulleted list of the categorical topics that the students are struggling with. I want you to keep your list short. I want each entry in the list to be an area that the students are struggling with. I only want you to list the top [NUMBER OF WEAKNESSES] areas of weakness, with no sub-bullets or additional information with each bullet. I only want the names of the categorical weaknesses themselves.

[TEST]

[STUDENT ANSWERS]

This prompt generated the results as expected; however, it did not provide topics strictly within the test topics. The following iteration includes syllabus information and test topics to better guide ChatGPT’s identification of weaknesses:

Given the course [COURSE LEVEL] [COURSE TITLE] which covers [COURSE DESCRIPTION]. Test topics include [TOPICS]. I am going to give you a test, and then I will give you the answers submitted by the students who took the test. I want you to identify their areas of weakness. Particularly, I want you to tell me any major areas of concern that the majority of students have as a weakness. I want you to limit yourself only to the major areas, shared by the majority of the students. I do not want you to output any other information, only a numbered list of the top [NUMBER OF WEAKNESSES] categorical topics that the students are struggling with. I want you to keep your list short. I want each entry in the list to be an area that the students are struggling with.

[TEST]

[STUDENT ANSWERS]

The added context in this prompt iteration restricted ChatGPT’s areas of weakness to be within the test topics ([TOPICS]). In order to provide additional feedback to the instructor, the prompt was modified as follows to request specifically for references to the key questions that ChatGPT used to support its claim:

Given the course [COURSE LEVEL] [COURSE TITLE] which covers [COURSE DESCRIPTION]. Test topics include [TOPICS]. I am going to give you a test, and then I will give you the answers submitted by the students who took the test. I want you to identify their areas of weakness. Particularly, I want you to tell me any major areas of concern that the majority of students have as a weakness. I want you to limit yourself only to the major areas, shared by the majority of the students. I want you to output a numbered list of the top [NUMBER OF WEAKNESSES] categorical topics that the students are struggling with. I want you to keep your list brief. Each entry should be an area that the students are struggling with. After you name the categorical weakness, please reference the question and common options chosen.

[TEST]

[STUDENT ANSWERS]

This prompt was accepted as the final prompt for weakness identification, which provides a list of an instructor-specified number of areas of weakness within their cohort of students, along with references supportive thereof. In LQML, the wi prompt function provides a list of weaknesses identified from a set of test questions, leveraging the above prompt. Listing 6 is the equivalent prompt written in LQML.

Listing 4. The wi prompt function, representing the weakness identification use case, in LMQL.

from dataclasses import dataclassfrom typing import List, Dict @dataclassclass CourseInfo: level: str title: str description: str test_topics: List[str] @dataclassclass WeaknessAnalysis: topic: str evidence: List[str] affected_questions: List[int] common_mistakes: Dict[int, List[str]] @lmql.queryasync def identify_weaknesses( course_info: CourseInfo, test_questions: List[str], student_answers: List[Dict[int, str]], num_weaknesses: int) -> List[WeaknessAnalysis]: """LMQL query for identifying common areas of weakness in student responses""" TEMPLATE = f"""Given the {course_info.level} course {course_info.title} which covers {course_info.description}. Test topics include {’, ’.join(course_info.test_topics)}. Analyze the following test responses to identify the top {num_weaknesses} areas where students are struggling the most. Test questions and student responses: {format_test_data(test_questions, student_answers)} """ """Generate analysis with validation""" for i in range(num_weaknesses): f"Topic {i+1}: [TOPIC] where TOPIC in course_info.test_topics" f"Evidence: [EVIDENCE] where len(EVIDENCE) > 50" f"Affected Questions: [QUESTIONS] where validate_questions(QUESTIONS)" f"Common Mistakes: [MISTAKES] where validate_mistakes(MISTAKES)" return parse_weaknesses(TOPIC, EVIDENCE, QUESTIONS, MISTAKES)

This LMQL code requires the inputs COURSE_LEVEL, COURSE_TITLE, COURSE_DESCRIPTION, TEST_TOPICS, NUMBER_OF_WEAKNESSES, TEST, and STUDENT_ANSWERS. An example of the execution of the LMQL wi program is shown in Listing 5. This sample prompt uses three students with three questions each; however, in actual experiments, the answers of twenty-five students to twenty-two questions were used.

Listing 5. A sample execution of the wi prompt function in LMQL.

course_info = CourseInfo( level="undergraduate", title="Software Engineering", description="Introducing software engineering concepts to students of computer science and software engineering.", test_topics=["Project Management", "Requirements Engineering", "Software Testing", "Software Development Methodologies", "Graphical User Interface", "Graphics"]) test_questions = [ """1. The system should be designed so it can be extended in the future to handle membership for children" is an aspect of a fitness club membership system that can be classified as: a. non-functional requirement b. functional requirement c. ethical requirement d. compatibility requirement""" """2. In ____ the app is tested under realistic conditions on actual user devices in a variety of networking environments around the globe. a. realistic testing b. actual device testing c. performance testing d. testing in the wild""" """3. In device compatibility testing, testers verify that the: a. Mobile app works correctly on all required operating systems. b. Mobile app works correctly on all required hardware. c. Mobile app works correctly on all required hardware and software combinations. d. Mobile app works correctly on all required software""" ] student_answers = { 1: {"student_answers": ["d", "d", "a"]}, 2: {"student_answers": ["d", "a", "d"]}, 3: {"student_answers": ["a", "c", "c"]}} weaknesses = await identify_weaknesses( course_info=course_info, test_questions=test_questions, student_answers=student_answers, num_weaknesses=1)

2.3. Feedback Provision

The feedback provision prompt returns feedback to students on their answers. The prompt requires the following as input: syllabus information (i.e., the course title, description, and student education level), course prerequisite, a list of questions, and a list of student responses for each question. Optionally, the prompt may take as input a sample of feedback provided by the instructor to improve the style and content of the feedback generated by ChatGPT (again allowing for a choice between zero-shot or few-shot prompting). Figure 4 shows an activity diagram for the feedback provision prompt use case.

The following raw text prompt was used for feedback provision in ChatGPT:

I want you to provide feedback to a student on the following submission for a [QUESTION TYPE] question. Please keep your response succinct and precise. The questions are taken from the [COURSE LEVEL] course [COURSE TITLE]. This course has the following description: “[COURSE DESCRIPTION]”. Students in this course have the following prerequisites: [PREREQUISITES].

Question: [QUESTIONS]

Answer: [STUDENT ANSWERS]

In LMQL, we implemented a structured feedback generation program with explicit constraints for providing student feedback. Listing 6 shows the implementation.

Listing 6. The feedback provision implementation in LMQL.

from dataclasses import dataclassfrom typing import List, Optional @dataclassclass CourseInfo: level: str title: str description: str prerequisites: str @dataclassclass FeedbackExample: question: str answer: str feedback: str @lmql.queryasync def generate_feedback( course_info: CourseInfo, question: str, student_answer: str, question_type: str, examples: Optional[List[FeedbackExample]] = None) -> str: """LMQL query for generating structured feedback""" TEMPLATE = f"""Provide feedback for a student {submission in} {course_info.level} {course_info.title}. Course description: {course_info.description} Prerequisites: {course_info.prerequisites} Question Type: {question_type} Question: {question} Student Answer: {student_answer}""" if examples: TEMPLATE += "\n\nReference feedback examples:\n" for example in examples: TEMPLATE += f""" Question: {example.question} Answer: {example.answer} Feedback: {example.feedback}\n""" """Generate structured feedback with validation""" "Strengths: [STRENGTHS] where len(STRENGTHS) > 20" "Areas for Improvement: [IMPROVEMENTS] where len(IMPROVEMENTS) > 20" "Suggestions: [SUGGESTIONS] where len(SUGGESTIONS) > 20" return format_feedback(STRENGTHS, IMPROVEMENTS, SUGGESTIONS)

Example usage of the feedback generation implementation (Listing 7):

Listing 7. Example execution of the LMQL feedback generation.

course_info = CourseInfo( level="undergraduate", title="Software Engineering", description="Introducing software engineering concepts to students of computer science and software engineering. Major concepts include software requirements, analysis, design, and implementation.", prerequisites="2nd year standing in software/computer science majors") question = "Explain briefly ’Scrum’ as an agile software development methodology."student_answer = """Scrum methodology is used to make incremental changes in asoftware process to achieve final goal. A scrum meeting is held every day for15 mins. The scrum meetings consists of scrum master, team members, client.The team members pitch their idea, then also mention what they will do by end ofday and also mention what they did since previous meeting.""" feedback = await generate_feedback( course_info=course_info, question=question, student_answer=student_answer, question_type="short answer")

For few-shot learning with example feedback, we can provide sample feedback through the optional examples parameter:

examples = [ FeedbackExample( question="[SAMPLE QUESTION]", answer="[SAMPLE ANSWER]", feedback="[SAMPLE FEEDBACK]" )] feedback = await generate_feedback( course_info=course_info, question=question, student_answer=student_answer, question_type="short answer", examples=examples)

2.4. Question Generation

The question generation prompt returns a list of questions based on the syllabus and topic information. The prompt requires the following as input: syllabus information (e.g., the course title and description), the number of desired questions for each question type (i.e., multiple choice, fill-in-the-blank, true-or-false, and free-text questions), and topic information (i.e., topic title, list of key concepts and topic overview). This use case may benefit from optional few-shot prompting, where instructors can provide sample questions that ChatGPT can leverage to guide the questions it generates. Question generation may also use zero-shot prompting. Figure 5 shows an activity diagram for the question generation prompt use case.

The following raw text prompt was used for question generation in ChatGPT:

I want you to create an assessment for undergraduate students in the course [COURSE TITLE]. The assessment should include [NUMBER MC] multiple choice questions, [NUMBER FITB] fill-in-the-blank questions, [NUMBER TF] true-or-false questions, [NUMBER SA] short answer questions, [NUMBER ESSAY] essay questions. The assessment should cover the topic titled [TOPIC TITLE], which has the following key concepts: [KEY CONCEPTS]. Constrain the multiple choice questions to answers that look correct, but only one is correct, and short answer questions that have some subjectivity to them. The following is an overview of the topic:

[TOPIC OVERVIEW]

In LMQL, we implemented a structured question generation program that produces assessments based on course topics and requirements. Listing 8 shows the implementation.

Listing 8. The question generation implementation in LMQL.

from dataclasses import dataclassfrom typing import List, Dict @dataclassclass AssessmentConfig: course_level: str course_title: str num_multiple_choice: int num_fill_blank: int num_true_false: int num_short_answer: int num_essay: int topic_title: str key_concepts: List[str] topic_overview: str @dataclassclass Question: question_type: str text: str options: List[str] = None correct_answer: str = None @lmql.queryasync def generate_assessment(config: AssessmentConfig) -> List[Question]: """LMQL query for generating assessment questions""" TEMPLATE = f"""Create an assessment for {config.course_level} students in {config.course_title} covering {config.topic_title}. Key concepts: {’, ’.join(config.key_concepts)} Topic overview: {config.topic_overview} Generate questions that test understanding of these concepts. """ questions = [] """Generate multiple choice questions""" for i in range(config.num_multiple_choice): "Question: [MC_QUESTION]" "Options: [MC_OPTIONS] where validate_mc_options(MC_OPTIONS)" "Correct Answer: [MC_ANSWER] where MC_ANSWER in parse_options(MC_OPTIONS)" questions.append(Question("multiple_choice", MC_QUESTION, parse_options(MC_OPTIONS), MC_ANSWER)) """Generate fill-in-the-blank questions""" for i in range(config.num_fill_blank): "Question: [FITB_QUESTION] where contains_blank(FITB_QUESTION)" "Answer: [FITB_ANSWER]" questions.append(Question("fill_blank", FITB_QUESTION, correct_answer=FITB_ANSWER)) """Generate true/false questions""" for i in range(config.num_true_false): "Statement: [TF_STATEMENT]" "Correct Answer: [TF_ANSWER] where TF_ANSWER in [’True’, ’False’]" questions.append(Question("true_false", TF_STATEMENT, correct_answer=TF_ANSWER)) """Generate short answer questions""" for i in range(config.num_short_answer): "Question: [SA_QUESTION] where requires_explanation(SA_QUESTION)" questions.append(Question("short_answer", SA_QUESTION)) return questions

This LMQL implementation requires the same inputs as the original version. Below (Listing 9) is an example execution.

Listing 9. Example execution of the LMQL question generation.

config = AssessmentConfig( course_level="undergraduate", course_title="Software Engineering", num_multiple_choice=5, num_fill_blank=2, num_true_false=3, num_short_answer=2, num_essay=0, topic_title="Project Management Concepts", key_concepts=["critical practices", "common process framework", "coordination", "problem decomposition", "process decomposition", "scope", "software team", "team leader", "team structure", "team toxicity", "W5HH principle"], topic_overview="""Project management is essential in the development of computer-based systems and products. It involves planning, monitoring, and controlling people, processes, and events during the software evolution process…""") questions = await generate_assessment(config)

3. Results

3.1. Automated Grading

For grading the constructed response questions, two experiments were conducted to grade the answers of 25 students by the execution of the ag prompt function as shown in Listings 2 and 3. In the first experiment (Experiment 1), no sample grading was given to ChatGPT, whereas in the second experiment (Experiment 2), 10 sample grading were provided to train ChatGPT on the logic of human grading. The automated grades were compared with the instructor’s grades to test the success of ChatGPT’s grading based on the engineered prompts developed in this study. Figure 6 shows the scatter plots of grades for both experiments.

It can be observed that the accuracy of ChatGPT’s grading was initially good and further improved through training the system by providing human grading. Therefore, it is critical to take a human-in-the-loop (HITL) approach by keeping a human instructor to supervise the grading process. Grading other types of questions (e.g., multiple choice or true/false questions) was straightforward and ChatGPT graded the answers correctly based on the answers given by the human instructor.

3.2. Weakness Identification

The wi prompt function execution as shown in Listing 5 was performed, which yielded the following resulting output from ChatGPT:

Based on the student answers, the top categorical topic that the students are struggling with is:

1. Requirements Engineering:

- Question 4: Majority of the students chose options B, C, and D instead of the correct option A (non-functional requirement).

- Question 5: Majority of the students chose options A instead of the correct option B (false).

- Question 6: Majority of the students chose options A, B, and C instead of the correct option D (Requirements elicitation, requirements analysis, requirements validation, and requirements management).

- Question 7: Majority of the students chose options A, C, and D instead of the correct option B (Cost and delivery time).

- Question 8: Majority of the students chose options A, B, and C instead of the correct option D (Functional, quality, platform, and process).

In this example, ChatGPT correctly identified Requirements Engineering as the top area of weakness, where students’ performance was the lowest in the five questions related to this area (Questions 4–8). These five questions and their correct answers are available in Appendix B.

Weakness identification is incredibly useful when implemented with an iterative style of teaching. Simply, weakness identification allows for consistent and unbiased detection of weaknesses. It should be used in tandem with observations from human instructors.

3.3. Feedback Provision

The fp prompt function execution was performed as shown in Listing 7. This yielded the following resultant output from ChatGPT:

Your answer provides a basic understanding of Scrum as an agile software development methodology. However, there are a few areas that could be improved:

1. While you mentioned that Scrum is used to make incremental changes in a software process, it would be helpful to elaborate on how Scrum achieves this and its core principles such as iterative development, self-organizing teams, and frequent feedback loops.

2. The daily scrum meetings, also known as stand-ups, typically involve team members briefly answering three (3) questions: What did you do since the last meeting? What are you planning to do next? Are there any obstacles or issues you’re facing? It’s important to include these aspects in your explanation.

3. Additionally, you mentioned the presence of a scrum master, team members, and the client in the meetings. While the involvement of stakeholders can vary, it’s worth emphasizing that the core Scrum team consists of a product owner, development team, and scrum master.

Remember to provide more specific details and examples to enhance your answer. Consider referring to the Scrum Guide or other authoritative sources on Scrum for a comprehensive understanding of this methodology.

Keep up the good work and strive for more depth and clarity in your explanation!

In the feedback above it can be seen that ChatGPT is able to note the elements of Scrum that the students identified correctly, while also providing additional details so that the student still has room to improve on their answer for the future. The original answer given by the student was a good answer and ChatGPT identified this while showing the student how to give a great answer.

3.4. Question Generation

The qg prompt function execution as shown in Listing 9 was performed, which yielded the following resultant output from ChatGPT:

Assessment for Undergraduate students in the course “Software Engineering” on the topic “Project Management Concepts”:

Multiple Choice Questions (5):

1. Which of the following best defines the W5HH principle? (a) A framework for problem decomposition (b) A technique for coordinating software teams (c) A guideline for effective project planning (d) A principle for assessing project risks

2. The process of breaking down a complex problem into smaller, more manageable sub-problems is known as: (a) Scope management (b) Process decomposition (c) Team structure (d) Coordination

3. Which of the following is a critical practice in project management? (a) Team toxicity (b) Common process framework (c) Software team (d) Team leader

4. The primary goal of a project manager is to: (a) Manage day-to-day tasks of software engineers (b) Oversee teams of software engineers (c) Facilitate communication between business and software professionals (d) Define the software development process

5. The key concept that refers to defining the work that needs to be accomplished is: (a) Critical practices (b) Software team (c) Scope (d) W5HH principle

Fill-in-the-Blank Questions (2):

1. ________ involves planning, monitoring, and controlling people, processes, and events during the software evolution process.

2. Effective organization of people, clear communication with customers, appropriate process selection, and comprehensive project planning are vital for ________.

True or False Questions (3):

1. The four P’s in project management are people, product, process, and project. (True/False)

2. A project plan includes tasks, people involved, and mechanisms for risk assessment, change control, and quality evaluation. (True/False)

3. The ultimate validation of a project plan is delivering a high-quality product within the allocated time and budget. (True/False)

Short Answer Questions (2):

1. Explain the role of a team leader in software project management.

2. Why is project management essential for software projects?

Note: The assessment does not include any essay questions, as requested.

Please note that the assessment questions are generated based on the information provided and may not cover all the key concepts in equal depth. Feel free to modify or expand the assessment as needed.

The above output shows that ChatGPT can correctly grab information given by the user and use that information to construct multiple choice, fill-in-the-blanks and short answer questions. We did notice some issues where concepts with multiple definitions/acronyms may be confused with one another or were incomplete; however, this can be corrected by defining those concepts to ChatGPT in the TOPIC OVERVIEW input. An example of an incorrect output given by the original prompt in Listing 8 can be seen below:

The _____ principle emphasizes asking and answering the questions: Who, What, When, Where, Why, and How.

As seen above, ChatGPT correctly identifies the W5 from the W5HH principle but is missing one of the two H’s present in the acronym. This is a complicated acronym, as both the H’s stand for ’How’ (how a job will be performed and how much it will cost); however, we can improve ChatGPT’s ability to create questions about this concept by adding a definition to the TOPIC OVERVIEW. For example, we add "The W5HH principle stands for who, what, where, when, why, how and how much" to the end of the TOPIC OVERVIEW text. The corrected result given by ChatGPT in our next running of the prompt can be seen below:

The W5HH principle includes the questions: Who, What, Where, When, Why, How, and ________.

4. Discussion

4.1. Engineering Effective Prompts

The engineering of effective prompts poses a barrier for non-experts to make use of LLMs (Zamfirescu-Pereira et al., 2023). Educators can leverage the generalized techniques used in this study to develop their own prompts, tailored to their specific needs, through iteration. A first iteration of a prompt can be very rough and should be expected to produce inadequate results. On experimenting with each prompt, a specific issue should be identified and solved in the subsequent iteration. In order to solve issues that arise, the prompt patterns in Table 1 can be applied. Ultimately, it is recommended to convert the satisfactory prompt to an LMQL program. This can be performed by converting the tone of voice to be directive instead of conversational, and placing it within the prompt function body in dc and assigning it a unique name. It is recommended that all prompts be finalized in LMQL to both leverage the Meta Language Creation prompt pattern (White et al., 2023) and to improve output consistency and reliability. Prompt engineering is a creative skill to be developed by the prompt engineer over time and experience (Oppenlaender et al., 2023).

The prompts developed in this study provide effective templates, but their long-term usability and utility depend on instructor ability to refine and adapt them to their individualized teaching contexts. As LLM technology continues to advance, prompts that prove effective with current models may require updates to maintain their efficacy. To address this challenge, educational institutions should implement comprehensive prompt management systems. Such systems would enable instructors to customize default prompts with course-specific terminology and requirements while facilitating collaboration on prompt improvements among colleagues. This structured approach to prompt refinement ensures that AI Assess maintains its effectiveness as both technology and pedagogical requirements change. Further, this approach allows for institutions to offer a streamlined prompting strategy, extending beyond AI Assess into other uses of LLMs.

4.2. Benefits over Traditional Assessment Techniques

Automated educational assessment is a growing area of interest with the advent of ChatGPT and other LLMs given the potential to improve the efficiency and accuracy of grading student work. The proposed system leveraging generative AI is a promising development in the field due to the ability to grade questions of varying formats. The ability to grade free-text questions is particularly noteworthy, as this has traditionally been time-consuming for instructors and teaching assistants. The system’s accuracy can be attributed to the advanced NLP capabilities of ChatGPT, which enables the assessment of student responses comparable to a human grader.

In addition to grading student work, the proposed system can generate relevant practice problems to improve content retention, provide instantaneous feedback to students, and identify weaknesses within a student cohort. This provides a substantial advantage over traditional assessment techniques, as students gain the ability to receive nearly instantaneous feedback on their performance. This helps them to identify areas where they need to focus their learning efforts. Immediate feedback boosts student motivation and confidence, supports an iterative approach to learning, and allows students to focus on targeted areas of weakness.

The system has several benefits that make it a substantial development for educational assessment. First, it reduces instructor workload through the automated grading of student work. Instructors must still approve the system’s decisions; however, the workload remains considerably reduced. By minimizing grading time, educators can focus their efforts on other aspects of teaching, such as lesson planning, course design, and student support. Second, the system provides instantaneous feedback to students, which will prove transformative to the improved acquisition of student learning outcomes. Students can use the feedback to identify areas of weakness and can receive immediate guidance on strategies for improvement. This proves particularly beneficial for students who require additional support.

In addition to reducing instructor workload and improving student understanding, the proposed system is able to generate relevant practice problems to improve content retention. The system also improves the efficacy of student study time by providing additional questions to practice and apply techniques to students. This is particularly important for students who struggle with certain concepts or require additional practice to master the material. At present, learning outcomes are standardized for an entire cohort. However, individualized learning outcomes allow for a tailored learning experience, where student background, interests, and overall requirements are considered. This is particularly relevant for courses taken by students from diverse disciplinary backgrounds.

Figure 7 provides several ChatGPT responses to a question containing potentially false premises. This demonstrates ChatGPT’s ability to provide factual information while discerning the verity of premises provided through a prompt.

Students’ perceptions of the use of ChatGPT by universities and colleges are critical to understanding its role in academic settings. A recent study, titled “Student Perceptions of ChatGPT Use in a College Essay Assignment: Implications for Learning, Grading, and Trust in Artificial Intelligence” (Tossell et al., 2024), found that students generally support the use of ChatGPT for grading, provided there is appropriate oversight. Specifically, students expressed a preference for instructors to use ChatGPT as a tool to assist in grading, but they did not trust the AI to perform the grading independently (Tossell et al., 2024). These findings align with our recommendation of adopting a human-in-the-loop (HITL) approach, where human instructors remain actively involved in supervising the grading process.

4.3. Risks of Leveraging LLMs

Although there are many benefits to leveraging LLMs as discussed herein, there are several key risks which should be considered and understood by educators and administrators prior to the adoption of any LLM technologies, including AI Assess. Two major considerations are the risks of biases and discrimination by the LLM, and the risk of hallucination. These two risks will be discussed in the following subsections.

4.3.1. Bias Risk

LLMs are trained on textual data produced by humans; as such, datasets are prone to the bias inherent in humans. There are several factors which can contribute to the bias of an LLM, including the data used to train the model, biases inherent in the selected algorithm, biases inherent in the labeling process of semi-supervised learning, and biases introduced through policy decisions (Ferrara, 2023). The ability to generalize data possessed by LLMs is critical for many of the tasks presented in this paper; however, concerns have been raised of the risk of overgeneralization. Further, since LLMs are trained on human data, they are susceptible to human biases. Since outputs are based on patterns in training data, LLMs may propagate biases and stereotypes, which may lead to the marginalization of certain groups.

There are also several types of biases which may arise, which are categorized as demographic, cultural, linguistic, temporal, confirmation, and ideological biases (Ferrara, 2023). Demographic biases arise when training data do not represent the breadth of human experiences, which may be biased towards certain groups. This is often not due to conscious bias, but systemic bias in the collection of data. Cultural biases arise when LLMs perpetuate cultural stereotypes, which may exacerbate cultural prejudices. As such, AI Assess and other LLM technologies may not be suitable for courses dealing with sensitive themes such as prejudice, discrimination, and equity. As LLMs such as ChatGPT are used at a large scale, developers are identifying and correcting demographic and cultural biases. In future iterations of ChatGPT, such biases may be reduced or eliminated, at which time it may be appropriate for AI Assess to be leveraged by such at-risk courses.

A prime example of bias, particularly for educators, is the tendency for ChatGPT to consider works by foreign students to be AI-generated text (Liang et al., 2023). This means that ChatGPT should not be used on its own as a plagiarism checker, and also highlights that ChatGPT may have other latent biases. The unknown nature of ChatGPT’s biases and the fact that OpenAI is constantly working on improving the model means that educators should be incredibly skeptical of the results from ChatGPT. They are helpful in giving a starting point, formatting work, and may even be helpful in a large capacity with grading when given context to work within. However, because ChatGPT output is designed to be convincing and can be correct, the results should be examined with extra scrutiny.

All forms of bias can be mitigated to some extent through human-in-the-loop (HITL) approaches.

4.3.2. Hallucination Risk

ChatGPT and other LLMs will occasionally hallucinate, producing inaccurate outputs after extended use of the context manager. As a conversation continues within an instance of an LLM, it will slowly lose the context within the conversation, and responses will become progressively less accurate. Unfortunately, hallucination is caused by technical limitations of the emerging technology. These hallucinations must be properly dealt with; otherwise, the hallucination will snowball (M. Zhang et al., 2023). Several models have been proposed to mitigate the risk of hallucination (Lee, 2023; T. Liu et al., 2022).

Instead of attempting to reduce the hallucination risk through further prompt engineering, in the case of educational assessment, it is advised that for each educational assessment task (e.g., for each student submission), a new instance of the LLM is created with a cleared context. This allows for the LLM to be focused on the specific student submission and its assigned tasks.

It should be noted that LLMs are designed to generate convincing text; this means that hallucinations can be very convincing and any responses from LLMs should be carefully examined.

4.3.3. Transparency and Audit System

To enhance the trustworthiness and accountability of the system, the authors propose implementing a comprehensive transparency system. This system would maintain detailed output logs recording AI-generated responses and prompt versions. This allows for instructor reviews while providing students with explanations for AI-generated assessments. A human oversight protocol is recommended, requiring instructor approval before finalizing AI-generated grades, with automated flagging of potentially problematic outputs. The system should consider regular analyses of grading patterns across student demographics. This will allow the system to address any systematic biases. Quality assurance measures must include periodic sampling for AI outputs for human verification. It is also critical to ensure that incidents of hallucination are tracked. Additionally, the prompt version should be tracked to document engineering improvements and modifications. This proposed audit system would ensure accountability while maintaining trust between users at all levels. This system helps to hedge against any potential biases and hallucinations which may arise, as discussed previously.

The proposed transparency system should incorporate a dedicated prompt repository that serves multiple essential functions in maintaining system effectiveness. Within this repository, instructors can document their modifications to default prompts and articulate the pedagogical reasoning behind these changes. Through this collaborative approach to prompt management, institutions can create and maintain a knowledge base of optimal prompt engineering practices while ensuring usability and a consistent output across varying contexts.

4.4. Edge Case Handling

While AI Assess as proposed may be sufficient for a traditional assessment, many assessments and student submissions vary significantly from the norm. Modern assessment techniques have increased flexibility in format and style. Thus, AI Assess should incorporate robust mechanisms for handling atypical assessments. This includes parsing malformed inputs, recognizing alternative terminologies, and accommodating diverse expression styles. The system could leverage few-shot prompting with examples of unconventional but valid answers to improve grading accuracy. For responses falling outside expected patterns, the system should automatically escalate to human review rather than risk misgrading. This approach is accommodating of students with different learning styles or those from diverse linguistic backgrounds.

AI Assess could also be extended beyond text-based submissions through specialized prompt engineering and multimodal LLM capabilities. For video submissions, prompts could analyze transcribed speech, presentation structure, and demonstrated skills using video-to-text processing. However, there are still limitations in the ability to assess delivery and presentation. Creative submissions like artwork, music, or design projects could leverage image recognition and domain-specific criteria encoding. Both extensions require careful validation to ensure fairness and accuracy across diverse creative expressions. The system would require enhanced few-shot prompting with domain expert examples and clear rubric translation for subjective assessment criteria. These future areas of research would maintain the core benefits of automated assessment while accommodating broader evaluation needs in creative and performance-based disciplines.

4.5. Limitations

The development of the proposed system also raises concerns about the potential for job displacement and the need for ongoing educator training (Anderson et al., 2018). Automating grading and feedback processes may lead to the displacement of some teaching jobs, particularly those with an administrative focus, such as grading. This highlights the need for ongoing training for educators and professional development to ensure that all staff are equipped with the skills required to effectively use generative AI technologies in the classroom. Additionally, there is a need to ensure that the system is designed in a way that complements and supports rather than replaces human teachers. It is critical that the human teacher remains integral to the process through an HITL approach, as while the proposed system has the potential to improve educational assessment and student learning outcomes, it is not a substitute for the valuable knowledge, expertise, and tailored education that human teachers provide.

LLMs struggle with numeric and arithmetic tasks (Frieder et al., 2023). For instance, if ChatGPT is asked to define an acronym with many repeated letters, it often miscounts and provides incorrect definitions. The components described herein will be more useful in text-based disciplines, and less effective in numeric disciplines such as mathematics and accounting.

5. Conclusions

AI Assess has the potential to revolutionize educational assessment. It has the ability to grade quiz-style questions, provide instantaneous feedback, generate relevant practice problems to improve content retention, and identify weaknesses in a student body. It has been shown to reduce instructor workload without compromising student learning outcomes.

The results displayed in Figure 6 show that providing human training in Experiment 2 has increased the exact match between human grades and automatic grades from 16% (4 out of 25 grades in Experiment 1) to 80% (20 out of 25 grades in Experiment 2). This further demonstrates the importance of the HITL approach. Figure 6 further shows that even the grades that do not exactly match are very close to each other (within a maximum difference of 5%, e.g., human grade 50%, automatic grade 55%).

The successful deployment of LLMs in educational assessment has the potential to significantly reduce instructors’ workload. However, this reduction is not entirely without cost, as educators must initially invest time in the prompt engineering process to optimize their assessment prompts. Since this process is performed once but can be reused across multiple semesters, the long-term time savings become evident. Over the course of their careers, this results in a substantial reduction in instructors’ workload.

As GPT technology continues to evolve, future iterations of AI Assess will conduct student assessment through conversation-style determination of knowledge, allowing for individualized learning outcomes. Despite the ethical considerations and potential biases, the education sector should embrace this new AI technology to effectively leverage its incredible potential to benefit students, educators, and educational institutions. Future development of AI Assess should prioritize creating user-friendly interfaces for prompt customization and management, enabling instructors to adapt the system to their specific pedagogical needs while maintaining its effectiveness as LLM technology evolves.

Future work of this research will focus on conducting student assessment through conversation-style determination of knowledge, allowing for individualized learning. This represents a significant development in the field of educational assessment, as it has the potential to provide a more personalized and engaging learning experience for students. By allowing for the conversational determination of knowledge, AI Assess engages students in a dialogue which permits and encourages the development of a deeper understanding. This approach can also help identify areas where students may struggle or require additional support, allowing for more targeted intervention than previously available.

However, the development of conversation-style determination of knowledge poses several challenges that must be addressed. One of the main challenges is ensuring that the system is accurate and reliable in assessing student knowledge. This requires the development of further NLP technologies and algorithms that can analyze and understand student responses in a more nuanced way. There is a need to ensure that AI Assess is accessible and usable for all students, including students with disabilities and students who struggle to gain access to technology (such as those who live in remote areas with poor internet connectivity) (Rennie et al., 2013). These limitations require further research. Multilingual extensions of AI Assess can be achieved through ChatGPT models and further prompt engineering (Fu et al., 2022).

Future areas of research in the field of educational assessment by generative AI should focus on improving accuracy and assurance of fairness. Additionally, ongoing research is needed to address concerns of potential bias and discrimination in the system (Ray, 2023). The system must be assuredly fair, unbiased, and highly accessible to all students. Other areas of further research include the integration of the AI Assess components with other educational technologies, such as learning management systems (LMSs). An LMS is a software application for the automation and administration of educational courses. Such an integration creates a more seamless and personalized teaching and learning experience for both educators and students. It would be valuable to validate the results determined herein using alternative text generation models such as Claude and Google Gemini (Rahaman et al., 2023). As discussed herein, further research into multimodality and creative assessment types is required, including video, artistic, and performance-based submissions. Overall, there is a need for ongoing research in the field to ensure continued growth, thereby realizing the potential of the technology to revolutionize the education sector (Ahmadi, 2018).

The potential for GPT-powered chatbots in educational assessments is extensive and must be embraced by the education sector. Further research and development can improve the system, allowing for more widespread implementation in the classrooms of the future.

Author Contributions

Conceptualization, A.D. (Ayman Diyab), B.D.F. and R.M.F.; methodology, A.D. (Ayman Diyab), B.D.F., R.M.F. and A.D. (Ahmad Diyab); software, B.D.F., R.M.F. and A.D. (Ayman Diyab); validation, A.D. (Ayman Diyab) and A.D. (Ahmad Diyab); formal analysis, B.D.F. and A.D. (Ayman Diyab); investigation, A.D. (Ayman Diyab); data curation, B.D.F.; writing—original draft preparation, B.D.F. and R.M.F.; writing—review and editing, B.D.F., R.M.F., A.D. (Ayman Diyab), and A.D. (Ahmad Diyab); visualization, A.D. (Ayman Diyab) and B.D.F.; supervision, A.D. (Ayman Diyab) and B.D.F.; project administration, B.D.F., R.M.F. and A.D. (Ayman Diyab); funding acquisition, B.D.F., R.M.F. and A.D. (Ayman Diyab). All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable (the research is based on datasets available in the literature).

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to acknowledge the support and assistance in early ideation from M. Makrehchi. They would also like to acknowledge the assistance provided through the early stages of the work by Alden O’Cain and Zackary Masri, whose expertise in software engineering and statistics, respectively, proved invaluable. We extend a sincere thank you to the authors who shared their exam datasets. We extend our sincere thanks to the anonymous reviewers whose careful reading, thoughtful feedback, and insightful recommendations greatly improved the quality and clarity of this manuscript. Their expertise and constructive criticism helped strengthen both the technical content and presentation of our work. We would finally like to express our gratitude for the assistance in editing and revision provided by Harrison Nelson.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	artificial intelligence
GPT	generative pre-trained transformer
HITL	human-in-the-loop
LLM	large language model
LMS	learning management system
NLP	natural language processing
STEM	science, technology, engineering, and mathematics

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Table

Figure 1. Brief descriptions of AI Assess tasks.

View Image - Figure 2. An activity diagram for the automated grading prompt use case. Red and green boxes correspond to the cognitive verification and reflection prompt patterns, respectively. Blue rounded boxes represent operations.

Figure 2. An activity diagram for the automated grading prompt use case. Red and green boxes correspond to the cognitive verification and reflection prompt patterns, respectively. Blue rounded boxes represent operations.

View Image - Figure 3. An activity diagram for the weakness identification prompt use case. Red boxes represent the cognitive verification prompt pattern. Green boxes represent the reflection prompt pattern. Yellow boxes represent the template prompt pattern. Blue rounded boxes represent operations.

Figure 3. An activity diagram for the weakness identification prompt use case. Red boxes represent the cognitive verification prompt pattern. Green boxes represent the reflection prompt pattern. Yellow boxes represent the template prompt pattern. Blue rounded boxes represent operations.

View Image - Figure 4. An activity diagram for the feedback provision prompt use case. Red and green boxes correspond to the cognitive verification and reflection prompt patterns, respectively. Yellow boxes represent templates. Blue rounded boxes represent operations.

Figure 4. An activity diagram for the feedback provision prompt use case. Red and green boxes correspond to the cognitive verification and reflection prompt patterns, respectively. Yellow boxes represent templates. Blue rounded boxes represent operations.

View Image - Figure 5. An activity diagram for the question generation prompt use case. Red boxes represent the cognitive verification prompt pattern. Green boxes represent the reflection prompt pattern. Yellow boxes represent the template prompt pattern. Blue rounded boxes represent operations.

Figure 5. An activity diagram for the question generation prompt use case. Red boxes represent the cognitive verification prompt pattern. Green boxes represent the reflection prompt pattern. Yellow boxes represent the template prompt pattern. Blue rounded boxes represent operations.

Figure 6. The scatter plots of grades for Experiment 1 (a) and Experiment 2 (b).

View Image - Figure 7. A survey of various ChatGPT-generated responses to the prompt “What is it called when a linked list accesses the previous node?”. The responses generated demonstrate ChatGPT’s ability to provide factual information while discerning misinformative premises.

Figure 7. A survey of various ChatGPT-generated responses to the prompt “What is it called when a linked list accesses the previous node?”. The responses generated demonstrate ChatGPT’s ability to provide factual information while discerning misinformative premises.

Table 1

A table of the prompt patterns leveraged (Wei et al., 2022; White et al., 2023). Bolded patterns are used in this paper.

Pattern Category	Prompt Pattern
Input Semantics	1. Meta Language Creation
Output Customization	2. Output Automater 3. Persona 4. Visualization Generator 5. Recipe 6. Template
Error Identification	7. Fact Check List 8. Chain-of-Thought Prompting
Prompt Improvement	9. Question Refinement 10. Alternative Approaches 11. Cognitive Verifier 12. Refusal Breaker
Interaction	13. Flipped Interaction 14. Game Play 15. Infinite Generation
Context Control	16. Context Manager

Appendix A. LMQL Installation and Setup

This appendix provides instructions for setting up LMQL to work with the functions described in this paper. LMQL requires Python 3.10 or newer and can be installed with pip.

Listing A1. Installing LMQL and required dependencies.

pip install lmql pip install openai # Required for ChatGPT integration

To use LMQL with ChatGPT, the OpenAI application programming interface (API) key will need to be configured.

Listing A2. LMQL configuration for ChatGPT.

import osos.environ["OPENAI_API_KEY"] = "your-api-key-here" # Test configurationimport lmqlmodel = lmql.model("chatgpt")

LMQL provides several decorators and syntax features that we used throughout our implementation.

@lmql.query—Marks a function as an LMQL query.
where clauses—Constrains variable outputs.
String interpolation with [VAR] syntax.
Type annotations for input validation.

For detailed documentation on LMQL’s features and syntax, refer to the official documentation at https://lmql.ai/docs/. All code examples in this paper assume that this basic setup is in place. The implementations provided can be run directly after completing this configuration.

Appendix B. Sample Questions

The five questions (Q4 to Q8) considered in Section 2.2 (Weakness Identification) are given below with their correct answers underlined.

Q4. “The system should be designed so it can be extended in the future to handle membership for children” is an aspect of a fitness club membership system that can be classified as:
- non-functional requirement
- functional requirement
- ethical requirement
- compatibility requirement
Q5. System requirements are high-level abstract requirements, whereas user requirements are detailed descriptions of what the software system should do.
- true
- false
Q6. In practice, requirements engineering is an iterative process in which the following generic activities are interleaved:
- Requirements analysis, requirements design, and requirements implementation
- Requirements planning, requirements analysis, requirements testing, and requirements deployment
- Requirements understanding, requirements analysis, and requirements documentation
- Requirements elicitation, requirements analysis, requirements validation, and requirements management
Q7. …………………are important constraints. However, they are usually not placed in the requirements document, but are found in the contract for the system or are left to A separate project plan document
- Delivery time and technology to be used
- Cost and delivery time
- Cost and reliability
- Technology to be used and cost
Q8. Software requirements can be divided into the following four major types.
- Entity, Boundary, function, and control
- Functional, nonfunctional, control, and boundary
- Functional, nonfunctional, quality, and platform
- Functional, quality, platform, and process

References

Abdullah, M.; Madain, A.; Jararweh, Y. ChatGPT: Fundamentals, applications and social impacts. 2022 Ninth International Conference on Social Networks Analysis, Management and Security (SNAMS); Milan, Italy, March 15; 2022; pp. 1-8. [DOI: https://dx.doi.org/10.1109/SNAMS58071.2022.10062688]

Afifi, M.; Hussain, K. The achievement of higher flexibility in multiple-choice-based tests using image classification techniques. International Journal on Document Analysis and Recognition (IJDAR); 2019; 22, 8 pp. 127-142. [DOI: https://dx.doi.org/10.1007/s10032-019-00322-3]

Ahmadi, M. R. The use of technology in english language learning: A literature review. International Journal of Research in English Education; 2018; 3, 2 pp. 115-125. [DOI: https://dx.doi.org/10.29252/ijree.3.2.115]

Alahmar, A.; AlMousa, M.; Benlamri, R. Automated clinical pathway standardization using SNOMED CT-based semantic relatedness. Digital Health; 2022; 8, 20552076221089796. [DOI: https://dx.doi.org/10.1177/20552076221089796] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35392252]

Alahmar, A.; Benlamri, R. Optimizing hospital resources using big data analytics with standardized e-clinical pathways. IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech); Calgary, AB, Canada, August 17–22; 2020; [DOI: https://dx.doi.org/10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00112]

Alahmar, A.; Crupi, M.; Benlamri, R. Ontological framework for standardizing and digitizing clinical pathways in healthcare information systems. Computer Methods and Programs in Biomedicine; 2020; 196, 105559. [DOI: https://dx.doi.org/10.1016/j.cmpb.2020.105559]

Alahmar, M. A. Rule based expert system for selecting software development methodology. Journal of Theoretical and Applied Information Technology; 2010; 19, 2 pp. 143-148.

Anderson, J.; Rainie, L.; Luchsinger, A. Artificial intelligence and the future of humans. Pew Research Center; 2018; 10, 12

Arbane, M.; Benlamri, R.; Brik, Y.; Alahmar, A. Social media-based COVID-19 sentiment classification model using Bi-LSTM. Expert Systems with Applications; 2023; 212, 118710. [DOI: https://dx.doi.org/10.1016/j.eswa.2022.118710]

Astin, A. W. Student involvement: A development theory for higher education. Journal of College Student Development; 1984; 40, pp. 518-529.

Baidoo-Anu, D.; Owusu Ansah, L. Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI; 2023; 7, 1 pp. 52-62. [DOI: https://dx.doi.org/10.61969/jai.1337500]

Beurer-Kellner, L.; Fischer, M.; Vechev, M. Prompting is programming: A query language for large language models. Proceedings of the ACM on Programming Languages; 2023; 7, 186. [DOI: https://dx.doi.org/10.1145/3591300]

Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Amodei, D. GPT-3: Language Models are Few-Shot Learners. arXiv; 2020; [DOI: https://dx.doi.org/10.48550/arXiv.2005.14165]

Busatta Galhardi, L.; Duilio Brancher, J. Machine learning approach for automatic short answer grading: A systematic review. 16th Ibero-American Conference on AI; Trujillo, Peru, November 13–16; 2018.

Chakraborty, U.; Konar, D.; Roy, S. Automatic short answer grading using rough concept clusters. International Journal of Advanced Intelligence Paradigms; 2018; 10, 1 pp. 260-280. [DOI: https://dx.doi.org/10.1504/IJAIP.2019.103413]

Ferrara, E. Should ChatGPT be biased? Challenges and risks of bias in large language models. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2304.03738]

Frieder, S.; Pinchetti, L.; Griffiths, R.-R.; Salvatori, T.; Lukasiewicz, T.; Petersen, P. C.; Chevalier, A.; Berner, J. Mathematical capabilities of ChatGPT. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2301.13867]

Fu, J.; Ng, S.-K.; Liu, P. Polyglot prompt: Multilingual multitask prompt training. arXiv; 2022; [DOI: https://dx.doi.org/10.48550/arXiv.2204.14264]

Gomaa, W.; Fahmy, A. Short answer grading using string similarity and corpus-based similarity. International Journal of Advanced Computer Science and Applications; 2012; 3, 11 pp. 115-121.

Kumar, A. Computer science and engineering, Jain University, faculty research profile, dr. A. Kumar.; 2023; Available online: https://set.jainuniversity.ac.in/Faculty-kumar/ (accessed on 7 September 2023).

Lee, M. A mathematical investigation of hallucination and creativity in GPT models. Mathematics; 2023; 11, 10 2320. [DOI: https://dx.doi.org/10.3390/math11102320]

Liang, W.; Yuksekgonul, M.; Mao, Y.; Wu, E.; Zou, J. GPT detectors are biased against non-native English writers. Patterns; 2023; 4, 7 100779. [DOI: https://dx.doi.org/10.1016/j.patter.2023.100779]

Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv; 2021; [DOI: https://dx.doi.org/10.1145/3560815]

Liu, T.; Zhang, Y.; Brockett, C.; Mao, Y.; Sui, Z.; Chen, W.; Dolan, B. A token-level reference-free hallucination detection benchmark for free-form text generation. arXiv; 2022; [DOI: https://dx.doi.org/10.48550/arXiv.2104.08704]

Liu, X.; Wang, J.; Sun, J.; Yuan, X.; Dong, G.; Di, P.; Wang, W.; Wan, D. Prompting frameworks for large language models: A survey. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2311.12785]

Lund, B. D.; Wang, T. Chatting about ChatGPT: How may AI and GPT impact academia and libraries. Library Hi Tech News; 2023; 40, 3 pp. 26-29. [DOI: https://dx.doi.org/10.1108/LHTN-01-2023-0009]

Odeh Al-Awaida, S. Automated Arabic essay grading system based on support vector machine and text similarity algorithm. Ph.D. thesis; Middle East University: 2019.

OpenAI,. Chat with GPT-3 and win; 2022; Available online: https://openai.com/blog/chat-gpt-3-launch/ (accessed on 1 February 2023).

Oppenlaender, J.; Linder, R.; Silvennoinen, J. Prompting AI art: An investigation into the creative skill of prompt engineering. arXiv; 2023; [DOI: https://dx.doi.org/10.1080/10447318.2024.2431761]

Ozdemir, S. Quick start guide to large language models: Strategies and best practices for using chatgpt and other llms; Addison-Wesley Professional: 2023.

Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. OpenAI Blog; 2018; 1, 8 9.Available online: https://openai.com/blog/language-unsupervised/ (accessed on 15 February 2023).

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog; 2019; 1, 9 pp. 1-18. Available online: https://openai.com/blog/better-language-models/ (accessed on 15 February 2023).

Rahaman, M. S.; Ahsan, M. M. T.; Anjum, N.; Mizanur, R. M.; Rahman, M. N. The AI Race is on! Google’s Bard and Openai’s Chatgpt head to head: An opinion article. SSRN; 2023; [DOI: https://dx.doi.org/10.2139/ssrn.4351785]

Ray, P. P. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems; 2023; 3, pp. 121-154. [DOI: https://dx.doi.org/10.1016/j.iotcps.2023.04.003]

Rennie, E.; Crouch, A.; Wright, A.; Thomas, J. At home on the outstation: Barriers to home Internet in remote Indigenous communities. Telecommunications Policy; 2013; 37, 6 pp. 583-593. [DOI: https://dx.doi.org/10.1016/j.telpol.2012.07.007]

Rudolph, J.; Tan, S.; Tan, S. ChatGPT: Bullshit spewer or the end of traditional assessment in higher education. Journal of Applied Learning and Teaching; 2023; 6, 1 pp. 342-363.

Salam, M.; El-Fatah, M.; Hassan, N. Automated grading for Arabic short answer questions using optimized deep learning model. PLoS ONE; 2022; 17, 8 e0272269. [DOI: https://dx.doi.org/10.1371/journal.pone.0272269]

Schinske, J.; Tanner, K. Teaching more by grading less (or differently). CBE Life Sciences Education; 2014; 13, 2 pp. 159-166. [DOI: https://dx.doi.org/10.1187/cbe.cbe-14-03-0054] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26086649]

Schneider, J.; Richner, R.; Riser, M. Towards trustworthy autograding of short, multi-lingual, multi-type answers. International Journal of Artificial Intelligence in Education; 2023; 33, pp. 88-118. [DOI: https://dx.doi.org/10.1007/s40593-022-00289-z]

Schulhoff, S. Shot-based prompting; 2024; Available online: https://learnprompting.org/docs/basics/few_shot (accessed on 15 June 2024).

Seradi, A. Exam marking lab; 2023; Available online: https://sites.google.com/view/exam-marking-lab/home/ (accessed on 7 September 2023).

Tlili, A.; Shehata, B.; Adarkwah, M. A.; Bozkurt, A.; Hickey, D. T.; Huang, R.; Agyemang, B. What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learning Environments; 2023; 10, 15 15. [DOI: https://dx.doi.org/10.1186/s40561-023-00237-x]

Tossell, C. C.; Tenhundfeld, N. L.; Momen, A.; Cooley, K.; de Visser, E. J. Student perceptions of ChatGPT use in a college essay assignment: Implications for learning, grading, and trust in artificial intelligence. IEEE Transactions on Learning Technologies; 2024; 17, pp. 1069-1081. [DOI: https://dx.doi.org/10.1109/TLT.2024.3355015]

Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. arXiv; 2022; [DOI: https://dx.doi.org/10.48550/arXiv.2201.11903]

White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D. C. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2302.11382]

Wooldridge, M. A brief history of artificial intelligence—What it is, where we are, and where we are going; Flatiron Books: 2021.

Zamfirescu-Pereira, J. D.; Wong, R. Y.; Hartmann, B.; Yang, Q. Why Johnny can’t prompt: How non-AI experts try (and fail) to design LLM prompts. CHI ’23: CHI Conference on Human Factors in Computing Systems; Hamburg, Germany, April 23–28; 2023; [DOI: https://dx.doi.org/10.1145/3544548.3581388]

Zhai, X. ChatGPT User Experience: Implications for Education. SSRN; 2023; [DOI: https://dx.doi.org/10.2139/ssrn.4312418]

Zhang, L.; Huang, Y.; Yang, X.; Yu, S.; Zhuang, F. An automatic short-answer grading model for semi-open-ended questions. Interactive Learning Environments; 2019; 30, 1 pp. 177-190. [DOI: https://dx.doi.org/10.1080/10494820.2019.1648300]

Zhang, M.; Press, O.; Merrill, W.; Liu, A.; Smith, N. A. How language models hallucinations can snowball. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2305.13534]

Word count: 11884

Show less

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

AI Assess, a ChatGPT-based assessment system utilizing the ChatGPT platform by OpenAI, composed of four components, is proposed herein. The components are tested on the GPT model to determine to what extent they can grade various exam questions based on learning outcomes, generate relevant practice problems to improve content retention, identify student knowledge gaps, and provide instantaneous feedback to students. The assessment system has been explored using software engineering and computer science courses and is successful through testing and evaluation. AI Assess has demonstrated the ability to generate practice problems based on syllabus information and learning outcomes. The components have been shown to identify weak areas for students. Finally, it has been shown to provide different levels of feedback. The combined set of components, if incorporated into a complete software system and implemented in classrooms with proposed transparency mechanisms, has vast potential to reduce instructor workload, improve student understanding, and enhance the learning experience. The potential for GPT-powered chatbots in educational assessments is vast and must be embraced by the education sector.

Details

Title

Engineered Prompts in ChatGPT for Educational Assessment in Software Engineering and Computer Science

Author

Diyab, Ayman¹

; Russell Morris Frost¹

; Fedoruk, Benjamin David²

; Diyab, Ahmad³

¹ Faculty of Engineering, Lakehead University, Thunder Bay, ON P7B 5E1, Canada; [email protected]
² Mitch and Leslie Frazer Faculty of Education, Ontario Tech University, Oshawa, ON L1G 0C5, Canada
³ Shad Alumni, Western University, London, ON N6A 3K7, Canada; [email protected]

First page

156

Publication year

2025

Publication date

2025

Publisher

MDPI AG

e-ISSN

22277102

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/educsci15020156

ProQuest document ID

3170873298

Engineered Prompts in ChatGPT for Educational Assessment in Software Engineering and Computer Science

Jump to:

Full Text

2. Materials and Methods

2.1. Automated Grading

2.2. Weakness Identification

2.3. Feedback Provision

2.4. Question Generation

3.1. Automated Grading

3.2. Weakness Identification

3.3. Feedback Provision

3.4. Question Generation

4.1. Engineering Effective Prompts

4.2. Benefits over Traditional Assessment Techniques

4.3. Risks of Leveraging LLMs

4.3.1. Bias Risk

4.3.2. Hallucination Risk

4.3.3. Transparency and Audit System

4.4. Edge Case Handling

4.5. Limitations

Abstract

Details

Suggested sources