Abstract
The exceptional progress in artificial intelligence is transforming the landscape of technical jobs and the educational requirements needed for them. This study’s purpose is to present and evaluate an intuitive open-source framework that transforms existing courses into interactive, AI-enhanced learning environments. Our team performed a study of the proposed method’s advantages in a pilot population of teachers and students, who assessed it as “involving, trustworthy and easy to use”. Furthermore, we evaluated the AI components on standard large language model (LLM) benchmarks. This free, open-source, AI-enhanced educational platform can be used to improve the learning experience in all existing secondary and higher education institutions, with the potential of reaching the majority of the world’s students.
1. Introduction
Artificial intelligence has developed in recent years at a remarkable pace, a growth that has influenced and continues to influence many fields, education being no exception. In this article, we highlight the practical potential of AI to revolutionize secondary and higher education, modifying and improving existing pedagogical approaches by creating motivating, effective, and engaging learning environments.
In Section 2, we study the current advances in teaching and evaluation methods, assessing the benefits of these enhanced methods and their reach in the real world as a proportion of the total student population. We found that less than 10% of the global education system is benefiting from these advanced techniques (see Section 2.3).
We emphasize a major issue in education: the fact that most educational institutions are lagging in adopting new technologies in their learning methods. While top universities are able to pioneer the newest education tools (for example, most Ivy League universities take advantage of online courses, laboratories, and AI-assisted student assistants), this is not the case for the majority of the 1.5 million secondary-level [1] and 25,000 higher education institutions in the world. Across this large majority of educational institutions, which serve 90% of the world’s students, the teaching and evaluation methods are still lagging.
At the same time, the majority of students nowadays prefer electronic formats for course delivery; in plain words, students would rather learn on a phone, tablet or computer (see Section 2) [2,3,4]. However, most course materials—main courses, laboratory courses, seminars and exam questions—are currently available as PDF files, Word documents, or printed books, formats that are not well-suited for direct use in computer- or AI-assisted learning methods.
We assess the main causes/barriers to the introduction of these advanced methods, including low funding, the lack of technical experience of the existing teaching staff, and low student access to computing resources.
To address these major issues, we introduce in Section 3 an open-source, one-click platform that enables teachers across all these universities to transform their traditional courses into an AI-enhanced learning experience, regardless of their technical expertise. Additionally, this platform is accessible to students from any browser on a computer, tablet or smartphone.
In Section 4, we evaluate the proposed platform through two experiments: (a) a pilot study on a population of teachers and students to assess application usability and correctness; (b) human and automated tests on standard LLM benchmarks to assess the AI components’ performances.
In Section 5, we discuss the solution’s benefits and the issues it solves, before presenting the closing thoughts in Section 6.
2. Review of New Methods Used in Education
2.1. Computer-Aided Methods in Education
Computer-Assisted Methods in Education (CAME) refer to instructional technologies and strategies that leverage computers to enhance teaching and learning. Emerging in the 2000s with the advent of the internet, these methods have since been extensively developed and widely adopted. They offer several benefits, including individualized learning, increased accessibility for students with disabilities, enhanced engagement, the opportunity for self-paced study, and the provision of immediate feedback [5,6].
CAME uses different teaching techniques and technological tools that seek to improve the educational process, of which we discuss the most important: gamification, microlearning, Virtual Reality and, more recently, artificial intelligence (AI).
Virtual Reality (VR) is combined with Augmented Reality (AR) in teaching to completely transform educational experiences by providing dynamic and attractive contexts. AR enhances real-world experiences by superimposing digital data over the surrounding world, while VR lets students explore 3D worlds.
Another computer-based learning method is gamification, which integrates specific game elements into the educational context. This approach makes learning activities more participatory and engaging. The game-like competition that gives students a sense of achievement is built from simple elements such as leaderboards, points, and badges [7]. The market for gamification offerings in education has grown significantly, from USD 93 million in 2015 to almost USD 11.5 billion by 2020, demonstrating its growing acceptance in the industry [6]. This upward trajectory has continued, with the market reaching approximately USD 13.50 billion by 2024, and it is projected to grow to USD 42.39 billion by 2034 [8].
Microlearning is about to become a major trend in the educational landscape, standing out by providing solutions to problems such as cognitive overload and teacher fatigue. This new method provides students with short and specific content when it suits them. According to some studies [6,9], microlearning segments of between two and five minutes are found to be very effective in terms of engagement and retention of new information. Microlearning is also suitable for mobile, as it allows learners to interact “on the go” and access knowledge at the right moments [6].
In recent years, artificial intelligence has added to these modern methods of education, especially in adaptive learning systems that provide content specific to the expectations of each student, thus improving the overall educational process [9,10]. This personalization improves the overall learning process and increases its effectiveness.
AI is already seen to play a significant role in education and is continually developing, resulting in intelligent learning guidance programs that adapt to the needs of each student and complex adaptive learning systems [9,10].
2.2. AI in Education
AI has significant potential to transform both teaching and learning in education. AI solutions can automate administrative duties, tailor customized learning pathways, provide adaptive feedback, and overall create more engaging and efficient educational experiences.
In general education, AI has been used to develop personalized learning platforms, intelligent tutoring systems, and even automated essay scoring, although in incipient stages, such as Pearson’s AI study tool [11].
From kindergarten to university level, institutions are faced with the challenge of adapting to diverse student populations with varying learning styles and paces. Here, AI can provide an answer by providing the following:
- Personalized Learning/Adaptive Feedback: AI algorithms can personalize learning by assessing student performance, providing instant feedback, and adjusting the pace, content, and evaluation to cater to each learner’s needs [4,5].
- 100% availability for answers from courses: Similar to a teacher providing answers to students during the main class, an AI solution can provide answers from the course materials to an unlimited number of students 24/7.
- Automated Grading: AI can automate tasks such as assignment grading and providing feedback to student inquiries, thus liberating faculty time for more in-depth student interactions.
- Enhanced Research: In higher education, AI tools assist researchers [6] in analyzing large datasets and phenomena, being an effective companion in extracting patterns and generating new ideas [7,9].
- Improved Accessibility: AI-powered solutions can be enhanced with text-to-speech and speech-to-text features, providing support to students with disabilities.
There are five main directions in the domain of artificial intelligence (AI) application in education: assessment/evaluation, prediction, AI assistants, intelligent tutoring systems (ITSs), and the management of student learning [12]; each demonstrates potential for innovation in the education sector. We discuss below the ITSs, AI assistants and AI evaluators.
An ITS is an AI system which provides personalized adaptive instruction, real-time feedback, and tailored learning experiences to support student progress [13,14]. An AI assistant is generally a simpler chatbot which can answer student questions about the class, similarly to how a teaching assistant would. The assessment/evaluation AI grades students’ test answers.
We can find in the existing literature different studies that present the benefits and use-cases for AI across various educational domains such as the social sciences [15], engineering [16], science [17], medicine [18], the life sciences [19], and language acquisition [20], among others [21,22].
In recent years, the applications of artificial intelligence (AI) in the educational sector have been extensively explored, focusing on chatbots [23], programming assistance [24,25], language models [26,27], and natural language processing (NLP) tools [28]. More recently, the introduction of OpenAI’s Generative artificial intelligence (AI) chatbot, ChatGPT 4.0 [28] and competitor models such as Gemini [29], Claude [30] or LLAMA [31] has attracted considerable interest [32,33]. These chatbots are based on large language models (LLMs), trained on datasets of 100 billion words [25], and have the capability to process, generate and reason in natural languages [26,27,28] at human-expert-equivalent level. As a result of their high performance, AI chatbots have gained widespread popularity [34,35], and they are now starting to benefit diverse fields, including research [36,37,38] and education [39].
In conclusion, both traditional algorithmic tools and AI technologies are used to improve the quality of the education process, including learning and evaluation.
2.3. Estimation of the Usage of New Methods in Education
Research on students’ attitudes and behavior about paper vs. digital learning has become a fascinating area of study. In this section, we present an overview of the current split of the supports and tools used in education and its evolution over time.
Paper-Based Materials (Textbooks and Print Handouts): Traditionally, nearly all instructional supports were paper-based. Meta-analyses in education [40] indicate that, in earlier decades, printed textbooks and handouts could represent 70–80% of all learning materials. Over the past two decades, however, this share has steadily decreased (now roughly 40–50%) as digital tools have been integrated into teaching.
Computer-Based Supports (Websites, PDFs, and Learning Management Systems): Research [41,42] in the COVID-19 pandemic period demonstrates that Learning Management Systems (LMSs) and other computer-based resources (including websites and PDFs) have increased from practically 10% to about 40–50% of the educational supports in some settings. This evolution reflects both improved digital infrastructure and shifts in teaching practices.
Smartphones and Mobile Apps: Studies [43] in the early 2010s reported very limited in-class smartphone use. Over time, however, as smartphones became ubiquitous, more recent research [44] shows that these devices have now grown to roughly 20–30% of learning interactions. This growth reflects both increased mobile connectivity and the rising popularity of educational apps.
Interactive Digital Platforms (Websites, Multimedia, and Collaborative Tools): Parallel to the growth in LMS and mobile use, digital platforms that incorporate interactive multimedia and collaborative features have also expanded. Meta-analyses [45] indicate that while early-2000s classrooms saw digital tool usage in the order of 10–20%, today, these platforms now comprise roughly 30–40% of the overall learning support environment. This trend underscores the increasing importance of online content and real-time collaboration in education.
These studies show an evolution from a paper-dominant model toward a blended environment where computer-based resources and mobile devices have grown significantly over the past two decades. Still, each mode of support plays a complementary role in modern education, and many studies also show that paper is still a preferred medium, especially from the point of view of reading experience [46].
For example, large-scale international surveys (10,293 respondents from 21 countries [47] and 21,266 participants from 33 countries [48]) have consistently indicated that most college students prefer to read academic publications in print. These same studies found a correlation between students’ age and their preferred reading modes, with younger students favoring printed materials. A qualitative analysis of student remarks reported in [49] indicates that students’ behavior is flexible. Students usually learn better when using printed materials, albeit this relies on several criteria, including length, convenience, and the importance of the assignments [50].
2.4. The Shift Towards Using AI Tools
While the studies above show that a substantial percentage of students still prefer paper-based materials, in the last 1–2 years we have seen a strong shift towards AI-based tools, especially for school assignments and in countries with greater access to technology.
Based on several recent studies and surveys, we can estimate that up to 40% of US students—across both secondary and higher education—use AI-based educational tools. To illustrate this shift more concretely, several key surveys can be highlighted below:
- Global Trends: A global survey conducted by the Digital Education Council (reported by Campus Technology, 2024) [2] found that 86% of students use AI for their studies. This study, spanning multiple countries including the US, Europe, and parts of Asia, highlights widespread global adoption.
- United States Surveys: In the United States, an ACT survey [3] of high school students (grades 10–12) reported that 46% have used AI tools (e.g., ChatGPT) for school assignments. A survey by Quizlet (USA, 2024) [51] indicates that adoption is even higher in higher education, with about 80–82% of college students reporting they use AI tools to support their learning.
- Additional Studies: Additionally, a quantitative study [4] involving global higher education institutions found that nearly two-thirds (approximately 66%) of students use AI-based tools for tasks such as research, summarization, and brainstorming.
Together, these findings suggest that while usage rates vary by education level and region (with higher rates in the US), there is a continuing global trend towards integrating AI-based educational tools in schools and universities.
2.5. Chatbots in Education
2.5.1. Chatbots: Definition and Classification
A chatbot application is, in simple terms, any application, usually web-based, which can chat with a person in a similar way a human does, being able to answer user questions and follow the history of the conversation.
Chatbot applications can be classified [52] based on distinct attributes such as the domain of knowledge, the services provided, the objectives (goals), and the way responses are generated [53], as summarized in Table 1. Each classification method highlights specific characteristics that determine how a chatbot operates and interacts with users.
Based on the knowledge domain the apps can access, we classify the chatbots as (a) open knowledge domain (general-purpose) and (b) closed knowledge domain (domain-specific) bots. Open chatbot applications address general topics and answer general questions (like Siri or Alexa). On the other hand, closed chatbots are restricted to specific knowledge domains and answer questions only within those domains [53,54].
Based on the service provided, we can classify the bots as (a) informational when they merely provide known information, (b) transactional when they can handle actions like bookings, or (c) assistants when their role is to provide assistance in a similar way as a support person [53].
Based on their fulfilled objective/goal, bots can be designed to either complete specific user tasks (task-oriented), engage in a human-like dialogue (conversational), or be specialized in learning/training.
Finally, based on how the chatbot generates the responses from the input [55], the chatbots can be rule-based, can leverage a full AI architecture, or use a combination of the two (hybrid) [53].
2.5.2. Chatbots: Structure and Role in Education
A chatbot is an application, implemented programmatically or leveraging Generative AI [55,56], that understands and answers questions from human users (or chatbots) in natural language [57] on a particular topic or general subject, by text or voice [58,59].
Figure 1 illustrates the general workflow for interacting with a chatbot that integrates natural language processing (NLP) with a knowledge base retrieval system. The process begins when the chatbot receives input from the user, whether as text, voice, or both. This input is then converted into text and forwarded to the NLP component, which processes and comprehends the query. The response-generation component applies different algorithms to the existing knowledge base and supplies a set of candidate responses to the response selector. In this data processing step, the response selector uses machine learning and artificial intelligence algorithms to choose the most appropriate answer for the input [60]. The current trend is to move towards a more streamlined system that consolidates the process into fewer steps (Question–AI Model–Response), albeit at a higher cost per question, as shown in Figure 1b.
2.5.3. Educational Chatbots Survey
In the subclass of specialized service-oriented applications [57], we can include educational chatbots. We present a short review of the most common educational chatbots in Table 2.
As seen above, the reach of AI in education is high and increasing. Still, we face major barriers, such as the following:
- Insufficient training and digital literacy among educators. For instance, ref. [4] found that many higher education teachers across several countries felt unprepared to fully leverage AI due to inadequate institutional training and support.
- High implementation costs, especially in institutions with lower resources [68].
- Lack of clearly defined and easily applicable policies for the ethical adoption of AI components, as well as concerns about data privacy and fears of algorithmic bias.
These obstacles foster skepticism, and they need to be addressed for the effective introduction of AI tools in education.
We reviewed the split of different supports in education and presented the benefits and challenges of computer-aided AI solutions for education. These limitations can be overcome by the AI-enabled open-source framework we propose in Section 3.
3. Materials and Methods
3.1. Proposed Solution Description
Traditional educational approaches often struggle to address the diverse needs of individual learners, particularly in the science, technology, engineering, and mathematics (STEM) domain. Limited instructor availability, inconsistent feedback, and a lack of personalized learning experiences can hinder student progress and engagement [69]. To address these challenges, we developed an AI-powered teaching assistant designed to improve the learning process in university-level programming courses. This framework is open-source and can be used by any instructor and student without any technical skills required.
3.2. High-Level Description
Our application consists of four student modules and two supervisor/teacher modules.
Student-facing modules
Module 1—AI Teaching Assistant. In this module, the student has access to the course material and to an AI assistant (chatbot) who can answer questions related to the contents of the course material.
Module 2—Practice for Evaluation. In this module, students can prepare for the final assessment using “mock assessments”, where the questions generated by the AI are similar to but different from those of the final assessment. The student is tested on multiple-choice questions, open-text questions, and specific tasks (for example, computer programming). The AI evaluator provides immediate feedback on the student’s answers, suggesting improvements and providing explanations for correct answers. It is worth mentioning that the AI evaluator can also redirect students to Module 1 to review the relevant course information and seek further clarification from the AI assistant.
Module 3—Final Evaluation. After students complete several practice evaluations, they have the option to take the final evaluation. The module is similar to the practice for evaluation module, but it represents the “real exam”; questions are created by the teacher (or generated by AI and validated by the teacher) and graded by the AI evaluator, and the final grades are transcribed in the official student file. The feedback and grades can be provided instantly or be delayed until the teacher double-checks the evaluation results.
Module 4—Feedback. This module allows students to provide optional feedback related to their experience using this AI application. This feedback is essential for further development and improvement of this framework.
Supervisor modules
Module 5—Setup. Teachers are provided with a simple interface allowing them to carry out the following:
- Add a new course;
- Drag-and-drop the course chapter documents (in PDF, DOC or OPT format);
- Drag-and-drop a document containing exam questions or opt for automatic exam question generation;
- Optionally upload Excel files with the previous year’s student results for statistical analysis.
Module 6—Statistics. This module extracts statistics on the year-on-year variation in students’ results. We use it to evaluate the efficiency of this teaching method vs. the previous year’s results in the classical teaching system (See Section 4.1.3).
3.3. Description of the User Interface
The user interface of our application is presented below in Figure 2:
The UI is divided into three frames, as shown in Figure 2. The left frame is for navigation, the center frame for displaying the course material (PDFs) or exam questions, and the right frame for accessing the AI assistant or AI evaluator.
Depending on the module chosen, the UI provides the following functionalities:
- Module 1—AI Assistant: In this module, the student selects in the left frame one of the course chapters, which is displayed in the center frame. Then, the student can ask the AI assistant questions about the selected chapter in the right frame.
- Modules 2 and 3—Preparation for Evaluation/Final Evaluation: These modules share a similar UI and present students with exercises, questions, and programming tasks in the middle frame. In the right frame, the chatbot provides feedback on student responses or links to Module 1 so the student can review the course documentation.
- Module 4—Feedback Survey: The center frame displays a form for evaluating the application.
3.4. Architecture Details of the Platform
In this section, we provide a detailed description of the architecture of our solution (See Section 3.4.1) and details of the AI components’ architecture (See Section 3.4.2).
3.4.1. General Software Application Architecture and Flow
The application leverages a multi-layered architecture comprising a Frontend, a Backend, and external services, orchestrated within a containerized environment.
The architecture is highly modular, with clear separation of roles between Frontend, Backend, and LLM components. This allows for easy extension and modification (e.g., changing the LLM).
The solution can be deployed via Docker on any server; in our case, it is deployed on a Google Cloud Run [70] instance. Docker is a containerization technology which simplifies deployment and ensures consistent behavior across different environments.
To understand the architecture of the application, we illustrate in Figure 3 the main components of the application and the usual dataflow in the case of Module 1. Briefly, the student asks a question in the Frontend (UI) component of the application, which runs in a Docker container in Google Cloud. The question is processed by the Backend, augmented with the course-relevant context, and then processed by an external LLM, and the answer is sent back to the student.
3.4.2. AI Components/Modules’ Architecture
The AI components are implemented in Modules 1, 2, and 3 of the Backend components.
Module 1—AI Assistant
To maintain its independence from any LLM provider, the Backend architecture for Module 1 is centered around three base concepts:
- LLM Protocol: an interface which describes the minimal conditions that an LLM needs to implement to be usable; in this case, it should be able to answer a question.
- RAG Protocol: an interface which describes what the Retrieval Augmented Generation (RAG) pattern should implement. The main idea [71] is to use a vector database to select possible candidates from the documentation (PDF course support), which are provided as context to an LLM so it can answer the student’s question. Objects implementing the RAG Protocol provide functions for the following:
  - Context retrieval—given a question, retrieve the possible context from the vector DB.
  - Embedding generation—a helper function to convert (embed) text into a vector representation.
  - Similarity search—perform similarity searches within the vector store database to find the most relevant chunks.
- LLM RAG (Figure 4): a class which contains the following:
  - An object ‘rag’ which implements the RAG Protocol (for example, RAGChroma) to store and retrieve relevant document chunks (content) for the question.
  - A function to augment the question with the context retrieved from the ‘rag’ object.
  - An object ‘llm’ which implements the LLM Protocol (for example, LLMGemini) to answer the augmented question.
The flow of a question in LLM_RAG is shown in Figure 5. The question from a student is sent to LLM_RAG. In turn, LLM_RAG calls the RAGChroma component for context. LLM_RAG then creates an augmented question in a format similar to “Please answer {question} in the context {context extracted by rag}”. This question is sent to the LLMGemini component, and the answer is sent back to the student, possibly enhanced with snippets from the course material.
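To make this separation of concerns concrete, below is a minimal Python sketch of how the LLMProtocol, RAGProtocol, and LLM_RAG pieces could fit together; the method names and the prompt template are simplifications inferred from the description above, not the platform’s exact code.

```python
from typing import List, Protocol


class LLMProtocol(Protocol):
    """Minimal contract an LLM backend must satisfy: answer a question."""

    def answer(self, question: str) -> str: ...


class RAGProtocol(Protocol):
    """Contract for a retrieval component backed by a vector database."""

    def embed(self, text: str) -> List[float]: ...                      # embedding generation
    def similarity_search(self, embedding: List[float], k: int = 4) -> List[str]: ...
    def retrieve_context(self, question: str, k: int = 4) -> List[str]: ...


class LLM_RAG:
    """Combines a RAGProtocol object and an LLMProtocol object (RAG pattern)."""

    def __init__(self, rag: RAGProtocol, llm: LLMProtocol) -> None:
        self.rag = rag
        self.llm = llm

    def augment(self, question: str, context: List[str]) -> str:
        # Prompt template paraphrasing the format described in the text.
        joined = "\n".join(context)
        return (
            f"Please answer the question: {question}\n"
            f"using only the following course context:\n{joined}\n"
            "If the answer cannot be deduced from the context, say so."
        )

    def ask(self, question: str) -> str:
        context = self.rag.retrieve_context(question)
        return self.llm.answer(self.augment(question, context))
```

Because both dependencies are expressed as protocols, swapping LLMGemini for another provider, or RAGChroma for another vector store, only requires passing a different object to the constructor.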
Module 2—AI Practice for Evaluation and Module 3—AI Evaluation
Both Modules 2 and 3 are implemented around the evaluator protocol concept: an interface which describes how an evaluator (judge) should grade a question and the feedback it should provide to the student, similar to the LLM-as-a-judge concept presented in HELM and Ragas [72,73].
Implementations of the evaluator protocol provide functions for evaluating user responses, including the following:
- Evaluate Answer: compares the user’s free-text or single-choice answers with the correct answers.
- Evaluate Code Answer: executes user-provided code snippets and compares them against the correct code.
- Calculate Score: calculates the overall score based on a list of responses.
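As an illustration, the evaluator contract could be expressed in Python as follows; the method names and the Evaluation structure are assumptions based on the description above, not the exact interface of the platform.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Evaluation:
    score: float      # 0.0-1.0 for one question
    feedback: str     # explanation / suggestion returned to the student


class EvaluatorProtocol(Protocol):
    """Contract for an LLM-as-a-judge style evaluator used by Modules 2 and 3."""

    def evaluate_answer(self, question: str, reference: str, answer: str) -> Evaluation:
        """Compare a free-text or single-choice answer with the reference answer."""
        ...

    def evaluate_code_answer(self, task: str, reference_code: str, code: str) -> Evaluation:
        """Run the submitted code (in a sandbox) and compare its behavior with the reference code."""
        ...

    def calculate_score(self, evaluations: List[Evaluation]) -> float:
        """Aggregate per-question scores into an overall grade."""
        ...
```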
3.5. Technical Implementation of the Platform
We present the details of the technical implementation of our platform. As outlined in the general architecture description (Section 3.4), the application is divided into a Frontend and a Backend, which are packaged in a Docker image deployed on Google Cloud Run. We use an instance with 8 GB of memory for tests, but 2 GB should be enough for normal usage.
All components are a combination of standard Python code and LLMs prompted specially to be used in question/answer, evaluator, or assistant mode.
3.5.1. Frontend Implementation
The Frontend is based on Streamlit, an open-source Python framework for building and deploying data-driven web applications (we use Python 3.11).
The Frontend component implements the following:
- Session Management: manages user sessions and state, including Google OAuth 2.0 authentication.
- User Interface: provides the user interface for interaction, including chapter selection, course navigation, dialog with the AI, evaluation, and feedback surveys:
  - Navigation: uses a sidebar for primary navigation, allowing users to select chapters/units and specific course materials.
  - Dialog Interaction: renders a dialog zone where users interact with the AI assistant, including input fields and the display of AI responses.
  - Evaluation Display: presents evaluation results to the user.
  - Styling: Streamlit themes and custom CSS.
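For illustration, a stripped-down Streamlit sketch of the three-frame layout could look as follows; the widget choices, labels, chapter names, and session keys are our assumptions, not the platform’s actual Frontend code.

```python
import streamlit as st

st.set_page_config(page_title="AI Teaching Assistant", layout="wide")

# Left frame: navigation (sidebar) with chapter selection.
chapter = st.sidebar.selectbox("Chapter", ["1. SQL Basics", "2. Joins", "3. PL/SQL"])

center, right = st.columns([2, 1])

# Center frame: course material for the selected chapter.
with center:
    st.header(chapter)
    st.write("Course material (PDF pages rendered here).")

# Right frame: dialog with the AI assistant.
with right:
    st.subheader("AI assistant")
    if "history" not in st.session_state:
        st.session_state.history = []          # conversation kept per user session
    if question := st.chat_input("Ask about this chapter"):
        answer = "..."                          # here the Frontend would call the Backend (LLM_RAG)
        st.session_state.history += [("user", question), ("assistant", answer)]
    for role, text in st.session_state.history:
        st.chat_message(role).write(text)
```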
3.5.2. Backend Implementation
The Backend is modular and implements all the modules described in Section 3.2, with differences between Modules 1–3 (AI) and Modules 4–6 (Feedback, Setup, and Statistics).
Modules 1–3 implement the AI components. To implement the LLM Protocol, we mainly used as a backbone the Gemini versions 1.0, 1.5 Flash, 1.5 Pro, 2.0 Flash and 2.0 Pro [74,75,76] from Google Vertex AI [77], but we also tested ChatGPT and Claude Sonnet [78,79].
For the LLMRAG, we used a Chroma DB implementation, but Microsoft Azure AI Search [79] or Vertex AI Search [80] could be substituted based on preference.
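For illustration, a Chroma-backed retriever in the spirit of RAGChroma could be sketched as follows, using the chromadb client and the multilingual sentence-transformers model mentioned in Section 4.3.3; the class layout, collection name, and chunking upstream of it are assumptions, not the platform’s exact implementation.

```python
import chromadb
from sentence_transformers import SentenceTransformer


class RAGChroma:
    """Stores course chunks in a Chroma collection and retrieves the most similar ones."""

    def __init__(self, chunks: list[str],
                 model_name: str = "distiluse-base-multilingual-cased-v2") -> None:
        self.encoder = SentenceTransformer(model_name)            # multilingual embeddings
        self.collection = chromadb.Client().create_collection("course_chunks")
        self.collection.add(
            ids=[str(i) for i in range(len(chunks))],
            documents=chunks,
            embeddings=self.encoder.encode(chunks).tolist(),
        )

    def retrieve_context(self, question: str, k: int = 4) -> list[str]:
        query_emb = self.encoder.encode([question]).tolist()
        result = self.collection.query(query_embeddings=query_emb, n_results=k)
        return result["documents"][0]                              # top-k most similar chunks
```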
The implementation of the evaluator protocol was also a custom-made class, similar to LLM-as-a-judge [81], using Gemini 2.0 as a backbone.
Modules 4–6 (Feedback, Setup, and Statistics) were implemented in Python, using an SQLite database for storing the course content, exam questions, student list, grades, and feedback. Statistics graphs were generated using the plotly [74] and pandas [75] packages. Feedback was implemented with Google Forms, although any other similar option can be integrated.
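As a simple illustration of how Module 6 statistics can be produced with pandas and plotly, consider the sketch below; the column names and grade values are placeholders for illustration only, not results from the pilot.

```python
import pandas as pd
import plotly.express as px

# Placeholder year-on-year data (illustrative values, not pilot results).
grades = pd.DataFrame({
    "year": ["2023/24"] * 3 + ["2024/25"] * 3,
    "course": ["Databases", "PL/SQL", "Java"] * 2,
    "average_grade": [7.1, 6.8, 7.4, 7.6, 7.2, 7.9],
})

fig = px.bar(grades, x="course", y="average_grade", color="year", barmode="group",
             title="Average final grade per course, year on year")
fig.show()  # or st.plotly_chart(fig) inside the Streamlit Frontend
```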
4. Experiments and Results
4.1. Description of the Experiments
We performed two types of experiments: (a) in the first set of experiments, a cohort of students enrolled in four pilot courses (see Section 4.1.1) and instructors (see Acknowledgements) assessed the quality of the platform as an AI assistant and evaluator on a range of criteria described in Section 4.2; (b) in the second set of experiments, we evaluated the correctness and faithfulness of the AI components’ answers on a set of classic LLM metrics (see Section 4.3).
The statistics extracted from the results of the cohort of students will constitute a third experiment, which will be reported in a new paper at the end of the courses.
4.1.1. Pilot Courses Evaluated on Our Solution
For our study, we used the four courses presented in Table 3.
These courses are part of the Undergraduate program in Automatics and Applied Informatics offered by the Faculty of Engineering of the Constantin Brancusi University (CBU) of Targu-Jiu, located in Romania.
The first two courses are linked, as the first covers introductory SQL topics in the field of database design and administration, and the second course extends to techniques for designing applications that process databases in PL/SQL. In CBU, these are the students’ first introduction to databases. In the third course, students learn the Java programming language, and in the fourth course, students focus on applied programming techniques for software development.
Throughout these courses, students receive both theoretical and practical learning materials and participate in practical laboratory activities where they work on hands-on tasks.
Our research presents the usability and reliability of the proposed AI framework when applied to these courses.
4.1.2. Sample
The cohorts of students enrolled in the pilot courses in the new academic year and their distribution are summarized in Table 4 below.
Students who did not pass these courses in previous years were not included in this research, ensuring that the students in the sample had no prior knowledge of the subject.
4.1.3. Description of the Classical Teaching Process
To establish a baseline, we will describe the traditional teaching process for the courses involved in this study.
Each course consisted of weekly lectures and practical laboratory sessions, both lasting 2 hours over a 14-week period. After each lecture, in which theoretical notions were presented with useful examples, students participated in practical laboratory activities. During these sessions in the laboratory, they individually executed the code sequences demonstrated in the lecture and then solved additional tasks based on the presented concepts.
After 7 weeks, students took a 60 min mid-term assessment, with the goal of keeping them motivated on continuous learning and to be able to identify those struggling at an early stage.
The mid-term assessment had two parts: (a) 15 min for 10 single-choice questions; (b) 45 min allotted for 2 coding exercises directly on the computer.
The final exam, at the end of the 14 weeks of the course, took 120 min and was structured similarly: (a) 40 min for 20 single-choice questions; (b) 80 min for 2 coding exercises.
Following the final exam, students were asked to fill in a questionnaire on how they used the teaching materials and to evaluate the teaching assistant. In this way, the contribution of traditional teaching resources, such as textbooks and problem books, both in print and on the web, could be highlighted.
It is worth mentioning that during this final assessment, students were monitored by the lecturer and the teaching assistant and advised to use only the materials previously presented in the lectures and in the laboratory activities. The use of messaging applications and AI tools, such as ChatGPT, was not allowed during the classical assessment.
4.1.4. Description of the AI-Enhanced Teaching Process (Pilot)
In this pilot experiment, the AI framework was added as an additional support to the existing classical teaching process. So, in addition to the lectures and laboratory sessions, the students had access to our platform. They could use it to reread the course, ask questions to the AI assistant (Module 1), and prepare for the evaluations in Module 2—Practice for Evaluation. The mid-term and final evaluations were taken and graded in Module 3—Final Evaluation, and the student feedback was collected in Module 4—Feedback.
4.2. Results: Evaluation of the Platform by Instructors and Students
The experiment was launched this year with the students enrolled in the four pilot courses described in Section 4.1.1 and a cohort of seven high school and ten university teachers, to assess the quality of the proposed learning platform. We present a summary of the assessments of the students and instructors involved.
The evaluation criteria were derived by synthesizing the most prevalent qualitative metrics relevant to our application from the existing literature on computer- and AI-assisted educational technologies [16,17,23,41]. These criteria facilitate an assessment of perceived quality from both student and teacher perspectives regarding the responses provided by the AI assistant, as well as the overall usability and effectiveness of the educational application.
4.2.1. Perceived Advantages and Disadvantages for Instructors
In the first phase, the framework and the web application were evaluated by the instructors mentioned in the Acknowledgements section. They assessed the application based on the criteria listed in Table 5 and were also asked to provide open-ended feedback.
Below is the prevalent free-form feedback recorded from the teachers:
- This application greatly simplifies the migration of their existing course material to an online/AI-enhanced application, an obstacle which was, in their opinion, insurmountable before being presented with this framework.
- The ability to deploy the application on a university server or cloud account avoids many of the issues related to student confidentiality.
- They appreciated the reduction in time spent on simple questions and grading, which permits them to focus on more difficult issues.
4.2.2. Perceived Advantages for the Students
We used the feedback form to obtain initial student feedback to the questions in Table 6.
Additionally, we extracted the following free-form feedback:
- Students consider a major benefit of this platform to be that they can ask any question they might hesitate to ask during class (so-called “stupid questions”) while having the same confidence in the answer as if they were asking a real teacher.
- They appreciate that each answer highlights the relevant sections in the text, which increases their confidence in the AI assistant’s answer.
- They appreciate that the application can be used on mobile phones, for example, during their commutes or short breaks.
4.3. Results: Testing of the AI Components of the Platform
To test the performance of the AI modules, we used a dataset composed of the following (Table 7):
- A total of 16 single-choice questions from previous exams.
- A total of 40 free-answer questions:
  - 16 questions from previous exams (Manual Test 1 and Test 2), the same as the single-choice ones above, but with the possible answers removed so the AI had to answer in free form;
  - 24 questions generated with o3-mini-high at low, medium, and high difficulty settings.
For single-choice answers, 100% answer correctness was obtained if the context was properly extracted by the RAG, so the rest of the analysis focused on more difficult free-answer questions.
We present the results on the AI assistant in Section 4.3.1 and on the AI evaluator in Section 4.3.2; the common results are presented in Section 4.3.3 and summarized AI results in Section 4.3.4.
4.3.1. AI Assistant (Module 1) Assessment
The assistant was graded both manually and using Ragas [81,82], a specialized library for evaluation of RAG-type specialized assistants.
Manual tests. For the manual tests, we evaluated only the final answer, with two human experts who were both familiar with the course material. We evaluated a single metric “answer_correctness” in a binary mode (correct or incorrect). Incomplete answers were labeled as incorrect. Due to inherent subjectivity in interpreting answers, as well as due to human error when handling large sets of data (250 rows), the initial evaluations on the same questions were different in about 5% of the cases (95% consistency). These inconsistencies were discussed, and the agreed answers were considered correct.
Automated tests. The assistant was evaluated automatically against two types of metrics [81,82]:
- Retrieval (Contextual) Metrics, i.e., whether the system “finds” the right information from an external knowledge base before the LLM generates its answer. The metrics used were as follows:
  - Context Precision—measures whether the most relevant text “chunks” are ranked at the top of the retrieved list.
  - Context Recall—evaluates whether all relevant information is retrieved.
- Generation Metrics, i.e., whether the LLM’s answer is not only fluent but also grounded in the retrieved material. The metrics we employed were as follows:
  - Answer Relevancy—how well the generated answer addresses the user’s question and uses the supplied context. It penalizes incomplete answers or unnecessary details.
  - Answer Faithfulness—whether the response is factually based on the retrieved information, minimizing “hallucinations”, estimated either with Ragas or human evaluation.
  - Answer Text Overlap Scores (conventional text metrics BLEU, ROUGE, F1 [82])—compare generated answers against reference answers.
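For reference, the automated scoring with Ragas follows roughly the pattern below; the example row is invented for illustration, and the exact dataset column names and evaluate() configuration depend on the Ragas version used.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_correctness, answer_relevancy,
                           context_precision, context_recall, faithfulness)

# One row per test question: the generated answer, the retrieved chunks and the reference answer.
rows = {
    "question":     ["What does a PRIMARY KEY constraint enforce?"],
    "answer":       ["It enforces uniqueness and non-null values for the key column(s)."],
    "contexts":     [["A PRIMARY KEY uniquely identifies each row; NULL values are not allowed."]],
    "ground_truth": ["A primary key uniquely identifies rows and cannot contain NULL values."],
}

scores = evaluate(
    Dataset.from_dict(rows),
    metrics=[context_precision, context_recall,
             answer_relevancy, faithfulness, answer_correctness],
)
print(scores)  # per-metric averages over the dataset
```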
We compared the results using five different LLM backbones, from Gemini 1.0 to Gemini 2.0 Pro. All LLMs performed well in terms of answer correctness, matching or surpassing the human experts (Table 8).
Furthermore, we present in Figure 6 a split on question difficulty and the question generation (AI or manual).
The analysis of the results (Figure 6) leads to these main conclusions:
- Correctness is very high for all LLMs, with results on par with one expert.
- The answer relevancy results are very promising as well, with most scores above 80% relevancy, in line with what human raters observed in the HELM study [83].
- Context retrieval is very important; results are better when more context is provided, which is expected and natural [82].
- For faithfulness, we extracted two trends: (a) faithfulness is better for higher-difficulty questions; (b) faithfulness increases for newer LLMs, Gemini 2.0 Pro being the best. Gemini 1.0 and 1.5 will sometimes ignore the instruction to answer only from the context.
- Older metrics are not relevant: NLP (non-LLM) metrics like ROUGE, BLEU, and factual correctness are no longer suited for evaluating assistant performance (see Appendix A.1 with full results and [83]). The main explanation is that two answers can correctly explain the same idea and obtain a high answer relevancy while using very different words, which causes BLEU, ROUGE, and factual correctness to be very low.
We provide below a detailed discussion of the correctness, relevancy, faithfulness, and context retrieval in the context of the AI assistant.
Correctness and Relevancy. The correctness of the answers is on par with or better than expert level, and the relevancy is also on par with HELM [83]; this reflects what has become apparent over the last 6 months, namely that LLM solutions are now on par with or better than human experts in most domains.
Faithfulness. The analysis of faithfulness helped us understand an initially puzzling result in the raw data, in which Gemini 1.0 gave better results than Gemini 2.0, although only by a very small margin. After observing the faithfulness graphs, we noted that Gemini 1.0 and 1.5 generations of models were not as faithful as expected, the main reason being that they did not respect the instruction given in the prompt “Please do not answer if the answer cannot be deduced from the context received”, while Gemini 2.0 was much better at reasoning and respecting instructions. After closer analysis of the cases where Gemini 1.0 and 1.5 answered correctly but Gemini 2.0 did not provide a response, we found that the information was not present in the retrieved context, and Gemini 1.0 and 1.5 were responding from their own knowledge, without respecting the prompt to answer “only from the provided context”. Thus, the actual “correct” response in the given context was provided by Gemini 2.0.
We retested and confirmed this hypothesis by adding a set of questions which were not related to the given class document. While the RAG extracted no context, Gemini 1.0 and 1.5 gave answers to more than 60% of the questions which should have not been answered, while Gemini 2.0 correctly responded that “I am not able to answer in the context of this class”. We removed such cases from the rest of the analysis, but these cases are saved and can be found in the raw data (link in Appendix A).
Context retrieval. Context retrieval is very important for RAG LLMs, so we dedicate a short discussion to this result. Good retrieval makes the use of course-specific contexts possible, and it reduces costs because only the relevant context is included in the prompt sent to the LLM.
We tested this with two splitting methods: one with chunks limited to 3000 tokens, and one with pages, usually limited to around 500 tokens.
In our cases, the extraction of context with chunks of 3000 tokens with an overlap value of 300 tokens was always superior to that with pages. This outcome was most likely because we have six times more context and because sometimes ideas are split on two or more pages.
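A token-window splitter along the lines below reproduces this chunking scheme; here token counts are approximated by whitespace-separated words, which is a simplification of the tokenizer actually used.

```python
def split_into_chunks(text: str, chunk_size: int = 3000, overlap: int = 300) -> list[str]:
    """Split text into overlapping chunks; token counting is approximated by words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, max(len(words), 1), step)]


# Example: chunking one course chapter before indexing it in the vector store,
# e.g., chunks = split_into_chunks(chapter_text), then passed to the retriever.
```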
The results in Figure 7 show the following: (a) the correct answers of the system drop by 5–7%, with a higher drop on more difficult questions; (b) the metric “context recall” measured by ragas drops drastically with smaller context.
We are aware that our RAG framework has room for improvement, and we will continue to update it in the future. By improving it, the results will be more satisfactory, and the costs will be reduced, as less but more relevant context is sent to the LLM.
4.3.2. AI Evaluator Assessment (Module 2 and 3)
To assess our AI evaluator, similar to [84] and following the best practices mentioned in [85], we used the same 16 single-choice and 40 free-answer questions to which we added reference answers (ground truth) and student answers. With this setup built, we performed manual and automatic evaluation.
The results of the evaluator grading are as follows:
In manual mode (using two human experts):
- Evaluator grading of free-form questions: correctness 90%.
- Evaluator grading of single-choice questions: correctness 100%.
- Relevancy of the evaluator’s suggestions for wrong answers: relevance 99%.
In automatic mode (using ChatGPT o1 as a judge):
- Comprehensiveness (whether the response covers all the key aspects that the reference material would demand): 75%.
- Readability (whether the answer is well-organized and clearly written): 90%.
4.3.3. Common Benchmarks
Certain considerations, such as stability and the language effect, apply to both the assistant and the evaluator, so we present them together here, separately from the module-specific results.
Stability and faithfulness. The platform was configured and empirically adjusted to optimize user experience in the context of AI-enhanced learning: (a) answer quality—it is instructed to respond “I am not able to answer this question in the class context” if it is unsure about its answer; (b) stability and consistency—we set the LLM temperature to 0 and steps to 0 to reduce the variability in the LLM answer [86].
We evaluated the stability by rerunning the same 10 questions 10 times. As all answers were equivalent, we estimated the instability to be below 1%.
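As an illustration of this deterministic setup and the stability re-run, a Vertex AI sketch could look as follows; the project, model name, and test question are placeholders, and the “steps” setting mentioned above is platform-specific and not shown here.

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")   # placeholders
model = GenerativeModel("gemini-2.0-flash")                          # placeholder model name

config = GenerationConfig(temperature=0.0)    # temperature 0 for reproducible answers

question = "Explain the difference between DELETE and TRUNCATE."     # placeholder question
answers = [model.generate_content(question, generation_config=config).text
           for _ in range(10)]

# All 10 answers should be (semantically) equivalent; flag any divergence for review.
print(len(set(answers)), "distinct answers out of 10 runs")
```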
Translation Effect. We observed a small effect when changing the language (our first test was on a class taught in Romanian). Still, this effect was almost nonexistent for newer LLM backbones (almost unmeasurable for Gemini 2.0), as these backbones have improved their multilanguage abilities [87].
However, there are two important effects on the RAG component: (a) First, you need an embedding model which is multilanguage. We used distiluse-base-multilingual-cased-v2 embeddings [88] to accommodate content in all languages. (b) Second, if the course documentation is in English, and the question is in French, the vector store would be unable to retrieve any relevant context.
To address this, we have a few options which can be implemented: (1) require that the questions are in a fixed language (usually the same language as the course documentation), configured by default; (2) translate each question to the language the class documentation is in; (3) have all the class documentation in the database translated in a few common languages at the setup phase.
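Option (2), for instance, can be implemented by letting the LLM itself translate the question before retrieval, as in the hypothetical helper below (it reuses the LLMProtocol sketch from Section 3.4.2; the prompt and default language are illustrative).

```python
def translate_question(llm, question: str, target_language: str = "Romanian") -> str:
    """Option (2): translate the student's question into the course language before retrieval."""
    prompt = (f"Translate the following question to {target_language}. "
              f"Return only the translation.\n\n{question}")
    return llm.answer(prompt)

# The translated question is then passed to LLM_RAG.ask(), so the vector store
# is queried in the same language as the indexed course documentation.
```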
4.3.4. Summary of LLM Results
We obtained 100% correct answers on single-choice questions and 95–100% correct answers on free-form results, which surpass human-expert level. We observed that the performance is strongly influenced (>10%) by the context retrieval performance.
Based on the reported results of the application performance, we consider that this application can be used in the current state for high school and university level.
Going forward, we can focus our improvements on three directions: (1) improve RAG performance to ensure that the LLM receives all relevant context for the questions; (2) reduce LLMs costs by providing only the relevant context; (3) upgrade to better-performing LLMs.
5. Discussion
From our review study, we found that AI assistants and AI evaluators are a useful and needed addition to the classical teaching methods (See Section 2).
The implementation’s accuracy/faithfulness of our proposed solution (See Section 3) was more than satisfactory with current LLMs (See Section 4.3), and its usefulness and ease of use was evaluated as excellent by both instructors and students (See Section 4.2).
While the introduction of this framework as an extension to classical courses seems both beneficial and needed, we still have to consider the obstacles to adoption (See Section 2.5), mainly the technical adoption barrier, costs, competing solutions, and legal bureaucracy.
5.1. Technical Adoption Barrier
This application was designed to be very easy to use and adapted for any non-technical user, in particular because its deployment is a “one-click” process, and its UI is designed with intuitiveness in mind. The feedback of the instructors and students (See Section 4.2) confirmed that we achieved this goal and that the application is easily accessible for even the least technical person.
5.2. Cost Analysis
This app is open-source and free to use by any university. Still, there are two main costs: for hosting the platform and for LLM usage.
As shown in Section 3.5, the app requires a system with at least 2 GB of memory. This can be found in almost any university and is usually offered in the free tier for most cloud providers.
To evaluate the LLM costs, we consider a standard STEM course with 15 chapters of lectures, 3 evaluations for each one, and roughly 50 students who might ask 10 questions per chapter, adding up to ~10 k questions. To answer each question, the RAG augments the question (originally around 50 tokens) to around 1000 tokens, and the answer is around 50 tokens long, resulting in ~10 M input tokens and ~0.5 M output tokens.
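The estimate can be reproduced with a short back-of-the-envelope calculation; the per-million-token prices below are illustrative placeholders rather than the values used for Table 9.

```python
# Usage assumptions taken from the text above.
students, chapters, questions_per_chapter = 50, 15, 10
questions = students * chapters * questions_per_chapter        # ~7.5 k, rounded up to ~10 k

input_tokens_per_q, output_tokens_per_q = 1000, 50             # after RAG augmentation
total_input = questions * input_tokens_per_q                   # roughly 10 M input tokens
total_output = questions * output_tokens_per_q                 # roughly 0.5 M output tokens

# Illustrative prices in USD per million tokens (placeholders; check provider pricing).
price_in, price_out = 0.10, 0.40
cost = total_input / 1e6 * price_in + total_output / 1e6 * price_out
print(f"Estimated LLM cost per course: ~${cost:.2f}")          # on the order of USD 1
```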
In Table 9, we present a detailed comparison table of the LLMs’ related cost computed for the main models on the market for the above case (1 course/15 chapters).
We observed that one of the best solutions we tested (Gemini 2.0) gave a cost per course of only USD 1. While this might still be a barrier in some demographics, these costs are dropping exponentially, with the cost per token halving roughly every 6 months [84]. Furthermore, we plan to establish a collaboration which will sponsor at least some of these costs.
5.3. Competing Solutions
We investigated whether this solution can bring any benefit with respect to the existing solutions. As a solution providing online courses enhanced with an AI assistant to students, our open educational platform can be seen as both an extension of and complement to Massive Open Online Courses (MOOCs). In Table 10, we compare learning platforms which implement some form of computer- or AI-assisted teaching, ranging from purely academic MOOCs (Stanford Online, edX) to professional-training platforms (Coursera, Udemy).
While not a MOOC in the traditional sense, our platform fills a specific gap in the current educational-technology landscape. Whereas platforms like Coursera, edX, or Udemy focus on centralized course delivery, our framework is decentralized and provides advantages such as low cost, ease of setup and use, and the integration of an AI assistant/evaluator. This space of low-cost, open-source, AI educational solutions which our framework is targeting is practically not addressed by any of the existing applications, which makes us think that the launch of our platform is both needed and beneficial.
5.4. Legal and Governance Issues
There are still gaps in legislation and policies related to the usage of AI in student education and the confidentiality of student data. These gaps are being addressed in different countries in recent years and should be reduced progressively in the near future. Still, we consider that most of these impediments are avoided in our application because the students are already enrolled in the high school or university courses.
Therefore, we think our application has an advantage over all existing educational platforms, whether in cost, technical adoption, policy barriers or possible reach. We estimate that this framework can reach more than 90% of the world’s students and instructors, including demographics otherwise unreachable by existing solutions.
As a next step, we propose to give as much exposure as possible to our proposed application, most probably in the form of a collaboration with public and private institutions, to make it available for free in any high school and university. Results obtained at the end of the pilot phase will help us better quantify the effect on student results (improvement in grades, time spent learning, etc.) and will contribute to the adoption of the platform.
6. Conclusions
We evaluated the current status of computer-aided methods in education, including AI approaches. The AI methods offer significant benefits, but there are major barriers to their adoption related to costs and technical literacy of the instructors.
To address these challenges, we created an easy-to-use AI framework, composed of an AI assistant and AI evaluator. This platform enables instructors to migrate existing courses with a simple drag-and-drop operation, effectively overcoming the “technical literacy” barrier. It provides a wide range of advantages (See Section 4.2) such as near 100% accuracy (See Section 4.3), high consistency, low costs (estimated at USD 1/year/class), and fewer policy barriers as it is an open-source solution which can be fully controlled by the educational institution. From the student perspective, it has significant advantages such as 24/7 availability enabling a flexible learning schedule, mobile device accessibility, increased answer accuracy and consistency, and a lowered teacher–student barrier.
Our solution compares positively with all existing solutions. The combination of the AI-enhanced learning experience, low-cost maintenance, open-source licensing, and excellent performance makes us strongly believe that this application can see widespread adoption in the coming years, contributing significantly to the democratization of the educational system.
Conceptualization, A.R. and A.B. (Adrian Balan); methodology, A.R. and A.B. (Adrian Balan); software, A.R., A.B. (Adrian Balan), and L.G.; validation, A.R., A.B. (Adrian Balan), and L.G.; formal analysis, A.R. and A.B. (Adrian Balan); investigation, A.R., A.B. (Adrian Balan), M.-M.N., C.C., and L.G.; resources, A.R., A.B. (Adrian Balan), and L.G.; data curation, A.R., A.B. (Adrian Balan), and L.G.; writing—original draft preparation, A.R., A.B. (Adrian Balan), I.B., L.G., and A.B. (Aniela Balacescu); writing—review and editing, A.R., A.B. (Adrian Balan), I.B., L.G., M.-M.N., C.C., and A.B. (Aniela Balacescu); visualization, A.R., A.B. (Adrian Balan), M.-M.N., C.C., and L.G.; supervision, A.R., A.B. (Adrian Balan), I.B., L.G., and A.B. (Aniela Balacescu); project administration, A.R., A.B. (Adrian Balan), and L.G.; funding acquisition, A.R., A.B. (Adrian Balan), I.B., L.G., M.-M.N., C.C., and A.B. (Aniela Balacescu). All authors have read and agreed to the published version of the manuscript.
Not applicable.
Informed consent was obtained from all subjects involved in the study.
The code used in this paper is found at:
We thank the following instructors for their contribution in the evaluation of the framework: A.B. (Adrian Balan), A.R., L.G., A.B. (Aniela Balacescu), I.B., M.-A.R., L.-G.L., F.G., A.L., G.G., M.I., M.-M.N., R.S., M.R., A.I., A.L. We thank the students involved in the 4 pilots described in
Author Laviniu Gavanescu was employed by the company Laviniu Gavanescu PFA. Author Adrian Balan was employed by the company AI Research Laboratory. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The following abbreviations are used in this manuscript:
| Abbreviation | Meaning |
|---|---|
| AI | Artificial intelligence |
| SQL | Structured Query Language |
| LLM | Large language model |
| CAME | Computer-Assisted Methods in Education |
| CAI | Computer-Assisted Instruction |
| CEI | Computer-Enhanced Instruction |
| VR | Virtual Reality |
| AR | Augmented Reality |
| LMS | Learning Management System |
| NLP | Natural Language Processing |
| ChatGPT | Chat Generative Pre-Trained Transformer developed by OpenAI |
| Gemini | Generative artificial intelligence chatbot developed by Google |
| Claude | a family of large language models developed by Anthropic |
| LLAMA | a family of large language models (LLMs) released by Meta AI |
| LLMProtocol | Large Language Model Protocol |
| RAG | Retrieval Augmented Generation |
| RAGProtocol | Retrieval Augmented Generation Protocol |
| LLMGemini | Large Language Model Gemini |
| Claude Sonnet | Claude 3.5 Sonnet |
| VertexAI | Vertex AI Platform |
| ECTS | European Credits Transfer System |
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1 Chatbot workflow diagram. (a) General NLP knowledge base retrieval + AI. (b) Simplified AI-only system.
Figure 2 Application user interface: left—navigation, middle—content, right—AI assistant.
Figure 3 Technical diagram of components and flow of our application.
Figure 4 Diagram of the RAG interfaces.
Figure 5 Flow of a question through the system.
Figure 6 Answer correctness (top), relevancy (middle), and faithfulness (bottom) split by question difficulty and the question generation mode (AI or manual) for all LLM Backends.
Figure 7 Comparison of the two splitting methods (chunks and pages) for the context recall (left) and answer correctness (right) metrics.
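For readers who prefer code to diagrams, the sketch below illustrates the generic question flow depicted in Figures 1, 4 and 5: retrieve the most relevant course fragments, assemble a grounded prompt, and query an LLM backend. It is a minimal illustration only; the class, function, and model names are placeholders and do not reflect the platform’s actual RAG implementation (the toy lexical retriever stands in for the embedding-based retrieval shown in Figure 4).

```python
# Illustrative sketch of the question flow in Figures 1, 4 and 5.
# All names are placeholders, not the platform's actual API.
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str   # e.g. "DB_course.pdf, page 12"
    text: str

def retrieve(question: str, index: list[Chunk], top_k: int = 4) -> list[Chunk]:
    """Toy lexical retriever: rank chunks by word overlap with the question.
    A production system would use an embedding index instead."""
    q_words = set(question.lower().split())
    ranked = sorted(index, key=lambda c: -len(q_words & set(c.text.lower().split())))
    return ranked[:top_k]

def build_prompt(question: str, context: list[Chunk]) -> str:
    """Ground the LLM in the retrieved course material only."""
    ctx = "\n\n".join(f"[{c.source}]\n{c.text}" for c in context)
    return (
        "Answer the student's question using only the course material below.\n\n"
        f"{ctx}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question: str, index: list[Chunk], llm) -> str:
    """`llm` is any callable mapping a prompt string to a completion string,
    e.g. a thin wrapper around a Gemini endpoint."""
    return llm(build_prompt(question, retrieve(question, index)))
```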
Chatbot classification.
| Classification | Type | Description | Example |
|---|---|---|---|
| Knowledge Domain | Domain-Specific | Specialized in a specific field or industry. | Python-Bot, Babylon Health |
|  | General-Purpose | Broader functionality across multiple domains. | Siri, Alexa |
| Service Provided | Informational | Provides facts, updates, and general information. | FIT-EBot |
|  | Transactional | Handles tasks like booking, purchasing, or transactions. | Erica by BOA |
|  | Support/Assistant | Assists with troubleshooting or performing complex tasks. | IT support bots |
| Goals | Task-Oriented | Designed to complete specific user-defined tasks. | BlazeSQL AI |
|  | Conversational | Focused on engaging in natural dialogue with users. | ChatGPT |
|  | Learning/Training | Educates or trains users in specific domains. | Percy |
| Input/Response | Rule-Based | Operates on pre-written scripts and decision logic. | ELIZA |
|  | AI-Powered | Leverages AI and NLP to provide dynamic, context-aware responses. | OpenSQL.ai |
|  | Hybrid | Combines rule-based and AI features for versatile interaction. | Many enterprise chatbots |
Most common educational chatbots.
| Chatbot Name | Purpose | Features |
|---|---|---|
| Python-Bot [ | Python-Bot is a learning assistant chatbot designed to assist novice programmers in understanding the Python programming language. | User-Friendly Interface: Offers an intuitive platform for students to interact with, making the learning process more accessible. |
|  |  | Educational Focus: Provides detailed explanations of Python concepts, accompanied by practical examples to reinforce learning. |
|  |  | Interactive Learning: Engages users in a conversational manner, allowing them to ask questions and receive immediate feedback. |
| EduChat [ | EduChat is an LLM-based chatbot system designed to support personalized, fair, and compassionate intelligent education for teachers, students, and parents. | Educational Functions: Enhances capabilities such as open question answering, essay assessment, Socratic teaching, and emotional support. |
|  |  | Domain-Specific Knowledge: Pre-trained on educational formats to provide accurate and relevant information. |
|  |  | Tool Integration: Fine-tuned to utilize various tools, enhancing its educational support capabilities. |
| GPTeens [ | GPTeens is an AI-based chatbot developed to provide educational materials aligned with school curricula for teenage users. | Interactive Format: Utilizes natural language processing to support conversational interactions with learners. |
|  |  | Age-Appropriate Design: Delivers responses suitable for teenage users. |
|  |  | Curriculum Integration: Trained on educational materials aligned with the South Korean national curriculum. |
| EduBot [ | Curriculum-Driven EduBot is a framework for developing language learning chatbots for assisting students. | Topic Extraction: Extracts pertinent topics from textbooks to generate dialogues related to these topics. |
|  |  | Conversational Data Synthesis: Uses large language models to generate dialogues, which are then used to fine-tune the chatbot. |
|  |  | User Adaptation: Adapts its dialogue to match the user’s proficiency level, providing personalized conversation practice. |
| BlazeSQL [ | BlazeSQL is designed to transform natural language questions into SQL queries, enabling users to extract data insights from their databases without extensive SQL knowledge. | AI-Driven Query Generation: Utilizes advanced AI technology to comprehend database schemas and generate SQL queries based on user input in plain English. |
|  |  | Broad Database Compatibility: Supports various databases, including MySQL, PostgreSQL, SQLite, Microsoft SQL Server, and Snowflake. |
|  |  | Privacy-Focused Operation: Operates locally on the user’s desktop, ensuring that sensitive data remain on the user’s machine. |
|  |  | No-Code Data Visualization: Generates dashboards and visualizations directly from query results, simplifying data presentation. |
| OpenSQL.ai [ | OpenSQL.ai aims to simplify the process of SQL query generation by allowing users to interact with their databases through conversational language. | Text-to-SQL Conversion: Transforms user questions posed in plain English into precise SQL code, facilitating data retrieval without manual query writing. |
|  |  | User-Friendly Interface: Designed for both technical and non-technical users, making database interactions more accessible. |
|  |  | Efficiency Enhancement: Streamlines data tasks by reducing the need for complex SQL coding, thereby increasing productivity. |
| AskYourDatabase [ | AskYourDatabase is an AI-powered platform that enables users to interact with their SQL and NoSQL databases using natural language inputs, simplifying data querying and analysis. | Natural Language Interaction: Allows users to query, visualize, manage, and analyze data by asking questions in plain language, eliminating the need for SQL expertise. |
|  |  | Data Visualization: Instantly converts complex data into clear, engaging visuals without requiring coding skills. |
|  |  | Broad Database Support: Compatible with popular databases such as MySQL, PostgreSQL, MongoDB, and SQL Server. |
|  |  | Self-Learning Capability: The AI learns from data and user feedback, improving its performance over time. |
|  |  | Access Control and Embeddability: Offers fine-grained user-level access control and can be embedded as a widget on websites. |
Course description.
| Course | Semester | ECTS | Hours | Year |
|---|---|---|---|---|
| Databases (DBs) | 4th | 4 | 56 | 2020–2025 |
| Database Programming Techniques (DBPTs) | 5th | 5 | 56 | 2020–2025 |
| Object-Oriented Programming (OOP) | 3rd | 6 | 70 | 2020–2025 |
| Designing Algorithms (DAs) | 2nd | 4 | 56 | 2021–2024 |
Pilot courses evaluated on our application.
| Course | Students | %Male | %Female |
|---|---|---|---|
| Databases (DBs) | 37 | 89% | 11% |
| Database Programming Techniques (DBPTs) | 36 | 91% | 8% |
| Object-Oriented Programming (OOP) | 37 | 89% | 11% |
| Designing Algorithms (DAs) | 49 | 83% | 16% |
Teacher feedback, grading 1 min–5 max.
| Criteria | Question | Eval |
|---|---|---|
| 1. Ease of Application Setup | How would you rate the ease of setting up the application, including adding courses, creating exam questions, and generating exam questions automatically? | 5/5 |
| 2. Chatbot Answer Quality | How do you rate the accuracy and relevance of the chatbot’s responses to questions in Module 1? | 4.8/5 |
| 3. Evaluator’s Judgment Accuracy | How do you rate the quality and fairness of the evaluator’s assessment of student answers? | 4.1/5 |
| 4. Evaluator Hints for Wrong Answer | How useful are the evaluator’s hints when a student selects multiple answers in a question? | 3.2/5 |
Student feedback.
| Criteria | Question | Eval |
|---|---|---|
| 1. Usability | How would you rate the ease of using and accessing Modules 1/2/3? | 4.2/5 |
| 2. Chatbot Answer Clarity | How easy to understand are the chatbot’s answers and suggestions? | 4.9/5 |
| 3. Chatbot Answer Usefulness | How often does the chatbot fully answer your question? | 5/5 |
| 4. Bugs | How often was the application unresponsive, or how often did it crash? | 2.9/5 |
Test data.
| Source | Difficulty | No. of Questions | Type |
|---|---|---|---|
| Manual Test1 (2023 Exam) | 1 | 8 | Single-choice |
| Manual Test2 (2023 Exam) | 1 | 8 | Single-choice |
| O3-mini-high | 1 | 6 | Free-answer |
| O3-mini-high | 2 | 6 | Free-answer |
| O3-mini-high | 3 | 6 | Free-answer |
Results: overall correctness for the five backbones.
| LLM | Gemini-1.0-Pro | Gemini-1.5-Flash | Gemini-1.5-Pro | Gemini-2.0-Flash | Gemini-2.0-Pro |
|---|---|---|---|---|---|
| Correct % | 95.0% | 97.5% | 97.5% | 100.0% | 97.5% |
Estimated costs for existing LLMs for 1 course (10M input tokens, 45 K output tokens).
| Model Variant | Cost per 1M Input Tokens | Cost per 1M Output Tokens | Cost for 10 M Input Tokens | Cost for 45 K Output Tokens | Total Estimated Cost |
|---|---|---|---|---|---|
| Gemini 1.5 Flash | USD 0.15 | USD 0.60 | USD 1.50 | USD 0.03 | USD 1.53 |
| Gemini 1.5 Pro | USD 2.50 | USD 10.00 | USD 25.00 | USD 0.45 | USD 25.45 |
| Gemini 2.0 Flash | USD 0.10 | USD 0.40 | USD 1.00 | USD 0.02 | USD 1.02 |
| Claude 3.5 Sonnet | USD 3.00 | USD 15.00 | USD 30.00 | USD 0.68 | USD 30.68 |
| ChatGPT-4o | USD 2.50 | USD 20.00 | USD 25.00 | USD 0.90 | USD 25.90 |
| DeepSeek (V3) | USD 0.14 | USD 0.28 | USD 1.40 | USD 0.01 | USD 1.41 |
| Mistral (NeMo) | USD 0.15 | USD 0.15 | USD 1.50 | USD 0.01 | USD 1.51 |
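The totals above follow directly from the listed per-million-token rates and the assumed usage of 10 M input and 45 K output tokens per course. The short sketch below reproduces that arithmetic for a subset of the models; the rates are those quoted in the table and will drift as vendor pricing changes, and cent-level differences from the table are possible due to rounding.

```python
# Reproduce the per-course cost estimates in the table above:
#   cost = input_rate * (input_tokens / 1e6) + output_rate * (output_tokens / 1e6)
# Rates are USD per 1M tokens, as quoted in the table.
RATES = {
    "Gemini 1.5 Flash":  (0.15, 0.60),
    "Gemini 2.0 Flash":  (0.10, 0.40),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "DeepSeek (V3)":     (0.14, 0.28),
}

INPUT_TOKENS, OUTPUT_TOKENS = 10_000_000, 45_000  # assumed volume for one course

for model, (in_rate, out_rate) in RATES.items():
    total = in_rate * INPUT_TOKENS / 1e6 + out_rate * OUTPUT_TOKENS / 1e6
    print(f"{model}: USD {total:.2f} per course")
```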
Comparison to existing solutions.
| Criteria | Current App | Coursera [ | Stanford [ | Udemy [ | edX [ |
|---|---|---|---|---|---|
| Cost for Student | Free | Approx. EUR 40 per course per month | Tuition-based, varies by program | Varies (EUR 10–200 per course) | Free courses, paid certificates |
| Cost for Teacher/University | Free (only LLM token costs, under USD 1/year per class) | Free for universities; revenue-sharing for instructors | Salary-based | Instructors can set prices or offer courses for free | Free for universities |
| Ease of Use | High | High | High | High | High |
| Video Content | Real classes available | Yes | Yes | Yes | Yes |
| AI Assistant | Yes | Yes | No | No | No |
| AI Evaluator | Yes | Yes | No | No | Yes |
| Possible Reach (Students) | 1 billion | Approx. 148 million registered learners | Limited to enrolled students (approx. 20 k/year) | Over 57 million learners | Over 110 million learners |
| Possible Reach (Teachers) | 97 million total | Thousands of instructors | Limited to faculty members | Open to anyone interested in teaching | University professors |
Appendix A
Appendix A.1
Full results from the Module 1 evaluation of chunks and pages RAG strategy.
| LLM Backend | Correctness | Context Recall | Faithfulness | Answer Relevancy | BLEU Score | ROUGE Score | Factual Correctness |
|---|---|---|---|---|---|---|---|
| chunks |  |  |  |  |  |  |  |
| gemini-1.0-pro | 95.00% | 81.25% | 71.85% | 86.39% | 8.34% | 22.99% | 56.00% |
| gemini-1.5-flash | 97.50% | 82.50% | 74.13% | 87.09% | 14.85% | 31.00% | 57.11% |
| gemini-1.5-pro | 97.50% | 83.75% | 74.38% | 83.31% | 13.40% | 35.69% | 43.77% |
| gemini-2.0-flash | 100.00% | 77.50% | 82.92% | 84.82% | 15.08% | 35.93% | 46.85% |
| gemini-2.0-pro | 97.50% | 82.50% | 87.61% | 85.12% | 10.01% | 24.76% | 44.40% |
| pages |  |  |  |  |  |  |  |
| gemini-1.0-pro |  | 47.08% | 74.05% | 45.12% | 3.09% | 32.04% | 21.85% |
| gemini-1.5-flash |  | 47.08% | 57.83% | 74.76% | 7.81% | 20.64% | 47.13% |
| gemini-1.5-pro |  | 47.08% | 55.26% | 83.79% | 5.70% | 16.87% | 39.73% |
| gemini-2.0-flash |  | 47.08% | 88.79% | 69.63% | 4.35% | 17.42% | 33.75% |
| gemini-2.0-pro |  | 47.08% | 87.78% | 78.15% | 3.83% | 14.45% | 39.45% |
| Grand Total | 97.50% | 64.29% | 75.47% | 77.82% | 8.65% | 25.18% | 43.28% |
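The Grand Total row is consistent with an unweighted mean of the per-backend values in each column (the pages split reports no correctness values, so the correctness total reflects only the chunks rows; the factual-correctness total differs slightly from this recomputation, presumably due to rounding in the per-row values). A minimal sketch of that aggregation, shown for the Context Recall column:

```python
# Reproduce the "Grand Total" aggregation of Appendix A.1:
# each total is the unweighted mean of the per-backend values in that column.
# The values below are the Context Recall column copied from the table above.
context_recall = {
    "chunks": [81.25, 82.50, 83.75, 77.50, 82.50],
    "pages":  [47.08, 47.08, 47.08, 47.08, 47.08],
}

all_values = [v for split in context_recall.values() for v in split]
grand_total = sum(all_values) / len(all_values)
print(f"Context recall grand total: {grand_total:.2f}%")  # matches the 64.29% in the table
```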
Appendix A.2
Full data used for tests can be found at:
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).