INTRODUCTION
Students frequently make errors while solving mathematical problems. Rather than simply marking an entire answer as incorrect, teachers should meticulously mark each error within a student's answer to better understand the underlying difficulties. This detailed grading approach enables teachers to quickly and clearly pinpoint specific student errors, allowing for more targeted guidance. In traditional educational settings, education experts typically analyze student errors to help educators better understand these mistakes. With the advent of machine learning, some research efforts have focused on collecting large amounts of student data to identify specific error patterns (Jarvis, Nuzzo-Jones, and Heffernan 2004). This data-driven approach aims to determine the most effective instructional strategies to address students' skill deficits or misunderstandings (Feldman et al. 2018; Haryanti et al. 2019; Priyani and Ekawati 2018). However, existing methods face several significant limitations:
- 1. Time-Consuming and Labor-Intensive: They typically require education experts to provide one-on-one tutoring to students or to build systems that analyze long-term student work to identify common patterns. This process is both challenging and time-consuming (Brown et al. 2016) and remains a significant obstacle for educators.
- 2. Limited Generalizability: Error types identified for individual students are difficult to generalize to others. Each student must undergo the same complex process, making it challenging to apply insights across a broader student population.
- 3. Restricted Applicability: Different subjects and grade levels may require distinct systems and experts, meaning that the identified causes of errors often apply only to the specific data that has been collected.
Recently, large language models (LLMs), such as GPT-4 (OpenAI 2023), have made remarkable strides in mathematical reasoning, especially in solving mathematical word problems (MWPs) (Roy and Roth 2018). These models excel in comprehending complex numerical contexts and multi-step reasoning, with GPT-4 achieving a 97% accuracy rate on the GSM8K dataset (Zhou et al. 2024). However, current research predominantly focuses on problem-solving abilities, including the correctness of answers and the consistency of intermediate reasoning steps. In the EdTech sector (Wang et al. 2024, 2021; Xu et al. 2024), problems obtained from various sources often come with detailed solutions, although the quality of these solutions can vary. Therefore, the accuracy of answers is relatively straightforward to assess. Instead, identifying and correcting errors, a crucial aspect of enhancing educational efficiency, has been underestimated or overlooked. To maximize educational effectiveness, it is essential to accurately identify and address individual student problems. Given the rapid advancements in deep learning, we explore whether the potential of large models in mathematical reasoning can be harnessed to realize the vision of a virtual AI teacher that automatically localizes, analyzes, and corrects student errors.
This innovative approach has the potential to drastically lower educational costs while simultaneously improving teaching efficiency and accessibility. To achieve this, we present the design, implementation, and deployment of a virtual AI teacher focused on student error correction. As illustrated in Figure 1, our system supports two types of inputs. The first type involves learning tasks performed by users on learning machines, such as exams. These tasks typically include correct answers, enabling the model to compare users' responses with the reference answers to determine correctness and subsequently analyze the causes of errors. The second type pertains to photo-based question searches conducted by users. To better support edge devices like smartphones and enhance user convenience, we deploy a lightweight MLLM to provide answers. If the user's photo contains their problem-solving process, we compare the MLLM's response with the user's answer to assess correctness and perform error cause analysis accordingly. In the following sections, we first introduce our method for error cause analysis.
[IMAGE OMITTED. SEE PDF]
As depicted in Figure 2, this system uniquely incorporates student drafts as the primary analysis target, deepening the understanding of each student's learning process. By harnessing the power of large language models and advanced prompt engineering, the system evaluates individual student performance, pinpointing and analyzing the root causes of errors with high accuracy. To optimize cost and efficiency, we maintain an error pool that catalogs historical errors, enabling the system to minimize computational demands when matching student answers to known errors. Furthermore, to enhance student engagement, we developed a real-time, multi-round AI dialogue system that allows students to effectively inquire about the knowledge points related to their problems. To the best of our knowledge, no other AI-based learning system offers such comprehensive feedback.
[IMAGE OMITTED. SEE PDF]
Compared to traditional teachers and machine learning-based error correction systems, our VATE system offers the following advantages:
- 1. Significant Reduction in Educational Costs: Currently, human teacher expenses constitute a large portion of course costs. By partially replacing human instruction with AI-based teaching, costs can be significantly reduced.
- 2. High Scalability: Typically, only teachers with advanced knowledge in a field can provide excellent educational resources. However, the VATE system, leveraging the knowledge capabilities of existing large models, can bypass this requirement and extend to various subjects and grade levels.
- 3. Generalizability: Our VATE system offers unparalleled capabilities in open-world error detection and recommendation, demonstrating superior generalization and practicality.
- 4. Flexible Educational Process: The VATE system allows students to start and stop their learning sessions at any time, ensuring that their educational experience remains uninterrupted by external factors.
Finally, we summarize the contributions of this paper:
- 1. This paper extends our previous work presented at IAAI-25, where we introduced the use of multimodal data, specifically student draft images, for error localization and analysis. In this version, we make two key advancements: (1) we introduce a snap-to-solve module that offloads low-reasoning tasks to edge-deployed LLMs for faster and partially offline processing, and (2) we provide further empirical evidence showing that multimodal large models significantly outperform traditional approaches such as OCR-based draft parsing or answer-only LLM analysis.
- 2. To build the VATE system, we applied many advanced techniques, including complex prompt engineering, dual-stream models for error analysis, and internal knowledge point graph comparisons. This allows the system to interact conveniently with users and provides error localization, error analysis, problem feedback, and flexible dialogue functions.
- 3. In terms of efficiency, we constructed an error pool that stores historical answers and their corresponding errors, avoiding repeated calls to large models for the same input, significantly improving access efficiency and reducing costs.
- 4. The VATE system has been deployed in Squirrel AI's learning machines and has accumulated millions of usage samples. In this paper, we present partial evaluation results, where the error analysis accuracy of the VATE system exceeds 75%. Post-deployment, students' learning efficiency and mastery of knowledge points have noticeably increased. Additionally, a satisfaction survey of sales personnel showed an overall satisfaction rating of over 8 (out of 10), indicating a strong positive impact on product sales.
RELATED WORK
LLM and MLLM for Mathematics. In recent years, both LLMs (Brown et al. 2020; Chiang et al. 2023; Jiang et al. 2023; OpenAI 2023; Touvron et al. 2023) and MLLMs have experienced rapid development. With the emergence of models such as GPTs, Llama, and Vicuna, MLLMs have similarly undergone significant evolution (Fu et al. 2024; Yin et al. 2023; Zhang et al. 2024a, 2024b). Mathematical ability has consistently been a particular challenge in evaluating these large models, with some efforts focusing on addressing mathematical problems through fine-tuning or large-scale training on relevant mathematical data. However, the majority of current work primarily emphasizes the ability to solve problems correctly. For educators, determining the correct answers is often the simplest part, as ground truth answers are typically available before questions are created. The more challenging aspect is analyzing students' errors and providing appropriate feedback.
Error localization and analysis is a long-standing problem in educational technology. Early systems such as BUGGY (Brown and Burton 1978) constructed error databases based on student misconceptions and predicted future errors using rule-based reasoning. Later methods used machine learning to automatically generate error-detection rules, but these approaches struggled to generalize to novel errors not captured by their predefined rule sets. While recent work has broadened the analytic scope through complex module combinations (Feldman et al. 2018), these systems still tend to cover only a limited range of error types.
Multimodal large language models (MLLMs), with their few-shot learning capabilities and rich internal representations, offer new opportunities for open-world error identification and explanation. In our prior work presented at IAAI-25, we explored the use of student draft images and LLM reasoning for error diagnosis. This paper builds upon that foundation by introducing additional features such as edge-deployed snap-to-solve tasks and comprehensive benchmarking experiments to evaluate system accuracy and educational impact. While some benchmarks have been proposed for evaluating mathematical reasoning errors (Li et al. 2024b), they tend to use synthetic, simplified examples and do not reflect the complexity of real-world student submissions.
We attempt to build the VATE system using various engineering tools, leveraging the cognitive capabilities of large models to create an AI tool that can autonomously complete analysis and provide recommendations. To our knowledge, this is the first educational system capable of open-world, autonomous analysis and inference. After practical deployment and extensive user experience evaluations, VATE significantly improved student learning efficiency and received widespread recognition from sales personnel.
USE OF AI TECHNOLOGY
Multi-modal data collection
Using large models to judge errors in mathematical problem-solving steps is an intuitive idea. In the initial phase of our project, we attempted to use LLMs to directly analyze incorrect answers, applying various complex prompt engineering techniques, such as inputting knowledge points, answers, explanations, and student answers. However, LLMs consistently failed to provide reliable error cause analysis. With such limited information, they could only guess where the student's intermediate steps went wrong based on the final answer, leading to a significant margin of error. Therefore, we began requiring students to fully document their calculation steps and upload their draft images to the backend, rather than just submitting the incorrect answers. As shown in Figure 2, when students do not upload their intermediate steps on the draft, we ask them to redo the problem. Since implementing this policy, the proportion of students submitting detailed problem-solving processes has increased from 5% four months ago to approximately 60%. Enforcing the standard of submitting drafts not only improves students' focus while solving problems but also provides us with more options for subsequent processing. To date, we have collected over 24 million student drafts, a resource of immense value to the education industry. In our project, we have also verified that analyzing drafts containing intermediate steps greatly enhances the effectiveness and reliability of applying large models to error analysis.
Dual-stream large model error cause analysis
The framework is illustrated in Figure 3, which includes two models that perform draft analysis and error cause analysis with recommendations. When a student's answer is marked incorrect, their draft, the problem, the problem explanation, and the correct answer are processed step by step, resulting in an analysis of the error causes and suggestions for follow-up learning.
- 1. Effective Utilization of Draft Data: We explored various approaches to effectively use draft data. Initially, before multimodal large models had reached their current performance, we attempted to use existing OCR tools to parse the drafts and then feed the output into an LLM for analysis. However, this approach was significantly limited by the performance of OCR tools, which could only recognize text and numeric information; they struggled to interpret the spatial structure of the draft and mathematical symbols, resulting in poor outcomes. Given the steady improvement in the image understanding capabilities of multimodal large models, we later integrated these models into our system. As shown in Figure 3, we designed specific prompts for draft analysis and input them, along with the student drafts, into the multimodal large models, which then provided the specific information contained in the drafts as well as descriptions of the problem-solving process.
- 2. Generating Detailed Error Causes and Suggestions: With the initial draft analysis complete, we input the draft summary directly into the LLM, enabling it to infer the student's error causes from the intermediate steps. However, the draft analysis alone, combined with a comparison of the incorrect and correct answers, was still insufficient for a comprehensive error cause analysis. To overcome this limitation, we systematically experimented with nearly all entries in our database and identified the optimal system prompt. As illustrated in Figure 3, we combined the problem, its solution, the correct answer, the answer explanation, the student's incorrect answer, and the draft analysis according to the structure shown in the figure, together with carefully designed guiding words. This gives the LLM a comprehensive understanding of both the correct answer and the student's complete steps, enabling it to generate a detailed analysis of the error causes and provide suggestions for the student.
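The prompt-assembly step described above can be sketched as follows; the function name, field labels, and guiding text are illustrative placeholders, not the production prompts:

```python
# Hypothetical sketch of the second-stream prompt assembly. The guiding words
# and field labels are assumptions; the paper does not publish the exact prompt.
def build_error_analysis_prompt(problem: str, solution: str, correct_answer: str,
                                explanation: str, student_answer: str,
                                draft_analysis: str) -> str:
    """Combine all the context the LLM needs for error-cause analysis."""
    return (
        "You are a math teacher analyzing a student's mistake.\n"
        f"Problem: {problem}\n"
        f"Reference solution: {solution}\n"
        f"Correct answer: {correct_answer}\n"
        f"Answer explanation: {explanation}\n"
        f"Student's (incorrect) answer: {student_answer}\n"
        f"Summary of the student's draft: {draft_analysis}\n"
        "Identify the root cause of the error step by step, "
        "then give one concrete follow-up suggestion."
    )
```

The key design point is that the draft summary from the first stream is a plain-text field like any other, so the two models stay loosely coupled.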
[IMAGE OMITTED. SEE PDF]
Error cause analysis based on the error pool
Using dual-stream large models to analyze each student's answer yields good results; however, given our large user base, tens of thousands of students may send requests simultaneously. In such cases, calling the LLM becomes a performance bottleneck, leading to significant efficiency issues. Additionally, we observed that the distribution of student errors follows a long-tail pattern. We analyzed over 2000 representative errors from the error pools of the top 800 students ranked by error rate between April and May 2024. As shown in Figure 4, most incorrect answers fall into fewer than 40 categories, and the error causes for each incorrect answer are generally the same. For example, in Figure 3, the only plausible reason for obtaining 598 is forgetting to add the final 89. For such errors, it is unnecessary to call the large model for reanalysis. Instead, we can use the analysis and responses stored in the historical error pool to provide suggestions and feedback to the student. This approach not only avoids the variance introduced by repeated calls but also improves the system's efficiency. Specifically, our error cause analysis using the error pool involves two steps:
- (1) Error Pool Matching: Our error pool stores pairs of question IDs and student answers as hash keys, with each pair having a unique key. That is, we assume that for a given question, students who provide the same incorrect answer likely followed similar intermediate steps and therefore share the same error cause. This rule-based matching approach is common in traditional error analysis tools (Brown and Burton 1978). The assumption has been validated with an expert for lower-grade cases (e.g., K5); for more complex cases, the expert suggested introducing additional hash keys, such as intermediate answers, where feasible. Nevertheless, the assumption remains valid for a substantial portion of problems. As shown in Figure 2, when a student provides an incorrect answer, if the question ID and student answer match an entry in the error pool, the system directly returns the precomputed error cause; otherwise, the answer is passed to our dual-stream large model for analysis.
- (2) Error Pool Update: Updating the error pool can be straightforward: whenever a new question ID and student answer pair is encountered, the pool is expanded. However, students sometimes provide completely random answers that have no value, and for more complex problems there can be many possible errors. Without restrictions, the error pool could expand rapidly, reducing retrieval efficiency. We therefore impose the following constraints when updating the error pool:
- 1. Quality: Judging the value of a new entry from the answer alone is difficult. Fortunately, since we require students to upload draft content, we first ask the multimodal model to score the draft on clarity, spatial utilization, organization, consistency, correction traces, and neatness. Drafts with low scores generally reflect random scribbles and provide little reference value. We only add new question ID-student answer pairs to the error pool when the draft quality meets our standards.
- 2. Quantity: We limit the error pool to a maximum of 100 entries per problem to reduce the retrieval burden.
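The matching and update logic above amounts to a small keyed cache. A minimal sketch follows; the hash key (question ID plus student answer) and the 100-entry cap come from the text, while the quality threshold value and the scorer interface are assumed placeholders:

```python
# Minimal sketch of the error pool. QUALITY_THRESHOLD is an assumed cutoff;
# the deployed system's draft-quality scale is not published.
QUALITY_THRESHOLD = 0.6
MAX_ENTRIES_PER_PROBLEM = 100  # quantity constraint from the text

class ErrorPool:
    def __init__(self):
        self._pool = {}         # (question_id, student_answer) -> cached analysis
        self._per_problem = {}  # question_id -> number of stored entries

    def lookup(self, question_id, student_answer):
        """Return the precomputed error cause, or None on a cache miss
        (a miss falls through to the dual-stream large model)."""
        return self._pool.get((question_id, student_answer))

    def update(self, question_id, student_answer, analysis, draft_quality):
        # Quality constraint: skip low-quality (scribbled) drafts.
        if draft_quality < QUALITY_THRESHOLD:
            return False
        # Quantity constraint: cap entries per problem.
        if self._per_problem.get(question_id, 0) >= MAX_ENTRIES_PER_PROBLEM:
            return False
        key = (question_id, student_answer)
        if key not in self._pool:
            self._pool[key] = analysis
            self._per_problem[question_id] = self._per_problem.get(question_id, 0) + 1
        return True
```

A hit returns the stored analysis with no model call; only misses that pass both constraints trigger (and then cache) a fresh dual-stream analysis.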
[IMAGE OMITTED. SEE PDF]
The primary benefit of the error pool is the reduction in computational costs. Let P be the total number of problems in our system and A be the maximum number of distinct student answers stored per problem (at most 100, per the quantity constraint above). The total number of LLM API calls (as well as the number of tokens consumed) is then upper-bounded by P × A, independent of the number of users, trigger frequency, and so forth. This is especially advantageous when budgeting for multimodal LLM API calls, which are typically more expensive than pure text calls.
DIALOGUE SYSTEM DESIGN
To date, we have developed a robust and efficient error analysis and recommendation tool. However, how to interact effectively with users remains a critical challenge. In the process of educating students, it is not advisable to directly provide the complete error analysis and answers. Instead, it is more beneficial to guide students step by step to discover the issues themselves, thereby enhancing the learning effect. To achieve this, we have designed specific prompts for the large model. Similar to the initial error analysis stage, we input the problem, explanation, student's answer, and the model's error analysis and suggestions into the LLM. However, unlike the former, we impose a series of requirements on the LLM in this stage: (1) Avoid directly providing the answers; instead, guide students to find the answers themselves through questions and hints. (2) Adhere to a guided and heuristic teaching approach, encouraging students to engage in self-directed learning and critical thinking. (3) Maintain relevance in the dialogue, focusing on the issues and challenges the student is currently facing. (4) Respect the student's learning pace, avoiding overly rapid progression. (5) Refrain from discussing topics unrelated to learning.
With these requirements in place, our dialogue system can provide guided instruction to students, building upon the foundation of error analysis.
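A minimal sketch of how the five requirements might be encoded as a system prompt for the dialogue stage; the wording below is illustrative, not the deployed prompt, and the message-list shape follows the common chat-API convention:

```python
# Illustrative system prompt encoding the five dialogue requirements.
# The exact production wording is not reproduced here.
TUTOR_SYSTEM_PROMPT = """You are a patient math tutor. Rules:
1. Never state the final answer directly; lead the student there with questions and hints.
2. Use a guided, heuristic teaching style that encourages self-directed thinking.
3. Stay focused on the problem and difficulty the student is currently facing.
4. Match the student's pace; do not rush ahead.
5. Politely decline topics unrelated to learning."""

def build_dialogue_context(problem, explanation, student_answer, error_analysis):
    # Same inputs as the error-analysis stage, plus the generated analysis,
    # packaged as a chat-style message list.
    return [
        {"role": "system", "content": TUTOR_SYSTEM_PROMPT},
        {"role": "user", "content": (
            f"Problem: {problem}\nExplanation: {explanation}\n"
            f"My answer: {student_answer}\nDiagnosed error: {error_analysis}"
        )},
    ]
```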
DEPLOYMENT AND APPLICATION
We illustrate the student interface post-VATE deployment in Figure 5, highlighting the key components such as the dialogue box, problem and solution explanation, draft, and student's answer. After each learning session, we can track the student's study time and mastery of each knowledge point and compare their time usage with that of students nationwide (Figure 6). Note that while the deployed environment predominantly features Chinese content, we have translated the interface into English for easier review in Figure 6.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
The VATE system is flexible with different large backbone models and highly efficient in the deployed products. When it is powered by GPT-4o, the average number of input tokens is 1865, the average number of output tokens is 81, and the average response time is 1955 ms. As of now, the VATE system has achieved widespread deployment, covering 35 provinces and 338 cities, significantly impacting the analysis and guidance of elementary school students' mathematical errors. With over 100,000 usage records, this system has become an integral tool for educators, demonstrating its extensive reach and effectiveness in addressing student learning challenges across a vast geographical area.
For certain lightweight, low-reasoning tasks, referred to as snap-to-solve, the use of large-scale API-based LLMs to generate reference answers can introduce unnecessary computational overhead and network latency, negatively affecting user experience. These tasks typically include simple math questions, short-response draft analyses (e.g., OCR text length less than 20 characters), casual conversations, story generation, and other applications that do not require deep reasoning. Since these tasks are relatively straightforward, they do not benefit significantly from the reasoning capabilities of cloud-based LLMs. Instead, we deploy Google's open-source large model, Gemma-3n-4b, directly on edge devices to handle these tasks locally. This setup not only reduces latency and improves responsiveness but also allows limited offline functionality in scenarios with poor or no internet access, greatly enhancing usability in rural or bandwidth-constrained environments. Figure 7 illustrates the user interaction interface for these edge-side tasks.
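The edge/cloud split can be sketched as a simple routing heuristic. The task categories and the 20-character OCR threshold come from the text; the function and category names are assumptions:

```python
# Hedged sketch of the snap-to-solve routing heuristic. Category names are
# hypothetical; the 20-character OCR threshold is taken from the text.
EDGE_TASKS = {"simple_math", "casual_chat", "story_generation"}
OCR_SHORT_RESPONSE_LIMIT = 20  # characters

def route_request(task_type: str, ocr_text: str = "") -> str:
    """Return 'edge' for lightweight tasks (handled by the on-device
    Gemma model) and 'cloud' for tasks that need deep reasoning."""
    if task_type in EDGE_TASKS:
        return "edge"
    if task_type == "draft_analysis" and len(ocr_text) < OCR_SHORT_RESPONSE_LIMIT:
        return "edge"
    return "cloud"
```

Because routing depends only on the task type and a cheap length check, it can run entirely on-device, which is what preserves the limited offline functionality described above.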
[IMAGE OMITTED. SEE PDF]
EXPERIMENTS
Offline experiment: human evaluation of error analysis capabilities
To validate the effectiveness of our VATE system's error analysis capabilities, we conducted an experimental analysis on a selection of representative and commonly mistaken question types and categories drawn from our accumulated usage samples. The data selection logic was as follows: (1) questions that students repeatedly answered incorrectly, meaning the question was re-recommended to the student and they continued to make errors; (2) questions with a relatively high average error rate, specifically those ranking high in overall incorrect response rates. We applied a weighted ranking based on criteria (1) and (2) to select the top 420 questions.
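The weighted ranking over the two criteria can be sketched as follows; the weights and the assumption that both criteria are pre-normalized to [0, 1] are hypothetical, as the paper does not specify them:

```python
# Hypothetical sketch of the weighted question selection. Weights are
# placeholders; both criteria are assumed normalized to [0, 1].
def select_top_questions(stats, k=420, w_repeat=0.5, w_error_rate=0.5):
    """stats: list of (question_id, repeat_error_score, avg_error_rate) tuples,
    with both scores normalized. Returns the k highest-scoring question IDs."""
    scored = [(w_repeat * rep + w_error_rate * err, qid)
              for qid, rep, err in stats]
    scored.sort(reverse=True)  # highest combined score first
    return [qid for _, qid in scored[:k]]
```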
These selected sample questions, along with the error analysis and recommendations generated by VATE, were then evaluated by experts who rated them on a binary scale, with 1 indicating success and 0 indicating suboptimal performance. Ultimately, 78.3% of the samples passed expert review, and the generated recommendations were also well received by the experts. Although 78.3% may not appear to completely resolve issues related to error cause analysis, it is important to note that our test data had been filtered to include only the most challenging problems. Moreover, the remaining 21.7% of the samples still provided some reference value, even if they did not meet the accuracy standards set by expert review. This demonstrates that our system is already suitable for daily use and has achieved notable success on the Squirrel AI platform.
In Figure 8, we present one example each of a successful and a failed case. We observed that inaccuracies in error cause analysis were sometimes due to the multimodal large model misinterpreting the draft, while in other instances, the LLM struggled to summarize or infer the fundamental reason behind the error. Therefore, improvements in both models could significantly enhance the quality of our VATE system, indicating considerable potential for further development.
[IMAGE OMITTED. SEE PDF]
Online experiment: impact of VATE on student learning outcomes
To validate the impact of VATE on students' abilities, we conducted a statistical analysis of learning data from 5600 students in Squirrel AI, categorized by different metrics. We classified the data based on whether students used our VATE, and whether their subsequent attempts on the same question were correct after interacting with the system. Specific metrics are detailed in Table 1, and the results are presented in Table 2.
TABLE 1 Metric abbreviations and definitions.
| Abbreviation | Definition |
| NIACT | The cumulative Number of Incorrect Answers per Concept Tag associated with the current question during this learning session. |
| NQCT | The cumulative Number of Questions in the Concept Tag associated with the current question during this learning session. |
| ARCT | The average Answer Rate for the Concept Tag associated with the current question during this learning session. |
| NVRS | The Number of Video Rewatches and Studies for the knowledge point associated with the current question due to low mastery during this learning session. |
TABLE 2 Impact of VATE on learning outcomes.
| Conversation | Effective | NIACT | NQCT | ARCT | NVRS |
| No | N/A | 4.96 | 9.99 | 0.52 | 0.39 |
| Yes | No | 4.18 | 9.59 | 0.59 | 0.30 |
| Yes | Yes | 3.12 | 7.39 | 0.62 | 0.15 |
Overall, after using our system, even if students did not immediately learn the correct answer (not effective), their error rate on related knowledge points decreased by 15%. This indicates that our prompts and suggestions significantly enhance students' understanding of the relevant concepts. Additionally, students' learning efficiency (NQCT) and problem-solving accuracy (ARCT) showed noticeable improvement. When students engaged in highly effective communication with the system, the error rate on relevant knowledge points dropped by 37% compared to those who did not use our system, with further improvements in learning efficiency and accuracy. Moreover, the NVRS, which measures the number of times students needed to rewatch videos to grasp related knowledge, decreased by 61.5%.
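The quoted reductions can be reproduced from the Table 2 values, assuming its three rows correspond to no VATE use, VATE used but not effective, and VATE used effectively:

```python
# Reproducing the reductions quoted above from the Table 2 values. The
# row-to-condition mapping (no use / ineffective / effective) is inferred
# from the surrounding discussion.
niact_no_use, niact_ineffective, niact_effective = 4.96, 4.18, 3.12
nvrs_no_use, nvrs_effective = 0.39, 0.15

drop_ineffective = (niact_no_use - niact_ineffective) / niact_no_use  # ~0.157
drop_effective = (niact_no_use - niact_effective) / niact_no_use      # ~0.371
nvrs_drop = (nvrs_no_use - nvrs_effective) / nvrs_no_use              # ~0.615
```

The computed values line up with the quoted 15%, 37%, and 61.5% figures (the first is rounded down from roughly 15.7%).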
The impact of dialogue quality on learning outcomes
In practical applications, we observed a significant amount of ineffective dialogue. Some conversations were very brief, consisting of only a few characters and lacking clarity, while others contained excessive characters, often resulting from random input by students. We define moderate dialogue as having a total character count between 15 and 120; moderate dialogue accounted for only 61.67% of all dialogue content. Table 3 shows the impact of three types of conversation on the number of times students need to engage in repeated learning of a single knowledge point. In our educational system, each knowledge point is associated with various videos and exercises, and learning only ceases when students have repeatedly studied a knowledge point and their problem-solving accuracy reaches a certain level. The more times students need to repeat their learning, the lower their learning efficiency. As shown, even a small amount of dialogue improves learning efficiency, while moderate interaction maximizes it; beyond this point, additional dialogue contributes little. In the optimal scenario, learning efficiency can be increased by up to 40.5%.
TABLE 3 Impact of the VATE platform on foundational learning sessions.
| Conversation | Effective | Conversation quality | Average |
| No | N/A | No dialogue | 9.81 |
| Yes | No | Too short | 7.19 |
| Yes | Yes | Too short | 6.23 |
| Yes | No | Moderate | 6.54 |
| Yes | Yes | Moderate | 5.85 |
| Yes | No | Too long | 6.46 |
| Yes | Yes | Too long | 5.83 |
Ablation study
To assess the impact of various elements on overall performance, we conducted an ablation study by systematically removing one element from the prompt at a time—Draft, Problem Content, Correct Solution, or Student Answer. We then measured the resulting win rate. The win rate is defined as the proportion of cases where the output with all elements included outperforms the output with one element removed. A senior education expert conducted the comparison, ensuring a rigorous evaluation of the results. The study included 418 records, and the reported win rates represent the average across these cases.
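The win-rate metric reduces to the mean of the expert's binary pairwise judgments across the 418 records; a minimal sketch:

```python
# Win rate: fraction of records where the full-prompt output was judged
# better than the output with one element removed.
def win_rate(judgments):
    """judgments: iterable of 1 (full prompt won the pairwise comparison)
    or 0 (the ablated output won or tied)."""
    judgments = list(judgments)
    return sum(judgments) / len(judgments)
```

A win rate near 0.5 would mean removing the element barely matters; the further above 0.5, the larger the performance drop caused by the removal.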
We present the result in Table 4. Student drafts offer direct insight into their thought processes, enabling the LLM to trace the causes of errors. Without these drafts, the LLM's ability to accurately diagnose the cause of errors is significantly hindered. The problem content provides the necessary context that frames the LLM's reasoning, allowing it to interpret the task correctly. The correct solution, on the other hand, serves as a critical reference point for identifying deviations in student responses. Finally, students' wrong answers are pivotal for error analysis, as they reveal specific misconceptions or gaps in understanding that need to be addressed. Each element plays a unique and indispensable role in the LLM's comprehensive error analysis.
TABLE 4 Ablation study: lower win rate means greater performance drop when removing the element.
| w/o | Draft | Problem | Solution | Answer |
| Win Rate | 0.61 | 0.67 | 0.77 | 0.70 |
How to choose an MLLM for draft analysis?
In practical applications we observe that the primary bottleneck for current MLLMs lies in recognition accuracy—namely, whether the model can correctly parse both the question and the student's answer. State-of-the-art reasoning models already demonstrate PhD-level performance in K-12 problem comprehension and analysis; consequently, once the visual inputs are accurately recognized, these models routinely generate high-quality solutions. Therefore, when selecting models for industrial multimodal projects, we emphasize evaluating their recognition capabilities rather than their reasoning depth. To quantify the impact of foundation models on error-cause diagnosis accuracy—particularly recent “thinking” models (Guo et al. 2025) such as Skywork-R1V (Peng et al. 2025) and Gemini-2.5-Pro (Comanici et al. 2025)—we conduct targeted experiments that isolate the recognition stage.
- Metric. We measure the recognition accuracy of questions and answers in photo-based question answering. Because the dataset is intentionally small, we manually curate a representative subset to compare thinking versus non-thinking models.
- Sample. We select 92 items that vary in difficulty, grade level, and image clarity. We do not directly compare error-cause recognition accuracy; empirically, when a model correctly identifies or solves a problem, it can almost always infer the student's underlying misconception. Thus, our evaluation compares only the recognition accuracy of several baselines; the results are reported below.
From the results in Table 5, Gemini achieves roughly 26 percentage points higher average accuracy than the next-best model. Compared to non-thinking models (e.g., GPT-4o), thinking models demonstrate a clear advantage in problem-solving capabilities.
TABLE 5 MLLM recognition accuracy for draft analysis, sampled by grade level.
| Grade | GPT-4o accuracy | Sky_R1 accuracy | Gemini accuracy | Sample size (N = 93) |
| --- | --- | --- | --- | --- |
| 1 | 42.9% | 28.6% | 57.1% | 7 |
| 2 | 64.3% | 64.3% | 92.9% | 14 |
| 3 | 77.8% | 77.8% | 88.9% | 9 |
| 4 | 75.0% | 87.5% | 100.0% | 8 |
| 5 | 63.6% | 63.6% | 100.0% | 11 |
| 6 | 44.4% | 66.7% | 88.9% | 18 |
| 7 | 37.5% | 55.6% | 88.9% | 9 |
| 8 | 27.3% | 36.4% | 100.0% | 12 |
| 9 | 80.0% | 80.0% | 80.0% | 5 |
| Average | 57.0% | 62.3% | 88.5% | 10 |
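The "Average" row appears to be the unweighted (macro) mean of the per-grade accuracies, with each grade counted equally regardless of its sample size; under that assumption it can be reproduced directly from the table values:

```python
# Per-grade accuracies from Table 5, in percent (grades 1-9).
gpt4o  = [42.9, 64.3, 77.8, 75.0, 63.6, 44.4, 37.5, 27.3, 80.0]
sky_r1 = [28.6, 64.3, 77.8, 87.5, 63.6, 66.7, 55.6, 36.4, 80.0]
gemini = [57.1, 92.9, 88.9, 100.0, 100.0, 88.9, 88.9, 100.0, 80.0]

def macro_avg(xs):
    # Unweighted mean over grades, rounded to one decimal place.
    return round(sum(xs) / len(xs), 1)

print(macro_avg(gpt4o), macro_avg(sky_r1), macro_avg(gemini))
# 57.0 62.3 88.5 -- matching the table's Average row.
```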
Additionally, we observe several interesting patterns in model performance across different problem types:
- All models perform well on word problems and clear arithmetic exercises.
- Both Sky_R1 and Gemini successfully handle problems requiring moderate reasoning and those with slightly unclear symbols.
- Only Gemini demonstrates proficiency with problems containing easily confused symbols or digits, as well as geometric problems, including three-dimensional geometry.
Case study
In Figure 9, we present a real-world example of an error analysis dialogue. Before the dialogue, the agent conducted the error analysis as outlined in Figure 3 and successfully identified that the error was caused by a calculation oversight. The dialogue begins with an open-ended question posed by the agent, designed to encourage the student's independent problem-solving and critical thinking. After that, the agent leads the discussion by methodically solving the problems step by step. With the agent's assistance, the student learns to break down the problem of “calculating days between dates” into calculating the days within two separate months and finally derive the correct answer. At the conclusion of the dialogue, the agent encourages the student to be more attentive in future calculations, highlighting that the error was due to a simple oversight. This case shows that the error analysis dialogue provides a more user-friendly protocol between the system and students, facilitating easier learning from mistakes.
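The month-by-month decomposition the agent teaches can be illustrated in code. The specific dates below are hypothetical, not the ones from Figure 9; the split-at-the-month-boundary strategy is the point:

```python
from datetime import date
import calendar

def days_between_split(start, end):
    """Count days from start to end (exclusive of start, inclusive of end)
    by splitting at the month boundary, mirroring the tutoring strategy:
    days remaining in the first month, plus days elapsed in the second."""
    assert start.year == end.year and end.month == start.month + 1
    days_in_first_month = calendar.monthrange(start.year, start.month)[1]
    return (days_in_first_month - start.day) + end.day

# Hypothetical example: May 18 to June 7.
start, end = date(2024, 5, 18), date(2024, 6, 7)
print(days_between_split(start, end))  # 20  (13 days left in May + 7 in June)
print((end - start).days)              # 20, cross-check with datetime
```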
[Figure 9 omitted; see PDF.]
Survey on sales personnel satisfaction
To gain a better understanding of user experiences, we conducted a satisfaction survey among four types of personnel at Squirrel AI: store managers (14.71%), dealers (23.53%), supervisors (52.94%), and salespeople (8.82%). The results in Figure 10 show that nearly 60% of respondents believe our virtual AI teacher effectively prevents uncivil learning behaviors (e.g., students inputting gibberish or engaging in personal attacks) and facilitates effective communication based on drafts. Over 66% of participants acknowledged the AI's scalability across different disciplines and grade levels. The effectiveness of the system's error correction was rated 8.58 out of 10, with an overall score of 8.36. Sales personnel gave a recommendation score of 8.47 out of 10, highlighting the significant impact of Squirrel AI on educational efficiency and product upgrades.
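The reported role percentages are consistent with a pool of 34 respondents; under that assumption (the total is inferred, not stated in the survey), the per-role counts can be recovered as follows:

```python
# Role shares as reported in the survey, in percent.
shares = {"store managers": 14.71, "dealers": 23.53,
          "supervisors": 52.94, "salespeople": 8.82}

def infer_counts(shares, total):
    # Round each share of the assumed total to the nearest whole respondent.
    return {role: round(pct / 100 * total) for role, pct in shares.items()}

counts = infer_counts(shares, total=34)  # total of 34 is an assumption
print(counts)
# {'store managers': 5, 'dealers': 8, 'supervisors': 18, 'salespeople': 3}
```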
[Figure 10 omitted; see PDF.]
CONCLUSION
In conclusion, the proposed virtual AI teacher system (VATE) represents a significant advancement in educational technology by effectively addressing the limitations of traditional error correction methods. By integrating multimodal data such as student drafts and utilizing dual-stream large language models, VATE not only enhances the accuracy of error cause analysis but also provides targeted instructional guidance. This approach enables a scalable, cost-effective, and flexible educational process that can be applied across various subjects and grade levels. The implementation of an error pool further optimizes the system's efficiency, reducing computational demands while maintaining high accuracy in error detection and feedback. Empirical results demonstrate that the VATE system significantly improves student learning outcomes, as evidenced by increased mastery of knowledge points and high satisfaction ratings. The success of VATE underscores the potential of AI-driven solutions in revolutionizing education, offering a model that can be adapted and expanded to meet diverse educational needs.
FUTURE WORK
Building on the foundation established by error analysis, our future work will explore the development of a learning content recommendation system that leverages these insights to create more advanced adaptive learning systems (Wen et al. 2024; Li et al. 2024a). Error analysis enables recommendation systems to become more targeted and goal-oriented, allowing for tailored interventions. For example, the system could suggest specific exercises to address non-knowledge-related errors, such as calculation mistakes or misunderstandings, while generating personalized learning paths for knowledge-based errors to enhance mastery of specific concepts. We plan to present this next step in our research in a follow-up paper, focusing on how integrating error analysis into recommendation systems can optimize learning outcomes and further improve student engagement and success.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
Brown, J., and K. Skow. 2016. “Mathematics: Identifying and Addressing Student Errors.” The IRIS Center, 31.
Brown, J. S., and R. R. Burton. 1978. “Diagnostic Models for Procedural Bugs in Basic Mathematical Skills.” Cognitive Science 2, no. 2: 155–192.
Brown, T., B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. 2020. “Language models are few‐shot learners.” In Advances in Neural Information Processing Systems (NeurIPS 2020), 33, 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a‐Abstract.html
Chiang, W.‐L., Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. 2023. “Vicuna: An open‐source chatbot impressing GPT‐4 with 90% ChatGPT quality.” arXiv preprint arXiv:2309.11998. [Online]. Available: https://vicuna.lmsys.org/
Comanici, G., E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. 2025. “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.” arXiv preprint arXiv:2507.06261. [Online]. Available: https://arxiv.org/abs/2507.06261
Feldman, M. Q., J. Y. Cho, M. Ong, S. Gulwani, Z. Popović, and E. Andersen. 2018. “Automatic diagnosis of students' misconceptions in K–8 mathematics.” In Proc. of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18), 1–12. https://doi.org/10.1145/3173574.3174016
Fu, C., Y.‐F. Zhang, S. Yin, B. Li, X. Fang, S. Zhao, H. Duan, X. Sun, Z. Liu, L. Wang, et al. 2024. “MME‐Survey: A comprehensive survey on evaluation of multimodal LLMs.” arXiv preprint arXiv:2411.15296. [Online]. Available: https://arxiv.org/abs/2411.15296
Guo, D., D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. 2025. “DeepSeek‐R1: Incentivizing reasoning capability in LLMs via reinforcement learning.” arXiv preprint arXiv:2501.12948. [Online]. Available: https://arxiv.org/abs/2501.12948
Haryanti, M. D., T. Herman, and S. Prabawanto. 2019. “Analysis of students' error in solving mathematical word problems in geometry.” In Journal of Physics: Conference Series. 1157, 042084, IOP Publishing. https://doi.org/10.1088/1742‐6596/1157/4/042084
Jarvis, M. P., G. Nuzzo‐Jones, and N. T. Heffernan. 2004. “Applying Machine Learning Techniques to Rule Generation in Intelligent Tutoring Systems.” In International Conference on Intelligent Tutoring Systems, 541–553. Springer: Maceio, Brazil.
Jiang, A. Q., A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. 2023. “Mistral 7B.” arXiv preprint arXiv:2310.06825. [Online]. Available: https://arxiv.org/abs/2310.06825
Li, H., T. Xu, C. Zhang, E. Chen, J. Liang, X. Fan, H. Li, J. Tang, and Q. Wen. 2024. “Bringing generative AI to adaptive learning in education.” arXiv preprint arXiv:2402.14601. [Online]. Available: https://arxiv.org/abs/2402.14601
Li, X., W. Wang, M. Li, J. Guo, Y. Zhang, and F. Feng. 2024. “Evaluating mathematical reasoning of large language models: A focus on error identification and correction.” arXiv preprint arXiv:2406.00755. [Online]. Available: https://arxiv.org/abs/2406.00755
OpenAI. 2023. “GPT‐4 Technical Report.” arXiv preprint arXiv:2303.08774. [Online]. Available: https://arxiv.org/abs/2303.08774
Peng, Y., P. Wang, X. Wang, Y. Wei, J. Pei, W. Qiu, A. Jian, Y. Hao, J. Pan, T. Xie, L. Ge, R. Zhuang, X. Song, Y. Liu, and Y. Zhou. 2025. “Skywork R1V: Pioneering multimodal reasoning with chain‐of‐thought.” arXiv preprint arXiv:2504.05599. [Online]. Available: https://arxiv.org/abs/2504.05599
Priyani, H. A., and R. Ekawati. 2018. “Error analysis of mathematical problems on TIMSS: A case of Indonesian secondary students.” In IOP Conference Series: Materials Science and Engineering. 296, 012010, IOP Publishing. https://doi.org/10.1088/1757‐899X/296/1/012010
Roy, S., and D. Roth. 2018. “Mapping to declarative knowledge for word problem solving.” Transactions of the Association for Computational Linguistics 6: 159–172. https://doi.org/10.1162/tacl_a_00010
Touvron, H., T. Lavril, G. Izacard, X. Martinet, M.‐A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, and F. Azhar, et al. 2023. “LLaMA: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971. [Online]. Available: https://arxiv.org/abs/2302.13971
Wang, S., T. Xu, H. Li, C. Zhang, J. Liang, J. Tang, P. S. Yu, and Q. Wen. 2024. “Large language models for education: A survey and outlook.” arXiv preprint arXiv:2403.18105. [Online]. Available: https://arxiv.org/abs/2403.18105
Wang, Z., S. Tschiatschek, S. Woodhead, J. M. Hernández‐Lobato, S. P. Jones, R. G. Baraniuk, and C. Zhang. 2021. “Educational question mining at scale: Prediction, analysis and personalization.” In Proc. of the AAAI Conference on Artificial Intelligence (AAAI '21), 35, no. 15, 15669–15677. https://doi.org/10.1609/aaai.v35i15.17619
Wen, Q., J. Liang, C. Sierra, R. Luckin, R. Tong, Z. Liu, P. Cui, and J. Tang. 2024. “AI for Education (AI4EDU): Advancing personalized education with LLM and adaptive learning.” In Proc. of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24), 6743–6744. https://doi.org/10.1145/3637528.3671536
Xu, T., R. Tong, J. Liang, X. Fan, H. Li, and Q. Wen. 2024. “Foundation Models for Education: Promises and Prospects.” IEEE Intelligent Systems 39, no. 3: 20–24.
Yin, S., C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen. 2023. “A survey on multimodal large language models.” arXiv preprint arXiv:2306.13549. [Online]. Available: https://arxiv.org/abs/2306.13549
Zhang, Y.‐F., Q. Wen, C. Fu, X. Wang, Z. Zhang, L. Wang, and R. Jin. 2024. “Beyond LLaVA‐HD: Diving into high‐resolution large multimodal models.” arXiv preprint arXiv:2406.08487. [Online]. Available: https://arxiv.org/abs/2406.08487
Zhang, Y.‐F., H. Zhang, H. Tian, C. Fu, S. Zhang, J. Wu, F. Li, K. Wang, Q. Wen, Z. Zhang, et al. 2024. “MME‐RealWorld: Could your multimodal LLM challenge high‐resolution real‐world scenarios that are difficult for humans?” arXiv preprint arXiv:2408.13257. [Online]. Available: https://arxiv.org/abs/2408.13257
Zhou, A., K. Wang, Z. Lu, W. Shi, S. Luo, Z. Qin, S. Lu, A. Jia, L. Song, M. Zhan, et al. 2024. “Solving challenging math word problems using GPT‐4 Code Interpreter with code‐based self‐verification.” In Proc. of the Twelfth International Conference on Learning Representations (ICLR).
© 2025. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
This paper extends our previously published work on the virtual AI teacher (VATE) system, presented at IAAI‐25. VATE is designed to autonomously analyze and correct student errors in mathematical problem‐solving using advanced large language models (LLMs). By incorporating student draft images as a primary input for reasoning, the system provides fine‐grained error cause analysis and supports real‐time, multi‐round AI–student dialogues. In this extended version, we introduce a new snap‐to‐solve module for handling low‐reasoning tasks using edge‐deployed LLMs, enabling faster and partially offline interaction. We also include expanded benchmarking experiments, including human expert evaluations and ablation studies, to assess model performance and learning outcomes. Deployed on the Squirrel AI platform, VATE demonstrates high accuracy (78.3%) in error analysis and improves student learning efficiency, with strong user satisfaction. These results suggest that VATE is a scalable, cost‐effective solution with the potential to transform educational practices.
Details
1 Squirrel AI Learning, Shanghai, China
2 NLPR, CASIA MAIS, Shanghai, China