Abstract
The current study focuses on the use of Grammatical Error Correction (GEC) technology for assessing language accuracy, a construct that has received relatively less attention than complexity and fluency in the context of automated assessment. Adopting a technology-enhanced, rather than technology-driven, approach to language assessment, we critically assessed the suitability of a state-of-the-art GEC system for assessing language accuracy in Korean, an understudied language in this regard. We quantitatively analyzed how reliable this system is and qualitatively examined what types of errors it generates. We also used out-of-domain, inclusive data from heritage speakers of Korean, a population that has never been considered in the development of GEC systems. Our accuracy analyses show that the system achieves fairly high accuracy in differentiating between correct and incorrect sentences in our data (F0.5 = 0.819). However, the system tends to make unnecessary corrections, such as inserting topics/adverbials or correcting particles, while in some cases failing to correct sentences that are actually ungrammatical. These findings from our mixed-method analyses suggest that language evaluators should recognize the potential for inaccurate assessments when using a GEC system, as its output may still be incorrect, highlighting the critical need for digital language assessment literacy.
Introduction
Technological progress has contributed to the advancement of language assessment, providing practical benefits such as automated essay grading. However, the development and use of technology for language assessment calls for a cautious and thoughtful approach. Brunfaut (2023) points out that, unfortunately, language assessment has become technology-driven: technology forms the core of its design, delivery, and scoring, and the assessment's functionality is entirely dependent on it. This is distinct from the more desirable technology-enhanced approach, in which technology serves to improve and support assessment methods while its inherent limitations are recognized. She thus suggests that when employing technologies, we should focus on the underlying construct and possess a solid understanding of the target language for testing purposes. Furthermore, she emphasizes the importance of implementing small-scale testing and expanding testing efforts to encompass a variety of world languages.
Against this backdrop, the current study focuses on the construct of accuracy, which has received relatively less attention than complexity (Hwang, 2025) and fluency (Tavakoli et al., 2023) in the context of automated assessment. The CAF framework (e.g., Skehan, 1998) distinguishes three dimensions of language performance. Accuracy refers to the ability to produce language that conforms to the target language, without errors; this construct is often measured as the ratio of error-free sentences. Complexity is the ability to use diverse and sophisticated vocabulary and structures, which can be quantified by the diversity and number of complex units (e.g., lexical diversity, number of subordinate clauses), and fluency is the ability to express oneself smoothly and eloquently, which can be measured as the number of words/clauses/sentences per production sample.
The technology of our focus is Grammatical Error Correction (GEC). While generating error-free language is vital for improving text readability and comprehension (Marier et al., 2025), measuring accuracy has proven difficult until recently. Unlike complexity or fluency, which can be readily calculated by counting (complex) linguistic units, the sheer diversity of natural language errors, involving form, meaning, and context, long made automated accuracy assessment challenging, necessitating reliance on manual human annotation in research (e.g., Polat & Kim, 2014). However, the advent of advanced GEC systems, capable of automatically suggesting corrections for diverse grammatical errors, has introduced a significant shift (e.g., Jiang, 2025; Munda et al., 2025; Virtanen & Toshevska, 2025; Wei, 2025).
For example, English GEC systems developed by Jiang (2025) and Wei (2025) both demonstrated high grammatical correction accuracy, with Jiang’s system achieving an F1 score of 0.801 and Wei’s system reaching 94.7% to 97.8% accuracy. Notably, these systems utilized the Transformer architecture, a deep learning approach renowned for its ability to interpret text meaning and capture grammatical and contextual nuances; this capability has led to its extensive use in GEC systems and across various other natural language processing applications. These GEC systems’ successful performance suggests their strong potential to improve text accuracy, making them a valuable resource for educational purposes and writing support (e.g., Barrot, 2021). Importantly for our research purposes, GEC systems also enable the automated measurement of accuracy: A sentence is considered ungrammatical if corrected by the system, and grammatical if left uncorrected. This straightforward binary judgment then allows for easy computation of the number of error-free sentences in a text, thereby improving the efficiency of the overall accuracy assessment process.
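As a minimal illustration of this binary logic (a sketch, not any particular system's API), the snippet below assumes a hypothetical `correct_sentence` wrapper around a GEC system and derives the error-free sentence ratio from its output:

```python
# Sketch: deriving a binary grammaticality judgment and an error-free
# sentence ratio from GEC output. `correct_sentence` is a hypothetical
# wrapper returning the system's corrected version of its input.

def error_free_ratio(sentences, correct_sentence):
    """Share of sentences the GEC system leaves unchanged (judged grammatical)."""
    unchanged = sum(1 for s in sentences if correct_sentence(s) == s)
    return unchanged / len(sentences)

# Toy stand-in "system" that merely collapses doubled exclamation marks.
demo_gec = lambda s: s.replace("!!", "!")
print(error_free_ratio(["All good.", "So fun!!"], demo_gec))  # 0.5
```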
More recently, Korean, an understudied language in the context of GEC (see also Marier et al., 2025), has also seen gains in correction accuracy. The state-of-the-art GEC model for this language is the Korean Automatic Grammatical Error Annotation System (KAGAS; Yoon et al., 2023). This system is likewise built on a Transformer-based model, a Korean-specific one called KoBART (Katsumata & Komachi, 2020), and trained on three different datasets from adult second language learners, adult native speakers, and a social platform for language learners. The system has demonstrated impressive error-correction performance, achieving accuracy rates of 87.34% on the adult second language learner data, 93.93% on the adult native speaker data, and 87.06% on the social platform data.
Despite these improvements, two fundamental issues require careful consideration in the development and application of GEC systems. One issue is that progress in GEC systems could lead some evaluators to embrace a technology-driven approach to assessing accuracy, believing that these systems, with their enhanced performance, can reliably replace human judgment. Yet, because these systems primarily rely on deep learning methods that are not clearly explainable and still produce various errors, recognizing their limitations is essential, in line with a technology-enhanced approach.
In fact, GEC systems tend to correct instances that are already error-free (e.g., Maeng et al., 2023). This limitation is reflected in their low precision scores and hence low F0.5 values. Precision and recall are crucial metrics for evaluating the performance of classification systems: precision measures how many of the cases a system classifies as errors are actually errors, whereas recall measures how many of the actual errors the system successfully identifies. F0.5 is the standard evaluation metric for GEC systems; it weights precision more heavily than recall in order to penalize overcorrections (false positives). As the first study to investigate the application of large language models to Korean GEC, Maeng et al. (2023) found that both GPT-3.5 and GPT-4 tended to overcorrect sentences. The problem with such overcorrections is that GEC systems are likely to misidentify grammatically correct sentences as errors and provide unnecessary corrections. This problem has implications for both language assessment and education: it leads to inaccurate evaluations of accuracy, and it risks confusing language evaluators who lack robust knowledge of the target language as well as language learners, who may become uncertain about their own grammar and attempt to fix sentences that are in fact correct. Maeng et al. (2023) additionally showed GPT-4’s superior performance (F0.5 = 0.43) compared to GPT-3.5’s (F0.5 = 0.38), although the accuracy levels of both were relatively modest. This finding underscores the necessity of testing systems specifically designed for GEC, rather than generic large language models, to achieve accurate assessment.
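For reference, the standard formulation of the F_β family behind these scores is given below, where P is precision and R is recall; setting β = 0.5 makes precision count twice as much as recall, which is how F0.5 penalizes overcorrection:

$$
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2\,P + R},
\qquad
F_{0.5} = \frac{1.25\,P\,R}{0.25\,P + R}
$$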
Another issue is the nature of the data often used for GEC development. Developing a dataset with high-quality annotations from humans with expert linguistic knowledge is expensive. For this reason, GEC systems have relied on a limited number of datasets, either publicly available or artificially curated. For example, Yoon et al. (2023) leveraged (a) the Korean Learners’ Corpus (The National Institute of Korean Language, 2020), a large, open-access corpus featuring human-annotated corrections for grammatical errors, and (b) the Native Korean Corpus, which was artificially created by having native Korean speakers write down grammatically correct sentences that were dictated to them. However, the fact that a GEC system will ultimately be applied to natural data produced by people with varying language backgrounds underscores the pressing need to evaluate its accuracy and examine its error patterns using out-of-domain data that were not part of its development. In addition, the AI era emphasizes the importance of inclusive data collection, in which data are sourced from individuals with diverse language learning backgrounds, locations, ages, and so on, to ensure that AI-based systems are fair and effective for all users (see also Marier et al., 2025).
To address the issues above, this study examines the KAGAS for its precision, recall, and F0.5 using a quantitative approach, as well as its errors using a qualitative approach. For these mixed-method analyses, the study uses inclusive data from child and teenage heritage speakers of Korean. As this group of speakers acquires Korean as a minority language, usually spoken at home, concurrently with the dominant community language, their language use patterns may diverge from those of both native speakers and second language learners of Korean (Crosthwaite & Jee, 2020). This factor may unveil issues that language evaluators need to be aware of when applying a GEC system in their assessment or research. Through these novel attempts, the study is expected to offer insights into the behavior of GEC systems and reveal their potential merits and limitations. Ultimately, it can contribute to improving the performance of GEC systems while promoting the technology-enhanced approach (see Brunfaut, 2023).
Method
Data
This study gathered handwritten journals from 11 Korean heritage speakers, aged 9 to 13, who were born in an English-speaking country (e.g., the USA) or moved there in their childhood. The speakers contributed an average of 9.55 journals each (SD = 8.21). While English dominated their daily communication, they also used Korean to converse with their parents, with Korean teachers at Korean Saturday school, or, rarely, with peers who shared their heritage background. The collected written journal data exhibited characteristics of spoken language, including an informal tone and a number of grammatical errors, likely because their usual mode of communication with other Korean speakers was spoken.
The journal diaries, consisting of 736 sentences, were typed into Microsoft Excel and then manually corrected for grammatical errors by the author, a native speaker of Korean. During this process, 53 sentences were removed because they contained multiple English words or their meaning was unclear, leaving a total of 683 sentences in the final dataset.
Accuracy analysis
Initially, all sentences within the dataset were categorized into “grammatical” and “ungrammatical” groups by both the author and the KAGAS system. The author’s classification was determined by her manual corrections for grammatical errors. Specifically, sentences that were corrected by the author (e.g., (1)) were marked ungrammatical in the “human_annotation” column, whereas sentences that were not corrected (e.g., (2)) were marked grammatical. Concurrently, the same sentences were also entered into the KAGAS, which followed a similar straightforward rule: Sentences that required correction by the system were coded as ungrammatical (e.g., (1)) in the “gec_annotation” column, while those left uncorrected were coded as grammatical (e.g., (2)).
The KAGAS system took into account various error categories, including word insertion and deletion; spacing, word order, spelling, punctuation, and shortening; and errors involving verbs, adjectives, nouns, particles, enders, modifiers, and conjugations. To assess the accuracy of the KAGAS on our data, we calculated its precision, recall, and F0.5 in Python using the “scikit-learn” package. We hypothesized that if the system indeed overcorrects (see Section “Introduction”), its classification accuracy would be particularly poor for grammatically correct sentences.
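A minimal sketch of this computation is shown below, assuming the two annotation columns described above are stored in a spreadsheet; the file name journals.xlsx is a hypothetical placeholder, and the actual analysis script is available at the OSF link that follows.

```python
# Sketch of the accuracy analysis with scikit-learn. Column names follow the
# "human_annotation" / "gec_annotation" coding described above; journals.xlsx
# is a hypothetical placeholder for the typed journal data.
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_excel("journals.xlsx")
y_true = df["human_annotation"]  # author's manual grammaticality judgments
y_pred = df["gec_annotation"]    # judgments derived from the KAGAS output

# Per-class precision, recall, and F0.5 (beta = 0.5 favors precision)
p, r, f05, support = precision_recall_fscore_support(
    y_true, y_pred, beta=0.5, labels=["grammatical", "ungrammatical"])

# Weighted averages across the two classes, as reported in Table 1
wp, wr, wf05, _ = precision_recall_fscore_support(
    y_true, y_pred, beta=0.5, average="weighted")
print(wp, wr, wf05)
```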
The analysis script and the quantitative data that can reproduce the reported statistical results are available at https://osf.io/32p6m/.
Error analysis
To gain a more nuanced understanding of the KAGAS’s behavior, we undertook a qualitative error analysis, examining cases of overcorrection, or false positives (see also Qorib & Ng, 2022). We also further examined cases of undercorrection, or false negatives. In the “Results” section, we focus on reporting the most frequent and significant error types, which are expected to offer useful insights for language assessment.
Results
Accuracy analysis
As presented in Table 1, the KAGAS achieved an F0.5 score of 0.819 and an accuracy of 80.23%, demonstrating fairly strong overall performance on the out-of-domain dataset comprising journal diaries written by heritage speakers of Korean. Although this accuracy is lower than the rates reported in Yoon et al. (2023) for the adult second language learner data (87.34%), the adult native speaker data (93.93%), and the social platform data (87.06%), it is still moderately high.
Table 1. Classification accuracy by grammaticality
| | Precision | Recall | F0.5 | Support |
|---|---|---|---|---|
Grammatical | 0.426 | 0.531 | 0.444 | 113 |
Ungrammatical | 0.902 | 0.858 | 0.893 | 570 |
Weighted average | 0.823 | 0.804 | 0.819 | 683 |
However, a closer examination reveals that the system’s accuracy varied significantly depending on the grammaticality of the sentences, with a notable decline in performance for grammatical sentences (F0.5 = 0.444) compared to ungrammatical ones (F0.5 = 0.893). Importantly, the system demonstrated a tendency to overcorrect. As shown in Fig. 1, it incorrectly identified 46.90% (53 instances) of the total 113 grammatical sentences as ungrammatical. In contrast, it made fewer mistakes when identifying ungrammatical sentences, misclassifying 14.21% (81 instances) out of 570 as grammatical.
Fig. 1. Classification report
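To make the link between these counts and the per-class scores in Table 1 explicit, the arithmetic below reconstructs the confusion matrix from the figures reported above; this is an illustrative recomputation, not the original analysis script.

```python
# Confusion matrix behind Table 1, reconstructed from the counts above
# (treating "ungrammatical" as the positive class).
tn = 113 - 53   # grammatical sentences correctly left uncorrected (60)
fp = 53         # grammatical sentences flagged as ungrammatical (overcorrection)
fn = 81         # ungrammatical sentences left uncorrected (undercorrection)
tp = 570 - 81   # ungrammatical sentences correctly flagged (489)

print(round(tp / (tp + fp), 3))  # precision, ungrammatical: 0.902
print(round(tp / (tp + fn), 3))  # recall, ungrammatical: 0.858
print(round(tn / (tn + fn), 3))  # precision, grammatical: 0.426
print(round(tn / (tn + fp), 3))  # recall, grammatical: 0.531
```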
Error analysis
Overcorrection
Our error analysis identified six distinct error categories, outlined in Table 2. First, the most frequent errors (69.81%) involved the KAGAS inserting a subject or topic into the input sentence. In 22 instances, for example, the system added I plus the topic marker, as in (3) in Table 2. Additionally, it sometimes inserted the topic marker only, without a corresponding argument. While this behavior is understandable as a suggestion to make the topic explicit, it poses a particular issue in genres such as journal diaries or narrative essays, because dropping a topic/subject is allowed when it is inferable and/or has already been given in context. Omitting a topic/subject, particularly I, is actually preferred in journal diaries, as is typical of topic-prominent languages such as Korean.
Table 2. Six distinct error categories for overcorrection and their examples
Type | Number of cases (out of 53) | Example |
|---|---|---|
Topic insertion | 37 | (3) 나는 집에 오자마자 샤워하고 잤다. Na-nun cip-ey o-camaca syawe-ha-ko I-TOP home-LOC come-as.soon.as shower-do ca-ss-ta. sleep-PST-DC ‘I slept as soon as I got back home.’ |
Adverbial insertion | 6 | (4) 그래서 기분이 좋았어요. Kulayse kipwun-i coh-ass-e-yo. so feeling-NOM good-PST-DC-POL ‘So I felt good.’ |
Deletion | 1 | (5) 선생님은 너무 착하고 재미있으셔요. sensayngnim-un nemwu chakha-ko teacher-TOP very kind-and caymi-iss-usy-e-yo. fun-exist-HON-DC-POL ‘The teacher is very kind and funny.’ |
Particle correction | 2 | (6) 거기에서 많은 추억이 생겼다. Keki-ey-se manhun chwuek-i sayngky-ess-ta. there-LOC-LOC many memory-NOM arise-PST-DC ‘We got many cherished memories from that place.’ |
Failure to provide a suggestion | 2 | (7) 엄마와 아빠는 테니스하러 나가서 나는 집에서 저녁을 해야 했다. Emma-wa appa-nun theynisu-ha-le naka-se na-nun mom-and dad-TOP tennis-do-CONN go.out-because I-TOP cip-ey-se cenyek-ul hay-ya hay-ss-ta. home-LOC-LOC dinner-ACC do-must do-PST-DC ‘I needed to make dinner because mom and dad went out to play tennis.’ |
Various other errors | 5 | (8) 재미없습니다!!! ㅠ Caymi-eps-supni-ta!!! ㅠ fun-not.exist-POL-DC emoticon for tears ‘It is not fun.’ |
Overcorrections are indicated in bold
Second, the KAGAS’s suggestions also involved inserting a connective, as in (4) in Table 2, or a modifier, such as last to modify Friday (11.32%). While these insertions seemed intended to facilitate connections between the target sentence and the previous one, they were neither necessary nor desirable.
Third, a single case (1.89%) involved the system recommending the removal of the possessive pronoun our (see (5) in Table 2). The motivation for this suggestion is unclear, but it may be related to previous findings that Korean speakers tend to overuse first-person plural pronouns, such as we, us, and our (Jeong, 2005). The KAGAS might have picked up a related pattern from its training data, but suggestions of this kind are likely to negatively impact the accuracy of its evaluations.
Fourth, in two instances (3.77%), the KAGAS suggested adding an extra particle to the sentence. As in (6) in Table 2, it suggested adding the location particle -ey before -se. While this is a grammatically correct suggestion, using -ey before -se is not obligatory in Korean, and it is thus considered an overcorrection.
Fifth, the system sometimes failed to generate any suggestions for sentences that were in fact correct (see (7) in Table 2), leaving the proposed-correction field blank (3.77%).
Sixth, the system showed miscellaneous errors (9.43%). For example, it incorrectly suggested adding or removing an emoticon after the sentence (see (8) in Table 2). This irrelevant suggestion highlights a need for improvement in the system’s ability to distinguish between grammatical errors and other stylistic choices.
Undercorrection
The KAGAS also exhibited undercorrection errors, in which sentences containing (multiple) grammatical errors were not corrected. Although the reason for this behavior is unclear, a closer look at the 81 uncorrected sentences (see Table 3) revealed several distinct error types.
Table 3. Four distinct error categories for undercorrection and their examples
Type | Number of cases (out of 81) | Example |
|---|---|---|
Space insertion | 44 | (9) 너무 [space] 너무 재미있었어요! Nemwu [space] nemwu caymi-iss-ess-e-yo very very fun-exist-PST-DC-POL ‘It was very very fun!’ |
Particle correction | 21 | (10) 공부는 어려워요. Kongpwu-nun elyewe-yo. studying-TOP hard-POL ‘It is hard to study.’ |
Verb conjugation correction | 5 | (11) 다시 뛰었어요. Tasi ttwi-ess-e-yo. again run-PST-DC-POL ‘I ran again.’ |
Noun correction | 4 | (12) […] 아기를 봤습니다. […] Aki-lul pwa-ss-supni-ta. baby-ACC see-PST-POL-DC ‘[…] I saw the baby.’ |
Various other errors | 7 | (13) 죄송해요, [name]. coysong-hay-yo, [name]. apologize-do-POL [name] ‘I am sorry, [name].’ |
The required corrections are marked in bold
First, the analysis revealed that 44 instances, accounting for 54.32% of the total, required corrections to spacing. For instance, (9) in Table 3 shows that the degree word very was used twice in a row without a space in between, likely to emphasize how fun the event was. This type of spacing issue seems relatively straightforward to fix, yet it was not addressed by the KAGAS. Second, 21 instances (25.93%) needed corrections to (case or adverbial) particles. For example, (10) in Table 3 exhibits a particle drop, a typical error among second language or heritage language learners of Korean. Despite the sentence requiring a topic particle (or a nominative case particle), the system did not suggest any corrections. Third, 5 instances (6.17%) needed corrections to verb conjugations. In the case of (11) in Table 3, the correct verb form should include the past tense marker -ess as well as the declarative marker -e. Yet, the writer of this sentence used the morpheme -se instead, which actually means because, and this usage requires correction.
Fourth, 4 instances (4.94%) were related to the use of nouns. Although aki is the standard form of ayki, a variant often allowed in casual speech, the KAGAS did not make this correction, as shown in (12) in Table 3.
Lastly, there were also miscellaneous errors pertaining to, for example, punctuation and tense agreement. For instance, (13) in Table 3 shows a case where the system should have suggested adding a period.
Discussion
To summarize our findings, the KAGAS demonstrated a fairly high level of accuracy in identifying grammatically correct and incorrect sentences on our out-of-domain, inclusive dataset from heritage speakers of Korean (80.23%; F0.5 = 0.819). However, the system was found to overcorrect and undercorrect sentences occasionally. In discussing these findings in this section, we contextualize them within the CAF framework, address the potential reasons for the findings, highlight the significance of digital language assessment literacy, and compare the findings with those from previous GEC research.
The KAGAS’s decent performance suggests that an advanced GEC system can be a viable option for efficiently assessing large amounts of data. Although the CAF framework (Skehan, 1998) guides language evaluators to focus on complexity, accuracy, and fluency, the automated measurement of accuracy has stood out as especially challenging. This difficulty stems from the fact that it requires linguistic expertise in grammar and insights beyond a simple tally of (complex) linguistic units. The use of sophisticated GEC systems, such as the KAGAS, now allows for the efficient acquisition of accuracy data. This development not only represents a significant methodological advancement in language assessment but also bolsters a theoretically driven approach to the field.
Meanwhile, the KAGAS’s accuracy of 80.23% on our dataset is slightly lower than its previously reported performance of 87.06–93.93% on other datasets (Yoon et al., 2023). We attribute this gap to the fact that the type of data we used was not taken into consideration during the development of the KAGAS; our dataset had a distinct nature, as it came from child and teenage heritage speakers of Korean. Furthermore, the KAGAS was found to overcorrect, making unnecessary additions or replacements of adverbials, particles, topics, and other elements. This finding is consistent with the claim made by Maeng et al. (2023) that GEC systems have difficulty accurately identifying grammatically correct sentences. This may be because GEC systems are typically trained on datasets that are biased toward sentences with many errors, leading them to prioritize correction over preservation of grammatical sentences. To address this issue, developers may find it beneficial to use datasets that are balanced in terms of grammaticality.
We also found undercorrections, albeit less frequently. The majority of these errors were related to spacing, particles, verbs, and nouns. One possible way to address the spacing issue could be to add a final post-processing step that double-checks the use of spaces after the core grammatical errors have been addressed. The finding on noun-related errors, as exemplified by (12) in Table 3, aligns with previous research on Transformer-based GEC systems for English, which have been shown to have difficulty detecting errors not present in their training data (see also Qorib & Ng, 2022). From this perspective, the KAGAS would benefit from training data containing a wider range of lexical errors.
The findings discussed above underscore the critical need for digital language assessment literacy when employing a GEC system, as such literacy enables language evaluators to more effectively leverage and interpret the system’s output. Digital language assessment literacy refers to a language evaluator’s thorough comprehension of assessment content, proficiency in implementing relevant technology-based testing approaches, and critical awareness of their advantages and disadvantages (see also Rezai et al., 2021). Before using a GEC system for accuracy assessment, evaluators can test a small portion of their data to determine whether the system of their choice is suitable. They can also share accuracy information regarding the use of the system with the academic and education community or in their publications so that other researchers, evaluators, and developers can benefit.
Finally, one of this study’s contributions lies in examining the practical application of a GEC system for Korean language accuracy assessment, an area that remains relatively under-researched. Our findings indicate that the KAGAS achieved much higher classification accuracy (80.23%; F0.5 = 0.819) than generic large language models such as GPT-3.5 (F0.5 = 0.38) and GPT-4 (F0.5 = 0.43), which were tested on Korean language data by Maeng et al. (2023). The better performance of the KAGAS suggests that large language models are likely unsuitable for precise language accuracy assessment at present, necessitating the use of systems purpose-built for error correction or grammaticality annotation. When comparing our results to those of existing research on English GEC, the KAGAS’s performance on our data is on par with that of Jiang (2025), who reported an F1 score of 0.801, though it is less accurate than Wei’s (2025) system (94.7–97.8%). Given that all compared systems utilize the Transformer architecture and were trained on diverse datasets, the precise reasons for these differing results remain unclear. However, one potential explanation for the exceptionally high performance of Wei’s system could be its evaluation on a very limited dataset of only 20 sentences. Hence, the performance we observed for the Korean GEC system appears comparable to that of advanced GEC systems for English. This observation aligns with the general conclusion drawn from Marier et al.’s (2025) synthesis that advanced technologies can lead to effective GEC systems even for lower-resource languages.
Conclusion
This study shows that the state-of-the-art Korean GEC system achieves high accuracy but still requires improvement to attain human-like performance, as it sometimes overcorrects grammatical sentences and fails to correct ungrammatical ones. For evaluators, digital language assessment literacy should thus be a prerequisite for using an automated system, as it will enable them to better interpret and make use of the system’s output.
Although our mixed-method analyses of this novel dataset revealed useful insights, we acknowledge that the dataset is limited in size and scope, as it features only one type of language learner. Therefore, evaluating more diverse datasets, including spoken language datasets, is crucial for improving the performance of GEC systems. Furthermore, the KAGAS’s Transformer-based algorithm lacks transparency in its decision-making process; for example, it is not clear why it fails to generate feedback for some sentences. This highlights the need for future research to develop more interpretable GEC systems that provide clear explanations for their output. Language evaluators, drawing on their domain expertise, can contribute to this improvement process and help promote a technology-enhanced approach (Brunfaut, 2023).
Authors’ contributions
Haerim Hwang: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing.
Funding
Not applicable.
AI statement
Generative AI tools were used for grammar checks and paraphrasing in some parts. All content and ideas are the author’s own.
Data availability
The quantitative dataset analyzed in the current study is available in the OSF repository, https://osf.io/32p6m/. The qualitative data analyzed are not publicly accessible, as they come from journals that contain private information.
Declarations
Ethics approval and consent to participate
The study received ethics approval from the Institutional Review Board at the University of Hawai‘i, where the author completed her PhD studies, and informed consent was obtained from each participant.
Consent for publication
Not applicable.
Competing interests
The author declares no competing interests.
Abbreviations
ACC: Accusative case particle
CONN: Connector
DC: Declarative sentence type suffix
HON: Honorific marker
LOC: Locative particle
NOM: Nominative case particle
POL: Politeness suffix
PST: Past tense marker
TOP: Topic particle
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Barrot, J. S. (2021). Effects of Facebook-based e-portfolio on ESL learners’ writing performance. Language, Culture and Curriculum, 34.
Brunfaut, T. (2023). Future challenges and opportunities in language testing and assessment: Basic questions and principles at the forefront. Language Testing, 40.
Crosthwaite, P., & Jee, M. J. (2020). Referential movement in L2 vs. heritage Korean: A learner corpus study. In J. Ryan & P. Crosthwaite (Eds.), Referring in a second language (pp. 75–99). Routledge.
Jeong, K. O. (2005). The use of the first person plural possessive pronoun Woorie in Korean language. Hankwukekyoyuk [Journal of Korean Language Education], 16.
Hwang, H. (2025). Growth of lexical and syntactic complexity, accuracy, and fluency in spoken production of first language and second language children. System, 132, 103695. https://doi.org/10.1016/j.system.2025.103695
Jiang, H. (2025). Developing an artificial intelligence model for English grammar correction: A computational linguistics approach. Journal of Computational Methods in Sciences and Engineering, 25.
Katsumata, S., & Komachi, M. (2020). Stronger baselines for grammatical error correction using a pretrained encoder-decoder model. In K.-F. Wong, K. Knight, & H. Wu (Eds.), Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing (pp. 827–832). Association for Computational Linguistics. https://aclanthology.org/2020.aacl-main.83/
Maeng, M., Gu, J., & Kim, S.-A. (2023). Effectiveness of ChatGPT in Korean grammatical error correction. In C.-R. Huang, Y. Harada, J.-B. Kim, S. Chen, Y.-Y. Hsu, E. Chersoni, P. A, W. H. Zeng, B. Peng, Y. Li, & J. Li (Eds.), Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation (pp. 464–472). Association for Computational Linguistics. https://aclanthology.org/2023.paclic-1.46/
Marier, S. M., Chen, X., Zhu, L., & Kong, X. (2025). Grammatical error correction for low-resource languages: A review of challenges, strategies, computational and future directions. PeerJ Computer Science, 11, e3044. https://doi.org/10.7717/peerj-cs.3044
Munda, R. K., Kumar, A., Burman, R. K., Kumar, B., Alam, M. S., & Sinha, A. (2025). Comprehensive framework for analyzing grammar error correction with GPT-3.5. In 2025 International Conference on Pervasive Computational Technologies (ICPCT) (pp. 814–818). IEEE. https://ieeexplore.ieee.org/document/10940563/
Polat, B., & Kim, Y. (2014). Dynamics of complexity and accuracy: A longitudinal case study of advanced untutored development. Applied Linguistics, 35.
Qorib, M. R., & Ng, H. T. (2022). Grammatical error correction: Are we there yet? In N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, & S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics (pp. 2794–2800). International Committee on Computational Linguistics. https://aclanthology.org/2022.coling-1.246/
Rezai, A., Alibakhshi, G., Farokhipour, S., & Miri, M. (2021). A phenomenographic study on language assessment literacy: Hearing from Iranian university teachers. Language Testing in Asia, 11.
Skehan, P. (1998). A cognitive approach to language learning. Oxford University Press.
Tavakoli, P., Kendon, G., Mazhurnaya, S., & Ziomek, A. (2023). Assessment of fluency in the Test of English for Educational Purposes. Language Testing, 40, 607–629. https://doi.org/10.1177/02655322231151384
The National Institute of Korean Language. (2020). The Sejong Learner Corpus. The National Institute of Korean Language.
Virtanen, J., & Toshevska, M. (2025). A comparison of GEC tools for grammatical error correction in English. In 2025 MIPRO 48th ICT and Electronics Convention (pp. 143–148). IEEE. https://doi.org/10.1109/MIPRO65660.2025.11132067
Wei, H. (2025). Design and application of English grammar automatic correction system based on machine learning. Procedia Computer Science, 261, 806–812. https://doi.org/10.1016/j.procs.2025.04.408
Yoon, S., Park, S., Kim, G., Cho, J., Park, K., Kim, G., Seo, M., & Oh, A. (2023). Towards standardizing Korean grammatical error correction: Datasets and annotation. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (pp. 6713–6742). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.371
© The Author(s) 2025. This article is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).