Abstract

Thanks to the strong representation capability of pre-trained language models, supervised grammatical error correction has achieved promising performance. However, conventional model training depends heavily on large numbers of similarly distributed samples, and model performance drops sharply when the training and test data distributions are inconsistent. To address this issue, we propose an automatic sampling approach that selects high-quality samples from different corpora and filters out irrelevant or harmful ones. Concretely, we first provide a detailed analysis of the error type and sentence length distributions of all datasets. Second, based on the analysis results, our corpus weighting approach automatically assigns a weight to each sample, thus emphasizing beneficial samples and ignoring noisy ones. Finally, we enhance typical Seq2Seq and Seq2Edit grammatical error correction models with pre-trained language models and design a model ensemble algorithm that integrates the advantages of heterogeneous models and weighted samples. Experiments on benchmark datasets demonstrate that proper utilization of different corpora is very helpful in improving the accuracy of grammatical error correction. A detailed analysis provides further insight into the effect of different corpus weighting strategies.
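
The abstract does not specify how the per-sample weights are computed. As a rough illustration of the general idea only, the sketch below weights each training sample by the ratio of target to training frequency for its (error type, sentence length bucket) pair. This ratio-based scheme, the function names, the bucket width, and the ERRANT-style error labels are all assumptions made for this example, not the authors' method.

```python
from collections import Counter


def length_bucket(sentence, width=10):
    """Map a sentence to a coarse length bucket (token count // width)."""
    return len(sentence.split()) // width


def distribution(samples):
    """Relative frequency of each (error_type, length_bucket) pair."""
    counts = Counter((err, length_bucket(src)) for src, err in samples)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def sample_weights(train, target):
    """Hypothetical weighting: target frequency / training frequency.

    Samples whose (error type, length) pair never occurs in the target
    data receive weight 0.0, i.e. they are effectively filtered out.
    """
    p_train = distribution(train)
    p_target = distribution(target)
    weights = []
    for src, err in train:
        key = (err, length_bucket(src))
        weights.append(p_target.get(key, 0.0) / p_train[key])
    return weights


# Toy data: (source sentence, illustrative error-type label).
train = [
    ("He go to school .", "VERB:SVA"),
    ("She have a cat .", "VERB:SVA"),
    ("I like apple .", "NOUN:NUM"),
    ("Their going home .", "SPELL"),
]
target = [
    ("They was late .", "VERB:SVA"),
    ("I have two dog .", "NOUN:NUM"),
]

weights = sample_weights(train, target)
print(weights)
```

In this toy setup, the training sample whose error type is under-represented relative to the target data is up-weighted, and the one whose error type is absent from the target data is dropped. Such weights could then scale each sample's loss or sampling probability during training.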

Details

Title
Automatical sampling with heterogeneous corpora for grammatical error correction
Publication title
Volume
11
Issue
1
Pages
25
Publication year
2025
Publication date
Jan 2025
Publisher
Springer Nature B.V.
Place of publication
Heidelberg
Country of publication
Netherlands
ISSN
2199-4536
e-ISSN
2198-6053
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2024-11-12
Milestone dates
2024-05-02 (Received); 2024-10-18 (Accepted); 2024-10-23 (Registration)
First posting date
2024-11-12
ProQuest document ID
3127426845
Document URL
https://www.proquest.com/scholarly-journals/automatical-sampling-with-heterogeneous-corpora/docview/3127426845/se-2?accountid=208611
Copyright
Copyright Springer Nature B.V. Jan 2025
Last updated
2025-01-31
Database
ProQuest One Academic