Title
CCTC: A Cross-Sentence Chinese Text Correction Dataset for Native Speakers.
Abstract
The Chinese text correction (CTC) focuses on detecting and correcting Chinese spelling errors and grammatical errors. Most existing datasets of Chinese spelling check (CSC) and Chinese grammatical error correction (GEC) are focused on a single sentence written by Chinese-as-a-second-language (CSL) learners. We find that errors caused by native speakers differ significantly from those produced by non-native speakers. These differences make it inappropriate to use the existing test sets directly to evaluate text correction systems for native speakers. Some errors also require the cross-sentence information to be identified and corrected. In this paper, we propose a cross-sentence Chinese text correction dataset for native speakers. Concretely, we manually annotated 1,500 texts written by native speakers. The dataset consists of 30,811 sentences and more than 1,000,000 Chinese characters. It contains four types of errors: spelling errors, redundant words, missing words, and word ordering errors. We also test some state-of-the-art models on the dataset. The experimental results show that even the model with the best performance is 20 points lower than humans, which indicates that there is still much room for improvement. We hope that the new dataset can fill the gap in cross-sentence text correction for native Chinese speakers.
Year
Venue
DocType
2022
International Conference on Computational Linguistics
Conference
Volume
Citations 
PageRank 
Proceedings of the 29th International Conference on Computational Linguistics
0
0.34
References 
Authors
0
6
Name
Order
Citations
PageRank
Baoxin Wang101.35
Xingyi Duan200.68
Dayong Wu373.11
Wanxiang Che471166.39
Zhigang Chen520434.10
Guoping Hu630937.32