CCTC: A Cross-Sentence Chinese Text Correction Dataset for Native Speakers. - Citegraph

Paper Info

Title
CCTC: A Cross-Sentence Chinese Text Correction Dataset for Native Speakers.

Abstract
Chinese text correction (CTC) focuses on detecting and correcting Chinese spelling errors and grammatical errors. Most existing datasets for Chinese spelling check (CSC) and Chinese grammatical error correction (GEC) focus on single sentences written by Chinese-as-a-second-language (CSL) learners. We find that errors made by native speakers differ significantly from those produced by non-native speakers. These differences make it inappropriate to use the existing test sets directly to evaluate text correction systems for native speakers. Some errors also require cross-sentence information to be identified and corrected. In this paper, we propose a cross-sentence Chinese text correction dataset for native speakers. Concretely, we manually annotated 1,500 texts written by native speakers. The dataset consists of 30,811 sentences and more than 1,000,000 Chinese characters. It contains four types of errors: spelling errors, redundant words, missing words, and word ordering errors. We also test some state-of-the-art models on the dataset. The experimental results show that even the best-performing model scores 20 points lower than humans, which indicates that there is still much room for improvement. We hope that the new dataset can fill the gap in cross-sentence text correction for native Chinese speakers.

Year	Venue	DocType
2022	International Conference on Computational Linguistics	Conference
Volume	Citations	PageRank
Proceedings of the 29th International Conference on Computational Linguistics	0	0.34
References	Authors
0	6

Authors (6 rows)

Name	Order	Citations	PageRank
Baoxin Wang	1	0	1.35
Xingyi Duan	2	0	0.68
Dayong Wu	3	7	3.11
Wanxiang Che	4	711	66.39
Zhigang Chen	5	204	34.10
Guoping Hu	6	309	37.32