Character recognition of Tibetan Historical document in Uchen font: Dataset and bench mark - Citegraph

Paper Info

Title
Character recognition of Tibetan Historical document in Uchen font: Dataset and bench mark

Abstract
THCU, an offline character dataset of Tibetan historical documents in Uchen font, is presented to facilitate research on Tibetan historical document recognition. THCU comprises two subsets: THCU-M and THCU-S. THCU-M is annotated manually on the original document images and contains 121214 character samples in 238 character categories. THCU-S is a simulated dataset whose samples are generated by component combination. THCU-S is itself divided into four subsets, with 7238, 2908, 562 and 245 character categories and 5000, 3000, 600 and 600 samples per category, respectively. We also evaluate the THCU dataset with a CNN-based model to establish a baseline performance. The experiments show that adding the generated samples greatly improves the model's performance on the real data.
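
The abstract gives category counts and per-category sample counts for the four THCU-S subsets but not the resulting totals. The short Python sketch below works them out. It assumes every category in a subset holds exactly the stated number of samples, and it numbers the subsets 1 to 4 as placeholders, since the abstract does not name them.

```python
# Figures copied from the abstract. The subset numbering (1-4) is a
# placeholder for illustration, not a naming scheme from the paper.
THCU_M_SAMPLES = 121_214

# (character categories, samples per category) for each THCU-S subset
THCU_S_SUBSETS = [
    (7238, 5000),
    (2908, 3000),
    (562, 600),
    (245, 600),
]


def subset_sizes(subsets):
    """Return the sample count of each subset, assuming uniform categories."""
    return [categories * per_category for categories, per_category in subsets]


sizes = subset_sizes(THCU_S_SUBSETS)
thcu_s_total = sum(sizes)

for index, size in enumerate(sizes, start=1):
    print(f"THCU-S subset {index}: {size:,} samples")
print(f"THCU-S total: {thcu_s_total:,} samples")
print(f"THCU-M (manual): {THCU_M_SAMPLES:,} samples")
```

Under that assumption the generated data comes to 45,398,200 samples, far more than the 121,214 manually annotated ones.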

Year	2022
DOI	10.3233/JCM-226167
Venue	JOURNAL OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING
Keywords	Tibetan Historical document, character recognition, dataset, sample generation
DocType	Journal
Volume	22
Issue	5
ISSN	1472-7978
Citations	0
PageRank	0.34
References	0
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Zhenjiang Li	1	0	0.34
Weilan Wang	2	9	11.75
Wang Yiqun	3	226	17.63
Qianxue Zhang	4	0	0.34
