Title
Character recognition of Tibetan Historical document in Uchen font: Dataset and benchmark
Abstract
An offline character dataset of Tibetan historical documents in Uchen font, THCU, is presented to facilitate research on Tibetan historical document recognition. THCU comprises two subsets: THCU-M and THCU-S. THCU-M is annotated manually on original document images and contains 121214 character samples in 238 character categories. THCU-S is a simulated dataset whose samples are generated by component combination. THCU-S itself has four subsets, with 7238, 2908, 562 and 245 character categories and 5000, 3000, 600 and 600 samples per category, respectively. We also evaluate THCU with a CNN-based model to establish a baseline. The experiments show that adding the generated samples greatly improves the model's performance on the real data.
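The subset sizes implied by these figures can be worked out directly. The sketch below is only arithmetic on the numbers quoted in the abstract. It assumes that every category in a THCU-S subset holds exactly the stated number of samples and that the two lists pair up in the order given. The variable names are illustrative and do not come from the paper.

```python
# Category counts and per-category sample counts for the four THCU-S subsets,
# paired in the order the abstract lists them (assumed pairing).
categories = [7238, 2908, 562, 245]
samples_per_category = [5000, 3000, 600, 600]

# Size of each simulated subset, assuming a uniform count per category.
subset_sizes = [c * n for c, n in zip(categories, samples_per_category)]
total_simulated = sum(subset_sizes)

# THCU-M figures as stated in the abstract.
real_samples = 121214
real_categories = 238
mean_real_per_category = real_samples / real_categories

for size in subset_sizes:
    print(size)
print(total_simulated)
print(round(mean_real_per_category, 1))
```

Under these assumptions the four simulated subsets hold 36,190,000, 8,724,000, 337,200 and 147,000 samples, or 45,398,200 in total, against an average of roughly 509 manually annotated samples per category in THCU-M.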
Year: 2022
DOI: 10.3233/JCM-226167
Venue: JOURNAL OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING
Keywords: Tibetan Historical document, character recognition, dataset, sample generation
DocType: Journal
Volume: 22
Issue: 5
ISSN: 1472-7978
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name, Order, Citations, PageRank
Zhenjiang Li, 1, 0, 0.34
Weilan Wang2911.75
Wang Yiqun322617.63
Qianxue Zhang, 4, 0, 0.34