Title
SAM: Self Attention Mechanism for Scene Text Recognition Based on Swin Transformer
Abstract
Scene text recognition, which detects and recognizes text in natural images, has attracted extensive research interest. Attention-based methods for scene text recognition have achieved competitive performance. In these methods, the attention mechanism is usually combined with an RNN as the module that predicts the results. However, RNN-based attention methods are sometimes hard to train because of vanishing/exploding gradients, and RNNs cannot be computed in parallel. To remedy this issue, we propose a Swin Transformer-based encoder-decoder mechanism, which relies entirely on the self attention mechanism (SAM) and can be computed in parallel. SAM is an efficient text recognizer formed by only two components: 1) a Swin Transformer-based encoder that extracts visual features from the input image, and 2) a Transformer-based decoder that produces the final results by applying self attention to the encoder output. Considering that the scale of scene text varies widely across images, we apply the Swin Transformer to compute visual features with shifted windows, which limits self attention computation to non-overlapping local windows while allowing cross-window connections. Our method improves accuracy over previous methods on ICDAR2003, ICDAR2013, SVT, SVT-P, CUTE and ICDAR2015 by 0.9%, 3.2%, 0.8%, 1.3%, 1.7% and 1.1%, respectively. In addition, our method achieves the fastest prediction time of 0.02 s per image.
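To make the encoder-decoder pipeline described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch: a plain patch-embedding encoder with standard self-attention stands in for the paper's Swin Transformer encoder (shifted-window attention is omitted), and a standard Transformer decoder attends to the encoder output to predict characters. All names (e.g. SAMSketch) and hyperparameters are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class SAMSketch(nn.Module):
    def __init__(self, vocab_size=97, d_model=256, nhead=8,
                 enc_layers=4, dec_layers=4, patch=4, max_len=25):
        super().__init__()
        # Encoder stand-in: split the image into patches and apply self-attention
        # (the paper uses a Swin Transformer with shifted windows here).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        # Decoder: character embeddings attend to the encoder memory.
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, dec_layers)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        # images: (B, 3, H, W); tgt_tokens: (B, T) previous characters (teacher forcing).
        feats = self.patch_embed(images)                          # (B, D, H/p, W/p)
        memory = self.encoder(feats.flatten(2).transpose(1, 2))   # (B, N, D) patch sequence
        tgt = self.char_embed(tgt_tokens) + self.pos_embed[:, :tgt_tokens.size(1)]
        # Causal mask: each position may only attend to earlier characters.
        T = tgt_tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tgt.device), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.classifier(out)                               # (B, T, vocab_size)

model = SAMSketch()
logits = model(torch.randn(2, 3, 32, 128), torch.zeros(2, 25, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 25, 97])

Because every prediction step is a matrix operation over the whole sequence rather than a recurrent update, both the encoder and the decoder can be computed in parallel during training, which is the property the abstract contrasts with RNN-based attention.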
Year
2022
DOI
10.1007/978-3-030-98358-1_35
Venue
MULTIMEDIA MODELING (MMM 2022), PT I
Keywords
Scene text recognition, Swin transformer, Attention
DocType
Conference
Volume
13141
ISSN
0302-9743
Citations
0
PageRank
0.34
References
0
Authors
5
Name         Order  Citations  PageRank
Xiang Shuai  1      0          0.34
Xiao Wang    2      2          6.24
Wei Wang     3      0          0.34
Xin Yuan     4      1089       92.27
Xin Xu       5      1365       100.22