Abstract | ||
---|---|---|
Scene text recognition, which detects and recognizes text in images, has attracted extensive research interest. Attention-based methods for scene text recognition have achieved competitive performance. In these methods, the attention mechanism is usually combined with RNN structures as a module that predicts the results. However, RNN attention-based methods sometimes struggle to converge because of vanishing or exploding gradients during training, and RNNs cannot be computed in parallel. To remedy these issues, we propose a Swin Transformer-based encoder-decoder mechanism that relies entirely on the self-attention mechanism (SAM) and can be computed in parallel. SAM is an efficient text recognizer composed of only two components: 1) an encoder based on the Swin Transformer that extracts the visual information of the input image, and 2) a Transformer-based decoder that produces the final results by applying self-attention to the encoder output. Because the scale of scene text varies widely across images, we apply the Swin Transformer to compute visual features with shifted windows, which limits self-attention computation to non-overlapping local windows while still permitting cross-window connections. Our method improves accuracy over previous methods on ICDAR2003, ICDAR2013, SVT, SVT-P, CUTE and ICDAR2015 by 0.9%, 3.2%, 0.8%, 1.3%, 1.7% and 1.1%, respectively. Notably, our method achieves the fastest prediction time, 0.02 s per image. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1007/978-3-030-98358-1_35 | MULTIMEDIA MODELING (MMM 2022), PT I |
Keywords | DocType | Volume |
Scene text recognition, Swin transformer, Attention | Conference | 13141 |
ISSN | Citations | PageRank |
0302-9743 | 0 | 0.34 |
References | Authors | |
0 | 5 |