Title
A Sequence-selective Fine-grained Image Recognition Strategy Using Vision Transformer
Abstract
Fine-grained image recognition aims at precise sub-category classification of images, which requires algorithms with a strong ability to extract subtle features. Recently, the Transformer architecture has been successfully applied to vision tasks, offering a new way to improve the feature extraction performance of fine-grained image recognition algorithms. However, fine-grained image datasets are usually small, which is unfavorable for the data-hungry training process of Transformers. To increase the amount of data available for training, in this paper we first introduce a stochastic image data augmentation method for the Vision Transformer (ViT), which uses a Dense-DETR model to extract feature regions and performs random insertion and removal on the transformed patch sequence. To select the most informative sequence elements during forward propagation, we implement a feature patch selection strategy by adding a convolutional network structure to the ViT encoders. Inspired by active learning, a contrastive loss that uses the posterior information of paired images is also introduced as a penalty term in ViT's cross-entropy loss objective. These strategies help the ViT extract the most discriminative feature information from its input. Extensive experiments show that the proposed sequence-selective Vision Transformer reaches the highest recognition accuracies on several frequently used fine-grained image datasets.
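The abstract does not spell out how the random insertion and removal on the patch sequence is carried out, so the following is only a minimal sketch of one way such an augmentation could look. The function name `augment_patch_sequence`, the probabilities `p_remove` and `p_insert`, and the `feature_patches` argument (standing in for patches cut from detector-proposed feature regions) are all hypothetical and not taken from the paper.

```python
import random


def augment_patch_sequence(patches, feature_patches, p_remove=0.1,
                           p_insert=0.1, rng=None):
    """Return a stochastically perturbed copy of a patch sequence.

    patches         -- list of patch tokens in sequence order
    feature_patches -- candidate patches from feature regions
                       (assumed to come from a separate detector)
    p_remove        -- probability of dropping each original patch
    p_insert        -- probability of inserting a feature patch
                       after each kept patch
    """
    if rng is None:
        rng = random.Random()
    augmented = []
    for patch in patches:
        # Random removal: drop this patch with probability p_remove.
        if rng.random() < p_remove:
            continue
        augmented.append(patch)
        # Random insertion: follow a kept patch with a feature patch.
        if feature_patches and rng.random() < p_insert:
            augmented.append(rng.choice(feature_patches))
    return augmented


if __name__ == "__main__":
    sequence = ["p0", "p1", "p2", "p3", "p4", "p5"]
    regions = ["head", "wing"]
    print(augment_patch_sequence(sequence, regions, rng=random.Random(0)))
```

Because the output length varies from call to call, a real pipeline would presumably also need to handle positional embeddings and padding for the changed sequence; that part is left out here.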
Year
2022
DOI
10.1109/IST55454.2022.9827667
Venue
2022 IEEE International Conference on Imaging Systems and Techniques (IST)
Keywords
fine-grained image recognition, Vision Transformer, self-attention
DocType
Conference
ISSN
1558-2809
ISBN
978-1-6654-8103-8
Citations
0
PageRank
0.34
References
2
Authors
3
Name
Order
Citations
PageRank
Yulin Cai100.34
Wang H27129.35
Xingzheng Wang300.34