Title
Spatiotemporal Saliency Representation Learning for Video Action Recognition
Abstract
Deep convolutional neural networks (CNNs) have achieved great success in human action recognition, yet they remain limited in understanding complex and noisy videos because appearance and motion information is difficult to exploit. Most existing works have been devoted to designing CNN architectures, overlooking the quality of network inputs, which is of great importance. This paper offers an alternative route to improving action recognition by focusing on the quality of network inputs. A multi-task video salient object detection approach with an object-of-interest segmentation scheme, which takes into account both human and action-relevant cues, is proposed to shield the input video from background clutter. Furthermore, a simple spatiotemporal residual network architecture is presented that operates on multiple high-quality inputs for long-term action representation learning. Empirical evaluations on several challenging datasets demonstrate that the proposed framework performs competitively against the state of the art. Beyond better accuracy, learning saliency representations helps prevent the action recognition model from overfitting and speeds up training convergence.
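The abstract only describes the pipeline at a high level. As a rough illustration of the general idea (saliency-gated inputs fed to a spatiotemporal residual CNN), the sketch below masks a video clip with per-frame saliency maps before classification. It is a minimal, assumption-based sketch: the class name SaliencyGatedActionNet is invented, torchvision's off-the-shelf R3D-18 stands in for the paper's architecture, and the multi-task salient object detector itself is not implemented.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


class SaliencyGatedActionNet(nn.Module):
    """Hypothetical sketch: damp background clutter by gating the input
    clip with soft saliency masks, then classify the cleaned clip with a
    3D residual network (torchvision's R3D-18 as a stand-in backbone)."""

    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.backbone = r3d_18(weights=None)
        # Replace the 400-way Kinetics head with the target class count.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, clip: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # clip:     (B, C, T, H, W) raw RGB frames
        # saliency: (B, 1, T, H, W) soft masks in [0, 1], assumed to come
        #           from a video salient object detector (not shown here)
        gated = clip * saliency  # keep salient pixels, suppress background
        return self.backbone(gated)


if __name__ == "__main__":
    model = SaliencyGatedActionNet(num_classes=101)
    clip = torch.randn(2, 3, 16, 112, 112)
    saliency = torch.rand(2, 1, 16, 112, 112)
    logits = model(clip, saliency)
    print(logits.shape)  # torch.Size([2, 101])
```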
Year
2022
DOI
10.1109/TMM.2021.3066775
Venue
IEEE TRANSACTIONS ON MULTIMEDIA
Keywords
Object detection, Three-dimensional displays, Spatiotemporal phenomena, Computer architecture, Task analysis, Solid modeling, Noise reduction, Action recognition, high-quality inputs, salient object detection, spatiotemporal CNNs
DocType
Journal
Volume
24
ISSN
1520-9210
Citations
0
PageRank
0.34
References
0
Authors
3
Name            Order  Citations  PageRank
Yongqiang Kong  1      4          1.79
Yunhong Wang    2      3816       278.50
Annan Li        3      4          3.08