Title
Spatiotemporal Saliency Representation Learning for Video Action Recognition
Abstract
Deep convolutional neural networks (CNNs) have achieved great success in human action recognition, yet they remain limited in understanding complex and noisy videos because appearance and motion information is difficult to exploit. Most existing works have been devoted to designing CNN architectures, overlooking the quality of network inputs, which is of great importance. This paper offers an alternative route to improving action recognition by focusing on the quality of network inputs. A multi-task video salient object detection approach with an object-of-interest segmentation scheme, which takes into account both human and action-relevant cues, is proposed to shield the input video from background clutter. Furthermore, a simple spatiotemporal residual network architecture is presented that operates on multiple high-quality inputs for long-term action representation learning. Empirical evaluations on several challenging datasets demonstrate that the proposed framework performs competitively against the state of the art. Beyond better accuracy, learning saliency representations helps prevent the action recognition model from overfitting and speeds up training convergence.
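The abstract only describes the pipeline at a high level. As a rough illustration of the general idea (saliency-gated inputs fed to a spatiotemporal residual CNN), the sketch below masks a video clip with per-frame saliency maps before classification. It is a minimal, assumption-based sketch: the class name SaliencyGatedActionNet is invented, torchvision's off-the-shelf R3D-18 stands in for the paper's architecture, and the multi-task salient object detector itself is not implemented.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


class SaliencyGatedActionNet(nn.Module):
    """Hypothetical sketch: damp background clutter by gating the input
    clip with soft saliency masks, then classify the cleaned clip with a
    3D residual network (torchvision's R3D-18 as a stand-in backbone)."""

    def __init__(self, num_classes: int = 101):
        super().__init__()
        self.backbone = r3d_18(weights=None)
        # Replace the 400-way Kinetics head with the target class count.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, clip: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # clip:     (B, C, T, H, W) raw RGB frames
        # saliency: (B, 1, T, H, W) soft masks in [0, 1], assumed to come
        #           from a video salient object detector (not shown here)
        gated = clip * saliency  # keep salient pixels, suppress background
        return self.backbone(gated)


if __name__ == "__main__":
    model = SaliencyGatedActionNet(num_classes=101)
    clip = torch.randn(2, 3, 16, 112, 112)
    saliency = torch.rand(2, 1, 16, 112, 112)
    logits = model(clip, saliency)
    print(logits.shape)  # torch.Size([2, 101])
```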
Year
2022
DOI
10.1109/TMM.2021.3066775
Venue
IEEE TRANSACTIONS ON MULTIMEDIA
Keywords
Object detection, Three-dimensional displays, Spatiotemporal phenomena, Computer architecture, Task analysis, Solid modeling, Noise reduction, Action recognition, high-quality inputs, salient object detection, spatiotemporal CNNs
DocType
Journal
Volume
24
ISSN
1520-9210
Citations
0
PageRank
0.34
References
0
Authors
3
Name            Order  Citations  PageRank
Yongqiang Kong  1      4          1.79
Yunhong Wang    2      3816       278.50
Annan Li        3      4          3.08