Abstract
---
Deep convolutional neural networks (CNNs) have achieved great success in human action recognition; however, they remain limited in understanding complex and noisy videos owing to the difficulty of exploiting appearance and motion information. Most existing works have been devoted to designing CNN architectures, overlooking the quality of the network inputs, which is of great importance. This paper provides an alternative route to improving action recognition by focusing on the quality of network inputs. A multi-task video salient object detection approach with an object-of-interest segmentation scheme, which takes both human and action-relevant cues into account, is proposed to immunize the input video against background clutter. Further, a simple spatiotemporal residual network architecture is presented, which operates on multiple high-quality inputs for long-term action representation learning. Empirical evaluations on various challenging datasets demonstrate that the proposed framework performs competitively against the state of the art. Beyond improved performance, learning representations of saliency helps prevent the action recognition model from overfitting and speeds up training convergence.
Year | DOI | Venue
---|---|---
2022 | 10.1109/TMM.2021.3066775 | IEEE Transactions on Multimedia
Keywords | DocType | Volume
---|---|---
Object detection, Three-dimensional displays, Spatiotemporal phenomena, Computer architecture, Task analysis, Solid modeling, Noise reduction, Action recognition, high-quality inputs, salient object detection, spatiotemporal CNNs | Journal | 24
ISSN | Citations | PageRank
---|---|---
1520-9210 | 0 | 0.34
References | Authors
---|---
0 | 3
Name | Order | Citations | PageRank
---|---|---|---
Yongqiang Kong | 1 | 4 | 1.79 |
Yunhong Wang | 2 | 3816 | 278.50 |
Annan Li | 3 | 4 | 3.08 |