Title
Spatial-Temporal Information Aggregation and Cross-Modality Interactive Learning for RGB-D-Based Human Action Recognition
Abstract
RGB-D-based human action recognition is gaining increasing attention because the different modalities can provide complementary information. However, recognition performance is still unsatisfactory due to the limited ability to learn spatial-temporal features and insufficient inter-modal interaction. In this paper, we propose a novel approach for RGB-D human action recognition that aggregates spatial-temporal information and performs cross-modality interactive learning. First, a spatial-temporal information aggregation module (STIAM) is proposed, which uses simple convolutional neural networks (CNNs) to efficiently aggregate the spatial-temporal information of an entire RGB-D sequence into lightweight representations. This allows the model to extract richer spatial-temporal features with little extra memory and computational cost. Second, a cross-modality interactive module (CMIM) is proposed to fully fuse the complementary multi-modal information. Moreover, a multi-modal interactive network (MMINet) for RGB-D-based action recognition is constructed by embedding the above two modules into two-stream CNNs. To verify the generality of our approach, two backbones are deployed in the two-stream architecture in turn. Ablation experiments demonstrate that the proposed STIAM brings significant improvements in action recognition, and that CMIM further exploits the complementary features of the multiple modalities. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets prove the effectiveness of the proposed approach, which achieves state-of-the-art accuracies of 94.3% and 96.5% for cross-subject and cross-view evaluation on NTU RGB+D 60, 91.7% and 92.6% for cross-subject and cross-setup evaluation on NTU RGB+D 120, and 93.6% and 94.2% for cross-subject and cross-view evaluation on PKU-MMD. Further analysis shows that our approach has advantages in recognizing subtle actions.
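The abstract names the components but not their internals; the following PyTorch-style sketch shows one way such a two-stream design could be wired together. Everything in it is an illustrative assumption: the class names STIAM, CMIM, and MMINet come from the abstract, but the temporal convolution used for aggregation, the channel-attention fusion, and all layer shapes are hypothetical choices, not the authors' implementation.

```python
# Hypothetical sketch of a two-stream RGB-D network with a
# spatial-temporal aggregation module and a cross-modality interactive
# module. Internals are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class STIAM(nn.Module):
    """Aggregates a T-frame clip into one lightweight frame-like tensor
    with a small temporal convolution, so a 2D backbone sees temporal
    context at little extra cost."""
    def __init__(self, in_ch=3, T=8):
        super().__init__()
        # Collapse the T temporal positions per pixel into one.
        self.temporal = nn.Conv3d(in_ch, in_ch, kernel_size=(T, 1, 1))

    def forward(self, x):                 # x: (B, C, T, H, W)
        return self.temporal(x).squeeze(2)  # -> (B, C, H, W)

class CMIM(nn.Module):
    """Exchanges complementary information between the two streams via
    channel gates computed from the concatenated features."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * ch, 2 * ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb, f_depth):
        g = self.gate(torch.cat([f_rgb, f_depth], dim=1))
        g_rgb, g_d = g.chunk(2, dim=1)
        # Each stream is modulated by gates that also saw the other modality.
        return f_rgb * g_rgb, f_depth * g_d

class MMINet(nn.Module):
    """Two-stream network with STIAM before each backbone and CMIM after."""
    def __init__(self, backbone_fn, feat_ch=512, num_classes=60):
        super().__init__()
        self.agg_rgb, self.agg_depth = STIAM(), STIAM()
        self.rgb, self.depth = backbone_fn(), backbone_fn()  # two streams
        self.cmim = CMIM(feat_ch)
        self.fc = nn.Linear(2 * feat_ch, num_classes)

    def forward(self, rgb, depth):        # each: (B, 3, T, H, W)
        f_r = self.rgb(self.agg_rgb(rgb))        # (B, feat_ch, h, w)
        f_d = self.depth(self.agg_depth(depth))
        f_r, f_d = self.cmim(f_r, f_d)
        pooled = torch.cat([f_r.mean((2, 3)), f_d.mean((2, 3))], dim=1)
        return self.fc(pooled)
```

In this sketch, backbone_fn could be any 2D feature extractor, for example a torchvision ResNet-18 with its pooling and classifier layers removed (feat_ch=512); routing the STIAM output into a 2D backbone is what would keep the extra memory and compute cost small, as the abstract claims.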
Year
2022
DOI
10.1109/ACCESS.2022.3201227
Venue
IEEE ACCESS
Keywords
Feature extraction, Data mining, Deep learning, Optical flow, Three-dimensional displays, Histograms, Costs, Interactive systems, Learning systems, Human factors, action recognition, cross-modality interactive learning, information aggregation
DocType
Journal
Volume
10
ISSN
2169-3536
Citations
0
PageRank
0.34
References
0
Authors
5
Name           Order   Citations   PageRank
Qin Cheng      1       0           0.34
Zhiyu Liu      2       161         0.55
Ziliang Ren    3       0           0.34
jun cheng      4       8516        9.84
Jianming Liu   5       0           0.34