Title
Dual-stream cross-modality fusion transformer for RGB-D action recognition
Abstract
RGB-D-based action recognition can achieve accurate and robust performance because the two modalities carry rich complementary information, and it therefore has many application scenarios. However, existing works either combine the modalities by late fusion or learn multimodal representations with simple feature-level fusion, and thus fail to effectively exploit complementary semantic information or model the interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expressive ability of the model with few extra parameters and little computational overhead. Integrating these two modules with BEFs yields one basic fusion layer of the cross-modality fusion transformer. We apply this transformer on top of dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, the DSCMT still yields considerable improvements when the convolutional backbones are changed or when it is applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT.
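As a rough illustration of the architecture the abstract describes, below is a minimal PyTorch sketch of one fusion layer (MEM, MIM, and BEF). The module names follow the abstract, but every concrete choice, including the embedding size, head count, residual placement, and the squeeze-and-excitation-style gating inside the BEF, is an assumption made for illustration and not taken from the paper or its repository.

import torch
import torch.nn as nn

class BEF(nn.Module):
    # Hypothetical bottleneck excitation feed-forward: a low-rank MLP whose
    # hidden features are re-weighted by a learned channel gate (SE-style),
    # keeping the extra parameters and computation small.
    def __init__(self, dim, ratio=4):
        super().__init__()
        hidden = dim // ratio                       # bottleneck dimension
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.gate = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid())
        self.act = nn.GELU()

    def forward(self, x):                           # x: (B, N, dim)
        h = self.act(self.fc1(x))
        h = h * self.gate(h.mean(dim=1, keepdim=True))   # excite channels
        return self.fc2(h)

class FusionLayer(nn.Module):
    # One basic fusion layer: MEM (self-attention within each modality)
    # followed by MIM (cross-attention between modalities), with each
    # stream finished by a BEF; all sublayers use residual connections.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.mem_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mem_d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mim_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mim_d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bef_r, self.bef_d = BEF(dim), BEF(dim)

    def forward(self, rgb, dep):                     # each: (B, N, dim) tokens
        # MEM: enhance each modality with self-attention
        r = rgb + self.mem_r(rgb, rgb, rgb)[0]
        d = dep + self.mem_d(dep, dep, dep)[0]
        # MIM: each stream queries the other modality (cross-attention)
        r, d = r + self.mim_r(r, d, d)[0], d + self.mim_d(d, r, r)[0]
        # BEF: lightweight feed-forward on each fused stream
        return r + self.bef_r(r), d + self.bef_d(d)

x_rgb = torch.randn(2, 49, 256)    # e.g. a flattened 7x7 ConvNet feature map
x_dep = torch.randn(2, 49, 256)
f_rgb, f_dep = FusionLayer()(x_rgb, x_dep)   # fused (2, 49, 256) features

In the full DSCMT, layers like this would sit on top of the two ConvNet streams, with each stream's feature maps flattened into token sequences before fusion.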
Year
2022
DOI
10.1016/j.knosys.2022.109741
Venue
Knowledge-Based Systems
Keywords
Action recognition, Multimodal fusion, Transformer, ConvNets, Dual-stream
DocType
Journal
Volume
255
ISSN
0950-7051
Citations
0
PageRank
0.34
References
0
Authors
6
Name            Order  Citations  PageRank
Zhiyu Liu       1      161        0.55
Jun Cheng       2      8516       9.84
Libo Liu        3      0          0.34
Ziliang Ren     4      0          0.34
Qieshi Zhang    5      131        0.44
Chengqun Song   6      0          0.34