Title
Language-guided Multi-Modal Fusion for Video Action Recognition
Abstract
A recent study [30] found that training a multi-modal network often yields a network that has not learned proper parameters for video action recognition. These multi-modal models behave normally during training but fall short of their single-modality counterparts at test time. The cause of this performance drop is likely two-fold. First, conventional methods rely on a weak fusion mechanism in which each modality is trained separately and the outputs are simply combined (e.g., late feature fusion). Second, collecting videos is far more expensive than collecting images, and the resulting shortage of video data can hardly support training a multi-modal network with a larger and more complex weight space. In this paper, we propose Language-guided Multi-Modal Fusion to address the weak-fusion problem. A carefully designed bi-modal video encoder fuses the audio and visual signals to produce a finer video representation. To avoid over-fitting, we use language-guided contrastive learning to substantially augment the video data and support the training of the multi-modal network. On a large-scale benchmark video dataset, the proposed method successfully improves the accuracy of video action recognition.
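The abstract does not spell out the fusion or contrastive objective, so the following is a minimal sketch only, not the authors' released code: a simple bi-modal (audio + visual) encoder and a CLIP-style symmetric InfoNCE loss aligning fused video embeddings with text embeddings, which is one common way a "language-guided contrastive learning" objective is realized. All module names, feature dimensions, and the fusion head are illustrative assumptions.

```python
# Sketch of language-guided contrastive training for a bi-modal video encoder.
# Assumptions: pre-extracted visual/audio features and a frozen text encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiModalVideoEncoder(nn.Module):
    """Fuses per-clip visual and audio features into a single video embedding."""

    def __init__(self, visual_dim=2048, audio_dim=128, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # Simple MLP fusion head; the paper describes a more sophisticated block.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, visual_feats, audio_feats):
        v = self.visual_proj(visual_feats)
        a = self.audio_proj(audio_feats)
        return self.fusion(torch.cat([v, a], dim=-1))


def language_guided_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between video and text embeddings.

    Matched (video, text) pairs along the batch dimension are positives;
    all other pairings in the batch serve as negatives.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy batch: 4 clips with visual/audio features and 4 matching text
    # embeddings (e.g., encoded action labels or captions).
    encoder = BiModalVideoEncoder()
    video_emb = encoder(torch.randn(4, 2048), torch.randn(4, 128))
    text_emb = torch.randn(4, 512)
    loss = language_guided_contrastive_loss(video_emb, text_emb)
    print(loss.item())
```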
Year
2021
DOI
10.1109/ICCVW54120.2021.00354
Venue
2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2021)
DocType
Conference
Volume
2021
Issue
1
ISSN
2473-9936
Citations
0
PageRank
0.34
References
8
Authors
3
Name            Order    Citations    PageRank
Jenhao Hsiao    1        2            1.42
Yikang Li       2        0            0.34
Chiuman Ho      3        2            1.42