Title
Language-guided Multi-Modal Fusion for Video Action Recognition
Abstract
A recent study [30] found that training a multi-modal network often yields a network that has not learned proper parameters for video action recognition. These multi-modal models behave normally during training but fall short of their single-modality counterparts at test time. The cause of this performance drop is likely two-fold. First, conventional methods rely on a weak fusion mechanism in which each modality is trained separately and the outputs are simply combined (e.g., late feature fusion). Second, collecting videos is far more expensive than collecting images, and the resulting shortage of video data can hardly support training a multi-modal network with a larger and more complex weight space. In this paper, we propose Language-guided Multi-Modal Fusion to address the weak-fusion problem. A carefully designed bi-modal video encoder fuses the audio and visual signals to produce a finer video representation. To avoid over-fitting, we use language-guided contrastive learning to substantially augment the video data and support the training of the multi-modal network. On a large-scale benchmark video dataset, the proposed method successfully improves the accuracy of video action recognition.
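The abstract does not spell out the fusion or contrastive objective, so the following is a minimal sketch only, not the authors' released code: a simple bi-modal (audio + visual) encoder and a CLIP-style symmetric InfoNCE loss aligning fused video embeddings with text embeddings, which is one common way a "language-guided contrastive learning" objective is realized. All module names, feature dimensions, and the fusion head are illustrative assumptions.

```python
# Sketch of language-guided contrastive training for a bi-modal video encoder.
# Assumptions: pre-extracted visual/audio features and a frozen text encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiModalVideoEncoder(nn.Module):
    """Fuses per-clip visual and audio features into a single video embedding."""

    def __init__(self, visual_dim=2048, audio_dim=128, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # Simple MLP fusion head; the paper describes a more sophisticated block.
        self.fusion = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, visual_feats, audio_feats):
        v = self.visual_proj(visual_feats)
        a = self.audio_proj(audio_feats)
        return self.fusion(torch.cat([v, a], dim=-1))


def language_guided_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss between video and text embeddings.

    Matched (video, text) pairs along the batch dimension are positives;
    all other pairings in the batch serve as negatives.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy batch: 4 clips with visual/audio features and 4 matching text
    # embeddings (e.g., encoded action labels or captions).
    encoder = BiModalVideoEncoder()
    video_emb = encoder(torch.randn(4, 2048), torch.randn(4, 128))
    text_emb = torch.randn(4, 512)
    loss = language_guided_contrastive_loss(video_emb, text_emb)
    print(loss.item())
```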
Year
2021
DOI
10.1109/ICCVW54120.2021.00354
Venue
2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW 2021)
DocType
Conference
Volume
2021
Issue
1
ISSN
2473-9936
Citations
0
PageRank
0.34
References
8
Authors
3
Name            Order    Citations    PageRank
Jenhao Hsiao    1        2            1.42
Yikang Li       2        0            0.34
Chiuman Ho      3        2            1.42