Abstract

Video data is inherently multimodal and sequential. Deep learning models therefore need to aggregate all data modalities while capturing the most relevant spatio-temporal information from a given video. This paper presents a multimodal deep learning framework for video classification using a Residual Attention-based Fusion (RAF) method. Specifically, the framework extracts spatio-temporal features from each modality using a residual attention-based bidirectional Long Short-Term Memory network, then fuses the information using a weighted Support Vector Machine to handle the imbalanced data. Experimental results on a natural disaster video dataset show that our approach improves upon the state of the art by 5% and 8% on the F1 and MAP metrics, respectively. Most remarkably, our proposed residual attention model reaches a 0.95 F1 score and a 0.92 MAP on this dataset.
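
The abstract describes a two-stage pipeline: a residual attention-based bidirectional LSTM that extracts spatio-temporal features from each modality, and a weighted SVM that fuses those features while compensating for class imbalance. Below is a minimal sketch of such a pipeline, assuming PyTorch and scikit-learn; the layer sizes, the exact residual and attention wiring, and the `class_weight='balanced'` setting are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch, assuming PyTorch and scikit-learn. Layer sizes, the
# residual wiring, and class_weight='balanced' are illustrative assumptions,
# not details confirmed by the paper.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.svm import SVC

class ResidualAttentionBiLSTM(nn.Module):
    """Per-modality spatio-temporal feature extractor (hypothetical)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)          # scores each time step
        self.proj = nn.Linear(input_dim, 2 * hidden_dim)  # residual skip path

    def forward(self, x):                     # x: (batch, time, input_dim)
        h, _ = self.bilstm(x)                 # (batch, time, 2*hidden_dim)
        w = F.softmax(self.attn(h), dim=1)    # attention weights over time
        attended = (w * h).sum(dim=1)         # attention-pooled summary
        return attended + self.proj(x).mean(dim=1)  # residual connection

# Toy stand-ins for two modalities, e.g. visual and audio frame features.
visual = torch.randn(64, 30, 512)             # 64 clips, 30 time steps each
audio = torch.randn(64, 30, 128)
labels = np.array([0] * 40 + [1] * 12 + [2] * 8 + [3] * 4)  # imbalanced classes

with torch.no_grad():
    f_vis = ResidualAttentionBiLSTM(512, 128)(visual).numpy()
    f_aud = ResidualAttentionBiLSTM(128, 64)(audio).numpy()

# Fusion stage: concatenate the per-modality features and train a
# class-weighted SVM; 'balanced' weights classes inversely to frequency.
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(np.hstack([f_vis, f_aud]), labels)
```

The residual path here simply adds a projection of the raw input sequence to the attention-pooled BiLSTM summary, and the balanced class weights make misclassifying rare classes costlier, one common way to counter imbalanced training data.
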
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/CVPRW.2019.00064 | 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) |

Keywords | Field | DocType
---|---|---|
imbalanced data, natural disaster video dataset, residual attention model, video classification, video data, data modalities, relevant spatio-temporal information, multimodal deep learning framework, Residual Attention-based Fusion method, spatio-temporal features, residual attention-based bidirectional Long Short-Term Memory, weighted Support Vector Machine | Computer vision, Residual, Pattern recognition, Computer science, Fusion, Artificial intelligence | Conference

ISSN | ISBN | Citations
---|---|---|
2160-7508 | 978-1-7281-2507-7 | 0

PageRank | References | Authors
---|---|---|
0.34 | 1 | 3

Name | Order | Citations | PageRank |
---|---|---|---|
Samira Pouyanfar | 1 | 141 | 13.06 |
Tianyi Wang | 2 | 294 | 27.78 |
Shu-Ching Chen | 3 | 1978 | 182.74 |