Abstract | ||
---|---|---|
Recently, multimodal emotion recognition has become a hot research topic within the affective computing community due to its robust performance. In this paper, we propose to analyze emotions in an end-to-end manner based on various convolutional neural network (CNN) architectures and attention mechanisms. Specifically, we develop a new framework that integrates spatial and temporal attention into a visual 3D-CNN and temporal attention into an audio 2D-CNN in order to capture intra-modal feature characteristics. Further, the system is extended with an audio-video cross-attention fusion approach that effectively exploits the relationship between the two modalities. The proposed method achieves an accuracy of 87.89% on the RAVDESS dataset. When compared with state-of-the-art methods, our system demonstrates accuracy gains of more than 1.89%. |
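
The audio-video cross-attention fusion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify layer sizes, projections, or the pooling strategy, so everything below is an assumption. The function names (`cross_attend`, `cross_attention_fusion`), the scaled dot-product form without learned projections, and the mean pooling are all hypothetical choices made to keep the example self-contained.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention where one modality queries the other.

    Assumption: no learned query/key/value projections are applied, so each
    output vector is a softmax-weighted average of the context vectors.
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        # Similarity between this query frame and every context frame.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        # Weighted average of the context vectors, one coordinate at a time.
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        )
    return attended


def mean_pool(sequence):
    """Average a sequence of vectors over the time axis."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


def cross_attention_fusion(audio, video):
    """Fuse two feature sequences (hypothetical helper, not from the paper).

    Audio frames attend to video frames and vice versa; both attended
    sequences are mean-pooled and concatenated into one fused vector.
    """
    audio_attended = cross_attend(audio, video)
    video_attended = cross_attend(video, audio)
    return mean_pool(audio_attended) + mean_pool(video_attended)


if __name__ == "__main__":
    # Toy features: 3 audio frames and 2 video frames, each of dimension 4.
    audio_features = [[0.1, 0.2, 0.3, 0.4], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
    video_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    fused = cross_attention_fusion(audio_features, video_features)
    print(len(fused))
```

In a trained system the fused vector would typically feed a classifier over the emotion classes, and the attention would use learned projections; both are omitted here for brevity.
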
Year | DOI | Venue |
---|---|---|
2022 | 10.1109/IVMSP54334.2022.9816349 | 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP) |
Keywords | DocType | ISBN |
spatial attention, temporal attention, cross-fusion, emotion recognition | Conference | 978-1-6654-7823-6 |
Citations | PageRank | References |
0 | 0.34 | 8 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Bogdan Mocanu | 1 | 0 | 0.34 |
Ruxandra Tapu | 2 | 0 | 0.34 |