Title: Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities
Abstract
Emotion recognition from multimodal data (e.g., audio, video, and text) is a demanding and important research field with numerous applications. In this context, this work conducts a rigorous exploration of model-level fusion to identify an optimal multimodal emotion recognition model using audio and video modalities. More specifically, separate novel feature extractor networks are proposed for the audio and video data. An optimal multimodal emotion recognition model is then created by fusing the audio and video features at the model level. The performance of the proposed models is assessed on two benchmark multimodal datasets, the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS) and the Surrey Audio–Visual Expressed Emotion (SAVEE) dataset, using various performance metrics. The proposed models achieve high predictive accuracies of 99% and 86% on the SAVEE and RAVDESS datasets, respectively. The effectiveness of the models is further verified by comparing their performance with existing emotion recognition models. Case studies are also conducted to explore the models' ability to capture the variability of speakers' emotional states in publicly available real-world audio–visual media.
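The abstract describes model-level fusion: each modality is processed by its own feature extractor, and the resulting embeddings are combined inside the model before a shared classifier. A minimal NumPy sketch of that idea follows; the layer sizes, input dimensions, and the eight-class output are illustrative assumptions, not the paper's actual networks.

```python
import numpy as np

# Illustrative sketch only -- NOT the paper's proposed architecture.
# Model-level fusion: each modality gets its own feature extractor, and the
# per-modality embeddings are concatenated before a joint emotion classifier.
rng = np.random.default_rng(0)
N_CLASSES = 8  # e.g., the eight RAVDESS emotion labels (assumption)

def dense_relu(x, w, b):
    """One fully connected layer with a ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

# Randomly initialised toy weights (placeholders for trained extractors).
w_a, b_a = rng.standard_normal((40, 64)), np.zeros(64)    # audio branch
w_v, b_v = rng.standard_normal((512, 64)), np.zeros(64)   # video branch
w_h, b_h = rng.standard_normal((128, N_CLASSES)), np.zeros(N_CLASSES)

def fused_logits(audio_feats, video_feats):
    """Concatenate the two modality embeddings, then classify jointly."""
    z = np.concatenate([dense_relu(audio_feats, w_a, b_a),
                        dense_relu(video_feats, w_v, b_v)], axis=-1)
    return z @ w_h + b_h

batch_audio = rng.standard_normal((2, 40))    # e.g., 40 audio features per clip
batch_video = rng.standard_normal((2, 512))   # e.g., a 512-d frame embedding
print(fused_logits(batch_audio, batch_video).shape)  # (2, 8)
```

Because fusion happens on learned embeddings rather than on raw inputs (early fusion) or on per-modality predictions (late/decision-level fusion), the joint classifier can exploit cross-modal interactions between the audio and video representations.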
Year: 2022
DOI: 10.1016/j.knosys.2022.108580
Venue: Knowledge-Based Systems
Keywords: Multimodal emotion recognition, Audio features, Video features, Classification, Deep learning
DocType: Journal
Volume: 244
Issue: 12
ISSN: 0950-7051
Citations: 2
PageRank: 0.40
References: 0
Authors: 3
Name                Order   Citations   PageRank
Asif Iqbal Middya   1       2           0.40
Baibhav Nag         2       2           0.40
Sarbani Roy         3       2           1.08