Title
Exploring Multimodal Features and Fusion for Time-Continuous Prediction of Emotional Valence and Arousal
Abstract
Advances in machine learning and deep learning make it possible to detect and analyse emotion and sentiment from textual and audio-visual information with increasing effectiveness. Recently, interest has emerged in also applying these techniques to the assessment of mental health, including the detection of stress and depression. In this paper, we introduce an approach that predicts stress (emotional valence and arousal) in a time-continuous manner from audio-visual recordings, testing the effectiveness of different deep learning techniques and various features. Specifically, apart from adopting popular features (e.g., BERT, BPM, ECG, and VGGFace), we explore the use of new features, both engineered and learned, across different modalities to improve the effectiveness of time-continuous stress prediction: for video, we study the use of ResNet-50 features and of body and pose features extracted through OpenPose, whereas for audio, we primarily investigate the use of Integrated Linear Prediction Residual (ILPR) features. The best results we achieved were combined CCC values of 0.7595 and 0.3379 on the development set and the test set of MuSe-Stress 2021, respectively.
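The CCC figures reported above refer to the Concordance Correlation Coefficient, the standard evaluation metric of the MuSe-Stress challenge. As a point of reference, a minimal NumPy sketch of Lin's CCC (the function name and use of population statistics are our own assumptions, not taken from the paper):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (Lin, 1989).

    Hypothetical helper for illustration; measures agreement between a
    gold-standard signal and a predicted signal on a [-1, 1] scale.
    """
    x = np.asarray(y_true, dtype=float)
    y = np.asarray(y_pred, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population variances
    cov = ((x - mx) * (y - my)).mean()   # population covariance
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Perfect agreement between prediction and gold standard gives CCC = 1.0
print(ccc([0.1, 0.4, 0.7], [0.1, 0.4, 0.7]))  # → 1.0
```

Unlike Pearson correlation, CCC also penalises shifts in mean and scale, which is why it is favoured for time-continuous valence/arousal prediction.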
Year
2021
DOI
10.1007/978-3-030-98404-5_65
Venue
Intelligent Human Computer Interaction
Keywords
Emotion detection, Excitation source features, Human pose, LP analysis, Multimodal fusion, Multimodal sentiment analysis
DocType
Conference
Volume
13184
ISSN
0302-9743
Citations
0
PageRank
0.34
References
0
Authors
10