Title
Multi-Modal Audio, Video and Physiological Sensor Learning for Continuous Emotion Prediction.
Abstract
The automatic determination of emotional state from multimedia content is an inherently challenging problem with a broad range of applications, including biomedical diagnostics, multimedia retrieval, and human-computer interfaces. The Audio/Visual Emotion Challenge (AVEC) 2016 provides a well-defined framework for developing and rigorously evaluating innovative approaches for estimating the arousal and valence states of emotion as a function of time. It presents the opportunity to investigate multimodal solutions that combine audio, video, and physiological sensor signals. This paper provides an overview of our AVEC Emotion Challenge system, which uses multi-feature learning and fusion across all available modalities. It makes several technical contributions, including novel high- and low-level features for modeling emotion in the audio, video, and physiological channels. Low-level features include minimal prosody-based descriptors for modeling arousal in audio. High-level features are derived from supervised and unsupervised machine learning approaches based on sparse coding and deep learning. Finally, a state-space estimation approach is applied to score fusion, demonstrating the importance of exploiting the time-series nature of the arousal and valence states. The resulting system outperforms the baseline systems [10] on the test set, achieving a Concordance Correlation Coefficient (CCC) of 0.770 versus 0.702 (baseline) for arousal and 0.687 versus 0.638 for valence. Future work will focus on exploiting the time-varying nature of individual channels in the multimodal framework.
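The abstract reports results as CCC and credits a state-space score-fusion step. The Python sketch below is illustrative only. `ccc` implements Lin's concordance correlation coefficient using population (ddof=0) statistics. `kalman_fuse` is a hypothetical scalar random-walk Kalman filter that treats each modality's per-frame score as an independent noisy observation of the latent arousal or valence value. The abstract does not specify the authors' state-space model, so the random-walk dynamics, the noise variances, and both function names are assumptions rather than the system described in the paper.

```python
import numpy as np


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient, using population statistics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()  # ddof=0 (assumed convention)
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    # Penalizes both weak correlation and any scale or location mismatch.
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


def kalman_fuse(preds, obs_var, process_var=1e-3, x0=0.0, p0=1.0):
    """Hypothetical score fusion with a scalar random-walk Kalman filter.

    preds:   array of shape (n_modalities, n_frames), per-modality score streams
    obs_var: per-modality observation noise variances, shape (n_modalities,)
    """
    preds = np.asarray(preds, dtype=float)
    obs_var = np.asarray(obs_var, dtype=float)
    n_mod, n_frames = preds.shape
    x, p = x0, p0  # state estimate and its variance
    fused = np.empty(n_frames)
    for t in range(n_frames):
        # Predict: random-walk state, so the mean carries over and uncertainty grows.
        p = p + process_var
        # Update: fold in each modality's score as an independent noisy observation.
        for m in range(n_mod):
            k = p / (p + obs_var[m])  # Kalman gain for this modality
            x = x + k * (preds[m, t] - x)
            p = (1.0 - k) * p
        fused[t] = x
    return fused


if __name__ == "__main__":
    # Synthetic demo: three noisy "modalities" tracking one smooth trajectory.
    rng = np.random.default_rng(0)
    t = np.linspace(0, 4 * np.pi, 500)
    truth = 0.5 * np.sin(t)
    preds = np.stack([truth + rng.normal(0, s, t.size) for s in (0.3, 0.4, 0.5)])
    fused = kalman_fuse(preds, obs_var=[0.09, 0.16, 0.25])
    print("per-modality CCC:", [round(ccc(truth, p), 3) for p in preds])
    print("fused CCC:", round(ccc(truth, fused), 3))
```

Lower `obs_var` values make the filter trust a modality more, while `process_var` controls how quickly the fused trace may move, trading smoothness against lag. Exploiting this temporal continuity is what lets the fused trace score higher than any single noisy stream in the synthetic demo.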
Year
2016
DOI
10.1145/2988257.2988264
Venue
AVEC@ACM Multimedia
Keywords
Affective Computing, Emotion Recognition, Speech, Deep Learning, CNN, Sparse Coding, Facial Expression, Challenge
Field
Computer science, Unsupervised learning, Artificial intelligence, Deep learning, Arousal, Computer vision, Neural coding, Speech recognition, Facial expression, Affective computing, State space, Machine learning, Modal
DocType
Conference
Citations
18
PageRank
0.62
References
7
Authors
7
Name | Order | Citations | PageRank
Kevin Brady | 1 | 156 | 21.40
Youngjune Gwon | 2 | 279 | 27.58
Pooya Khorrami | 3 | 118 | 6.27
Elizabeth Godoy | 4 | 83 | 4.96
William M. Campbell | 5 | 799 | 70.38
Charlie K. Dagli | 6 | 140 | 8.44
Thomas S. Huang | 7 | 27815 | 2618.42