Title
Video emotion recognition in the wild based on fusion of multimodal features
Abstract
In this paper, we present our methods for the Audio-Video Based Emotion Recognition subtask of the 2016 Emotion Recognition in the Wild (EmotiW) Challenge. The task is to predict one of seven basic emotions for the characters in video clips extracted from movies or TV shows. In our approach, we explore various multimodal features from the audio, facial image, and video motion modalities. The audio features comprise statistical acoustic features, MFCC Bag-of-Audio-Words, and MFCC Fisher Vectors. For image-related features, we extract both hand-crafted features (LBP-TOP and SPM Dense SIFT) and learned features (CNN features). The improved Dense Trajectory is used as the motion-related feature. We train SVM, Random Forest, and Logistic Regression classifiers for each kind of feature. Among them, the MFCC Fisher Vector is the best acoustic feature, and the facial CNN feature is the most discriminative feature for emotion recognition. We use late fusion to combine the different modality features and achieve 50.76% accuracy on the testing set, which significantly outperforms the baseline test accuracy of 40.47%.
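The abstract does not include code, but the pipeline it describes (one classifier per feature type, with per-modality scores combined by weighted late fusion) can be sketched as below. This is a minimal illustration, not the authors' implementation: the feature names, fusion weights, and the choice of scikit-learn's SVC are placeholder assumptions.

# Minimal sketch of per-feature classifiers with weighted late fusion.
# Feature matrices, labels, and fusion weights are hypothetical placeholders.
import numpy as np
from sklearn.svm import SVC

def train_per_feature(features, labels):
    """Fit one probabilistic classifier per feature type (e.g. 'mfcc_fv', 'cnn')."""
    models = {}
    for name, X in features.items():
        # SVM with probability outputs so scores can be fused later;
        # the paper also trains Random Forest and Logistic Regression classifiers.
        clf = SVC(probability=True)
        clf.fit(X, labels)
        models[name] = clf
    return models

def late_fusion_predict(models, features, weights):
    """Weighted average of per-feature class probabilities, then argmax."""
    fused = sum(weights[name] * clf.predict_proba(features[name])
                for name, clf in models.items())
    return np.argmax(fused, axis=1)  # index of one of the seven emotions

In practice, the per-feature fusion weights would be tuned on the validation set before evaluating on the test set.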
Year
2016
DOI
10.1145/2993148.2997629
Venue
ICMI
Keywords
Video Emotion Recognition, Multimodal Features, CNN, Late Fusion
Field
Scale-invariant feature transform, Mel-frequency cepstrum, Computer science, Fusion, Emotion classification, Artificial intelligence, Random forest, Discriminative model, Trajectory, Computer vision, Pattern recognition, Support vector machine, Speech recognition
DocType
Conference
Citations
5
PageRank
0.39
References
22
Authors
5
Name           Order   Citations   PageRank
Shizhe Chen    1       238         21.83
Xinrui Li      2       5           0.73
Qin Jin        3       639         66.86
Shilei Zhang   4       57          9.81
Yong Qin       5       161         42.54