Title
Modeling phonetic pattern variability in favor of the creation of robust emotion classifiers for real-life applications
Abstract
The role of automatic emotion recognition from speech is growing continuously because of the accepted importance of reacting to the emotional state of the user in human-computer interaction. Most state-of-the-art emotion recognition methods are based on turn- and frame-level analysis independent of the phonetic transcription. Here, we are interested in phoneme-based classification of the level of arousal in acted and spontaneous emotions. First, we show that our previously published classification technique, which achieved strong results in the Interspeech 2009 Emotion Challenge, cannot provide sufficiently good classification in cross-corpora evaluation (a condition close to real-life applications). To assess the robustness of our emotion classification techniques, we use cross-corpora evaluation for a simplified two-class problem, namely high versus low arousal emotions, and define the emotion classes at the phoneme level. We build our speaker-independent emotion classifier with HMMs, using GMM-based production probabilities and MFCC features. This classifier performs equally well with a complete phoneme set and with a reduced set of indicative vowels (7 out of 39 phonemes in the German SAMPA list). We then compare the emotion classification performance of the technique used in the Emotion Challenge with phoneme-based classification within the same experimental setup. With phoneme-level emotion classes, we increase cross-corpora classification performance by about 3.15% absolute (4.69% relative) for models trained on acted emotions (EMO-DB dataset) and evaluated on spontaneous emotions (VAM dataset); under the reverse conditions (trained on VAM, tested on EMO-DB) we obtain a 15.43% absolute (23.20% relative) improvement. We show that phoneme-level emotion classes can improve classification performance even with comparably low speech recognition performance obtained with scant a priori knowledge about the language, implemented as a zero-gram for word-level modeling and a bi-gram for phoneme-level modeling. Finally, we compare our results with the state-of-the-art cross-corpora evaluations on the VAM database. For training our models, we use an almost 15 times smaller training set, consisting of 456 utterances (210 low and 246 high arousal emotions) instead of 6820 utterances (4685 high and 2135 low arousal emotions). Nevertheless, we increase cross-corpora classification performance by about 2.25% absolute (3.22% relative), from UA = 69.7% obtained by Zhang et al. to UA = 71.95%.
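The abstract describes an HMM-based arousal classifier with GMM production probabilities over MFCC features, evaluated across corpora with unweighted average recall (UA). The sketch below is a minimal, much simplified illustration of such a pipeline (per-class GMMs scoring MFCC frames rather than phoneme-level HMMs); the helper `extract_mfcc`, the GMM settings, and the file/label lists are hypothetical assumptions for illustration, not the configuration used in the paper.

```python
# Minimal cross-corpus arousal-classification sketch (illustrative only).
# One GMM per arousal class scores MFCC frames; an utterance is assigned to
# the class with the higher total log-likelihood, and performance is reported
# as unweighted average recall (UA). File lists and labels are hypothetical
# placeholders, not the EMO-DB/VAM partitions from the paper.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture
from sklearn.metrics import recall_score

def extract_mfcc(wav_path, n_mfcc=13):
    """Return a (frames x n_mfcc) matrix of MFCC features for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_arousal_gmms(utterances, labels, n_components=32):
    """Fit one diagonal-covariance GMM per arousal class ('low' / 'high')."""
    gmms = {}
    for cls in ("low", "high"):
        frames = np.vstack([extract_mfcc(u)
                            for u, lab in zip(utterances, labels) if lab == cls])
        gmms[cls] = GaussianMixture(n_components=n_components,
                                    covariance_type="diag").fit(frames)
    return gmms

def classify(gmms, wav_path):
    """Pick the class whose GMM gives the higher total frame log-likelihood."""
    feats = extract_mfcc(wav_path)
    scores = {cls: gmm.score_samples(feats).sum() for cls, gmm in gmms.items()}
    return max(scores, key=scores.get)

# Cross-corpus protocol: train on one corpus, test on another (hypothetical lists).
# gmms = train_arousal_gmms(train_files, train_labels)
# predictions = [classify(gmms, f) for f in test_files]
# UA (unweighted average recall) = macro-averaged recall over both classes:
# ua = recall_score(test_labels, predictions, average="macro")
```

Note that UA, as the macro-averaged recall, weights both arousal classes equally regardless of class imbalance, which is why it is the metric of choice for the cross-corpora comparisons cited in the abstract.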
Year: 2014
DOI: 10.1016/j.csl.2012.11.003
Venue: Computer Speech & Language
Keywords: real-life application, robust emotion classifier, emotion classification performance, cross-corpora classification performance, classification technique, phonetic pattern variability, phoneme-level emotion class, phoneme-based classification, emotion classification technique, cross-corpora evaluation, spontaneous emotion, low arousal emotion, classification performance, emotion classification
Field: Mel-frequency cepstrum, Phonetic transcription, Computer science, A priori and a posteriori, Emotion classification, Robustness (computer science), Artificial intelligence, Natural language processing, Classifier (linguistics), Low arousal theory, Arousal, Speech recognition, Machine learning
DocType: Journal
Volume: 28
Issue: 2
ISSN: 0885-2308
Citations: 18
PageRank: 0.61
References: 50
Authors: 4
Name              | Order | Citations | PageRank
Bogdan Vlasenko   | 1     | 235       | 12.72
Dmytro Prylipko   | 2     | 66        | 4.65
Ronald Böck       | 3     | 99        | 10.75
Andreas Wendemuth | 4     | 451       | 41.74