Title
HIERARCHICAL NETWORK BASED ON THE FUSION OF STATIC AND DYNAMIC FEATURES FOR SPEECH EMOTION RECOGNITION
Abstract
Many studies on automatic speech emotion recognition (SER) have been devoted to extracting meaningful emotional features for generating emotion-relevant representations. However, they generally ignore the complementary learning of static and dynamic features, leading to limited performances. In this paper, we propose a novel hierarchical network called HNSD that can efficiently integrate the static and dynamic features for SER. Specifically, the proposed HNSD framework consists of three different modules. To capture the discriminative features, an effective encoding module is firstly designed to simultaneously encode both static and dynamic features. By taking the obtained features as inputs, the Gated Multi-features Unit (GMU) is conducted to explicitly determine the emotional intermediate representations for frame-level features fusion, instead of directly fusing these acoustic features. In this way, the learned static and dynamic features can jointly and comprehensively generate the unified feature representations. Benefiting from a well-designed attention mechanism, the last classification module is applied to predict the emotional states at the utterance level. Extensive experiments on the IEMOCAP benchmark dataset demonstrate the superiority of our method in comparison with state-of-the-art baselines.
Year
DOI
Venue
2021
10.1109/ICASSP39728.2021.9414540
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords
DocType
Citations 
Speech Emotion Recognition, Static Features, Dynamic Features, Hierarchical Network
Conference
0
PageRank 
References 
Authors
0.34
0
5
Name
Order
Citations
PageRank
Qi Cao101.01
Mixiao Hou201.01
Bingzhi Chen341.75
Zheng Zhang4315.81
Lu Guangming578662.39