Title
Emotion Recognition With Multimodal Transformer Fusion Framework Based on Acoustic and Lexical Information
Abstract
People usually express emotions through both paralinguistic and linguistic information in speech, and effectively integrating the two for emotion recognition remains a challenge. Previous studies have commonly adopted bidirectional long short-term memory (BLSTM) networks to extract acoustic and lexical representations, followed by a concatenation layer. However, such simple per-sentence feature fusion makes it difficult to capture the interaction and mutual influence between modalities. In this article, we propose an implicitly aligned multimodal transformer fusion (IA-MMTF) framework based on acoustic features and text information. The model enables the two modalities to guide and complement each other when learning emotional representations; a weighted fusion then controls the contribution of each modality, yielding more complementary emotional representations. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) database and the multimodal EmotionLines dataset (MELD) show that the proposed method outperforms the baseline BLSTM-based method.
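As a rough illustration of the fusion scheme sketched in the abstract, the following PyTorch listing pairs two cross-modal attention blocks (audio attending to text and vice versa) with a learned scalar gate for weighted fusion. All names, dimensions, and layer choices (CrossModalFusion, gate, pooling by mean, and so on) are illustrative assumptions, not the paper's actual IA-MMTF configuration.

    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        # Illustrative cross-modal transformer fusion with a weighted combination;
        # hyperparameters are placeholders, not the paper's settings.
        def __init__(self, dim=256, heads=4, num_classes=4):
            super().__init__()
            # Each modality queries the other (cross-modal attention).
            self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
            # Scalar gate that weighs the two modality representations.
            self.gate = nn.Linear(2 * dim, 1)
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, audio, text):
            # audio: (batch, T_a, dim) acoustic frames; text: (batch, T_l, dim) token embeddings
            a_enh, _ = self.audio_to_text(audio, text, text)    # audio guided by text
            t_enh, _ = self.text_to_audio(text, audio, audio)   # text guided by audio
            a_vec, t_vec = a_enh.mean(dim=1), t_enh.mean(dim=1) # pool over time
            w = torch.sigmoid(self.gate(torch.cat([a_vec, t_vec], dim=-1)))
            fused = w * a_vec + (1.0 - w) * t_vec               # weighted fusion
            return self.classifier(fused)

    # Toy usage with random tensors standing in for real features.
    model = CrossModalFusion()
    logits = model(torch.randn(8, 100, 256), torch.randn(8, 30, 256))  # -> (8, 4)

The gate lets the network shift weight toward whichever modality carries the stronger emotional cue for a given utterance, rather than fixing the contributions by plain concatenation.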
Year: 2022
DOI: 10.1109/MMUL.2022.3161411
Venue: IEEE MultiMedia
Keywords: linguistic information, paralinguistic information, emotion recognition, short-term memory network, acoustic representations, lexical representations, concatenation layer, simple feature fusion, implicitly aligned multimodal transformer fusion framework, acoustic features, text information, weighted fusion, complementary emotional representations, multimodal EmotionLines dataset, baseline BLSTM-based method, acoustic lexical information
DocType: Journal
Volume: 29
Issue: 2
ISSN: 1070-986X
Citations: 0
PageRank: 0.34
References: 6
Authors: 6
Name            Order   Citations   PageRank
Lili Guo        1       0           0.34
Longbiao Wang   2       272         44.38
Jianwu Dang     3       0           1.69
Yahui Fu        4       0           1.01
Jia-Xing Liu    5       1           5.08
Shifei Ding     6       1074        94.63