Title |
---|
SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech |
Abstract |
---|
Transformer has obtained promising results in the field of cognitive speech signal processing, which is of interest in applications ranging from emotion analysis to neurocognitive disorder analysis. However, most works treat the speech signal as a whole, neglecting the pronunciation structure that is unique to speech and reflects the cognitive process. Meanwhile, Transformer carries a heavy computational burden due to its full attention. In this paper, a hierarchical, efficient framework that considers the structural characteristics of speech, called SpeechFormer, is proposed to serve as a general-purpose backbone for cognitive speech signal processing. SpeechFormer consists of frame, phoneme, word and utterance stages in succession to imitate the structural pattern of speech. Moreover, a modified attention is applied in each stage to learn a stage-specific feature. This hierarchical architecture models the speech signal by its nature and greatly reduces the computational complexity as well. SpeechFormer is evaluated on speech emotion recognition (IEMOCAP & MELD) and neurocognitive disorder detection (Pitt & DAIC-WOZ) tasks, and the results show that it outperforms the Transformer-based framework while maintaining high computational efficiency. Furthermore, SpeechFormer achieves results comparable to state-of-the-art approaches. |
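The abstract's idea of successive frame, phoneme, word and utterance stages, each applying a locally restricted attention and then merging units into the next coarser level, can be sketched as follows. This is a minimal illustration under assumptions: the window sizes, merge factors, and function names below are placeholders, not the paper's actual configuration, and the paper's "modified attention" is approximated here by plain windowed scaled dot-product attention.

```python
import numpy as np

def block_attention(x, window):
    """Self-attention restricted to non-overlapping windows of `window`
    units; limiting the attention span is what reduces the quadratic
    cost of full attention (window sizes here are illustrative)."""
    n, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, n, window):
        blk = x[start:start + window]                 # (w, d)
        scores = blk @ blk.T / np.sqrt(d)             # scaled dot-product
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + window] = weights @ blk
    return out

def merge(x, factor):
    """Average-pool consecutive units to move up one linguistic level
    (frame -> phoneme -> word); the pooling scheme is an assumption."""
    n, d = x.shape
    n_trim = (n // factor) * factor
    return x[:n_trim].reshape(-1, factor, d).mean(axis=1)

def speechformer_sketch(frames):
    """Hypothetical four-stage pipeline mirroring the abstract's
    frame -> phoneme -> word -> utterance progression."""
    x = block_attention(frames, window=4)   # frame stage: local attention
    x = merge(x, 4)                         # frames -> "phoneme" units
    x = block_attention(x, window=4)        # phoneme stage
    x = merge(x, 4)                         # phonemes -> "word" units
    x = block_attention(x, window=2)        # word stage
    return x.mean(axis=0)                   # utterance stage: global summary

rng = np.random.default_rng(0)
utt = speechformer_sketch(rng.standard_normal((64, 8)))
print(utt.shape)  # a single utterance-level vector of feature dim 8
```

Each stage only attends within a small window, so the cost per stage is linear in sequence length rather than quadratic, and the merge steps shrink the sequence before the next stage runs.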
Year | DOI | Venue |
---|---|---|
2022 | 10.21437/INTERSPEECH.2022-74 | Conference of the International Speech Communication Association (INTERSPEECH) |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Weidong Chen | 1 | 0 | 0.68 |
Xiaofen Xing | 2 | 0 | 0.68 |
Xiangmin Xu | 3 | 100 | 17.62 |
Jianxin Pang | 4 | 0 | 0.34 |
Lan Du | 5 | 0 | 0.68 |