Abstract |
---|
Time Delay Neural Network (TDNN)-based methods are widely used in dialect identification. However, previous TDNN applications have neglected subtle variations across feature scales. To address this issue, we propose a new architecture, named dynamic multi-scale convolution, which consists of dynamic kernel convolution, local multi-scale learning, and global multi-scale pooling. Dynamic kernel convolution adaptively captures features spanning short-term and long-term context. Local multi-scale learning represents multi-scale features at a granular level and increases the range of receptive fields of the convolution operation. In addition, global multi-scale pooling aggregates features from different bottleneck layers in order to collect information from multiple aspects. The proposed architecture significantly outperforms state-of-the-art systems on the AP20-OLR-dialect-task of the Oriental Language Recognition (OLR) Challenge 2020, with the best average cost performance (Cavg) of 0.067 and the best equal error rate (EER) of 6.52%. Compared with the best known results, our method achieves relative improvements of 9% in Cavg and 45% in EER. Furthermore, the proposed model has 91% fewer parameters than the best known model. |
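The abstract reports its results as an equal error rate and as relative improvements over prior systems. The following is a minimal sketch of how those two quantities are commonly computed, assuming each trial yields one detection score and that higher scores indicate the target dialect. The function names and the toy scores are illustrative only and do not come from the paper; Cavg is left out because its exact definition depends on the challenge's evaluation plan.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss rate: fraction of target trials rejected at this threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False-alarm rate: fraction of non-target trials accepted.
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer


def relative_improvement(baseline, proposed):
    """Relative reduction of an error metric where lower is better."""
    return (baseline - proposed) / baseline


# Toy scores (hypothetical, not taken from the paper).
toy_target = [0.9, 0.8, 0.7, 0.3]
toy_nontarget = [0.6, 0.4, 0.2, 0.1]
toy_eer = equal_error_rate(toy_target, toy_nontarget)
```

A threshold sweep over observed scores gives only a coarse estimate on small trial lists; evaluation toolkits typically interpolate between operating points.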
Year | DOI | Venue |
---|---|---|
2021 | 10.21437/Interspeech.2021-56 | Interspeech |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 9 |
Name | Order | Citations | PageRank |
---|---|---|---|
Tianlong Kong | 1 | 0 | 0.34 |
Shouyi Yin | 2 | 1 | 1.71 |
Dawei Zhang | 3 | 0 | 2.37 |
Wang Geng | 4 | 0 | 0.34 |
Xin Wang | 5 | 194 | 53.80 |
Dandan Song | 6 | 150 | 19.44 |
Jinwen Huang | 7 | 0 | 0.34 |
Huiyu Shi | 8 | 0 | 0.34 |
Xiaorui Wang | 9 | 0 | 1.69 |