Abstract |
---|
Audio classification is an important task in machine learning with a wide range of applications. Over the last decade, deep learning based methods have been widely adopted, and transformer-based models are becoming the new paradigm for audio classification. In this paper, we present Spectrogram Transformers, a group of transformer-based models for audio classification. Based on the fundamental semantics of the audio spectrogram, we design two mechanisms to extract temporal and frequency features from the spectrogram, named time-dimension sampling and frequency-dimension sampling. These discriminative representations are then enhanced by various combinations of attention block architectures, including Temporal Only (TO) attention, Temporal-Frequency Sequential (TFS) attention, Temporal-Frequency Parallel (TFP) attention, and Two-Stream Temporal-Frequency (TSTF) attention, to extract the sound record signatures that serve the classification task. Our experiments demonstrate that these transformer models outperform state-of-the-art methods on the ESC-50 dataset without a pre-training stage. Furthermore, our method also shows high efficiency compared with other leading methods. |
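As a rough illustration of the two-stream idea described in the abstract, the sketch below builds tokens along the time axis (time-dimension sampling) and along the frequency axis (frequency-dimension sampling) of a spectrogram, encodes each stream with a standard transformer, and fuses the pooled features for classification, in the spirit of the TSTF variant. This is a minimal sketch and not the authors' implementation: the tensor shapes, layer sizes, mean pooling, and fusion by concatenation are all assumptions.

```python
# Minimal sketch (not the paper's code) of a two-stream temporal-frequency
# transformer over a spectrogram. All hyperparameters are assumed.
import torch
import torch.nn as nn

class TwoStreamSpectrogramTransformer(nn.Module):
    def __init__(self, n_freq_bins=128, n_time_frames=256,
                 d_model=192, n_heads=4, n_layers=2, n_classes=50):
        super().__init__()
        # Time-dimension sampling: each time frame (all freq bins) -> one token.
        self.time_proj = nn.Linear(n_freq_bins, d_model)
        # Frequency-dimension sampling: each freq bin (all time frames) -> one token.
        self.freq_proj = nn.Linear(n_time_frames, d_model)

        def encoder():
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads,
                dim_feedforward=4 * d_model, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)

        self.temporal_stream = encoder()   # attention over time tokens
        self.frequency_stream = encoder()  # attention over frequency tokens
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, spec):  # spec: (batch, freq_bins, time_frames)
        time_tokens = self.time_proj(spec.transpose(1, 2))  # (B, T, d)
        freq_tokens = self.freq_proj(spec)                  # (B, F, d)
        t = self.temporal_stream(time_tokens).mean(dim=1)   # pooled temporal feature
        f = self.frequency_stream(freq_tokens).mean(dim=1)  # pooled frequency feature
        return self.head(torch.cat([t, f], dim=-1))         # class logits

# Usage: a batch of 8 mel spectrograms, 128 bins x 256 frames,
# classified into the 50 ESC-50 categories.
model = TwoStreamSpectrogramTransformer()
logits = model(torch.randn(8, 128, 256))  # -> shape (8, 50)
```

The single-stream variants named in the abstract (TO, TFS, TFP) would presumably reuse the same token-building step and differ only in whether the temporal and frequency attention blocks are used alone, chained, or run in parallel.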
Year | DOI | Venue |
---|---|---|
2022 | 10.1109/IST55454.2022.9827729 | 2022 IEEE International Conference on Imaging Systems and Techniques (IST) |
Keywords | DocType | ISSN
---|---|---|
Transformer, Spectrogram, Audio representation, Audio classification | Conference | 1558-2809
ISBN | Citations | PageRank
---|---|---|
978-1-6654-8103-8 | 0 | 0.34
References | Authors
---|---|
10 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Yixiao Zhang | 1 | 0 | 0.34 |
Baihua Li | 2 | 176 | 21.71 |
Hui Fang | 3 | 0 | 1.01 |
Qinggang Meng | 4 | 273 | 23.54 |