Abstract |
---|
Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/output representations, and complex decoding schemes. In this work, we show that equivalent performance can be achieved using a generic encoder-decoder Transformer with standard decoding methods. We demonstrate that the model can learn to translate spectrogram inputs directly to MIDI-like output events for several transcription tasks. This sequence-to-sequence approach simplifies transcription by jointly modeling audio features and language-like output dependencies, thus removing the need for task-specific architectures. These results point toward possibilities for creating new Music Information Retrieval models by focusing on dataset creation and labeling rather than custom model design. |
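The abstract describes a generic encoder-decoder Transformer that reads spectrogram frames and emits MIDI-like event tokens. The sketch below is not the authors' implementation; it only illustrates the general shape of such a sequence-to-sequence model in PyTorch, and the mel-bin count, vocabulary size, and hyperparameters are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch (assumed values, not the paper's configuration) of a
# spectrogram-to-event-token encoder-decoder Transformer.
import torch
import torch.nn as nn

class SpectrogramToEvents(nn.Module):
    def __init__(self, n_mels=229, vocab_size=1000, d_model=512,
                 nhead=8, num_layers=6, max_len=1024):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, d_model)        # embed spectrogram frames
        self.token_emb = nn.Embedding(vocab_size, d_model)  # embed MIDI-like event tokens
        self.pos_emb = nn.Embedding(max_len, d_model)       # learned positions (assumption)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)           # logits over the event vocabulary

    def forward(self, frames, tokens):
        # frames: (batch, time, n_mels); tokens: (batch, seq) of event ids
        pos_f = torch.arange(frames.size(1), device=frames.device)
        pos_t = torch.arange(tokens.size(1), device=tokens.device)
        src = self.frame_proj(frames) + self.pos_emb(pos_f)
        tgt = self.token_emb(tokens) + self.pos_emb(pos_t)
        # Causal mask so each output token only attends to previous tokens.
        causal = self.transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(frames.device)
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)  # (batch, seq, vocab_size)

# Toy usage: 100 spectrogram frames decoded against 16 target event tokens.
model = SpectrogramToEvents()
logits = model(torch.randn(2, 100, 229), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```

At inference, event tokens would be produced autoregressively with the standard greedy or beam decoding the abstract refers to, rather than any task-specific decoding scheme.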
Year | Venue | DocType |
---|---|---
2021 | ISMIR | Conference |
Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors |
---|
5 |
Name | Order | Citations | PageRank
---|---|---|---
Curtis Hawthorne | 1 | 27 | 4.39 |
Ian Simon | 2 | 675 | 46.26 |
Rigel Swavely | 3 | 0 | 0.34 |
Ethan Manilow | 4 | 0 | 0.68 |
Jesse H. Engel | 5 | 326 | 20.21 |