Abstract | ||
---|---|---|
The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements contributed most to its performance has not yet been clarified. Here, to assess their contributions, we first conducted an element-wise ablation study on our system to estimate to what extent each element is effective. We then conducted a detailed module-wise ablation study to further clarify the key processing modules for improving accuracy. The results show that data augmentation and post-processing significantly improve the score in our system. In particular, mix-up data augmentation and beam search in post-processing improve SPIDEr by 0.8 and 1.6 points, respectively. |
Year | Venue | DocType |
---|---|---|
2020 | DCASE | Conference |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Daiki Takeuchi | 1 | 5 | 3.43 |
Yuma Koizumi | 2 | 41 | 11.75 |
Yasunori Ohishi | 3 | 0 | 2.37 |
Noboru Harada | 4 | 0 | 1.01 |
Kunio Kashino | 5 | 0 | 4.06 |