Effects of Word-Frequency Based Pre- and Post- Processings for Audio Captioning. - Citegraph

Paper Info

Title
Effects of Word-Frequency Based Pre- and Post- Processings for Audio Captioning.

Abstract
The system we used for Task 6 (Automated Audio Captioning) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge combines three elements, namely, data augmentation, multi-task learning, and post-processing, for audio captioning. The system received the highest evaluation scores, but which of the individual elements contributed most to its performance has not yet been clarified. Here, to assess their contributions, we first conducted an element-wise ablation study on our system to estimate to what extent each element is effective. We then conducted a detailed module-wise ablation study to further clarify the key processing modules for improving accuracy. The results show that data augmentation and post-processing significantly improve the score in our system. In particular, mix-up data augmentation and beam search in post-processing improve SPIDEr by 0.8 and 1.6 points, respectively.

Year	Venue	DocType
2020	DCASE	Conference
Citations	PageRank	References
0	0.34	0
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Daiki Takeuchi	1	5	3.43
Yuma Koizumi	2	41	11.75
Yasunori Ohishi	3	0	2.37
Noboru Harada	4	0	1.01
Kunio Kashino	5	0	4.06
