Title
What does a Car-ssette tape tell?
Abstract
Captioning has attracted much attention in image and video understanding, while little work examines audio captioning. This paper contributes a manually annotated dataset of car scenes, extending a previously published hospital audio captioning dataset. An encoder-decoder model with pretrained word embeddings and an additional sentence loss is proposed. The model accelerates training and can generate semantically correct yet previously unseen, unique sentences. We test the model on the current Car Dataset, the previous Hospital Dataset, and the Joint Dataset, demonstrating its generalization capability across different scenes. Further, we make an effort to provide a better objective evaluation metric, namely the BERT similarity score. It compares sentences at the semantic level and compensates for drawbacks of N-gram-based metrics such as BLEU, which assign high scores to sentences that merely share similar words. This new metric shows a higher correlation with human evaluation. However, although detailed audio captions can now be generated automatically, human annotations still outperform model captions in many aspects.
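As an illustration of the semantic-level metric mentioned in the abstract, the sketch below computes a BERT-based similarity between a generated caption and a reference caption. It is a minimal sketch assuming a bert-base-uncased encoder, mean pooling, and cosine similarity; these are illustrative choices and not necessarily the paper's exact formulation.

```python
# Hedged sketch: one plausible way to compute a BERT-based similarity score
# between a generated caption and a reference caption. Model choice, pooling,
# and aggregation over multiple references may differ from the paper.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)        # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def bert_similarity(candidate: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings, in [-1, 1]."""
    return torch.nn.functional.cosine_similarity(
        embed(candidate), embed(reference)
    ).item()

# Semantically close captions score high even with little word overlap,
# which N-gram metrics like BLEU would penalize.
print(bert_similarity("a car engine is running", "an automobile motor hums"))
```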
Year
2019
Venue
arXiv: Sound
DocType
Journal
Volume
abs/1905.13448
Citations
0
PageRank
0.34
References
0
Authors
4
Name             Order  Citations  PageRank
Xuenan Xu        1      0          2.70
Heinrich Dinkel  2      23         5.79
Mengyue Wu       3      0          4.73
Kai Yu           4      1082       90.58