Title
What does a Car-ssette tape tell?
Abstract
Captioning has attracted much attention in image and video understanding, while little work examines audio captioning. This paper contributes a manually annotated dataset of car scenes, extending a previously published hospital audio captioning dataset. An encoder-decoder model with pretrained word embeddings and an additional sentence loss is proposed. The model accelerates training and can generate semantically correct yet previously unseen, unique sentences. We test the model on the current Car Dataset, the previous Hospital Dataset, and the Joint Dataset, demonstrating its generalization capability across different scenes. Further, we make an effort to provide a better objective evaluation metric, namely the BERT similarity score. It compares sentences at the semantic level and compensates for drawbacks of N-gram-based metrics such as BLEU, which assign high scores to sentences that merely share similar words. This new metric shows a higher correlation with human evaluation. However, although detailed audio captions can now be generated automatically, human annotations still outperform model captions in many aspects.
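As an illustration of the semantic-level metric mentioned in the abstract, the sketch below computes a BERT-based similarity between a generated caption and a reference caption. It is a minimal sketch assuming a bert-base-uncased encoder, mean pooling, and cosine similarity; these are illustrative choices and not necessarily the paper's exact formulation.

```python
# Hedged sketch: one plausible way to compute a BERT-based similarity score
# between a generated caption and a reference caption. Model choice, pooling,
# and aggregation over multiple references may differ from the paper.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)        # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def bert_similarity(candidate: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings, in [-1, 1]."""
    return torch.nn.functional.cosine_similarity(
        embed(candidate), embed(reference)
    ).item()

# Semantically close captions score high even with little word overlap,
# which N-gram metrics like BLEU would penalize.
print(bert_similarity("a car engine is running", "an automobile motor hums"))
```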
Year
2019
Venue
arXiv: Sound
DocType
Journal
Volume
abs/1905.13448
Citations
0
PageRank
0.34
References
0
Authors
4
Name             Order  Citations  PageRank
Xuenan Xu        1      0          2.70
Heinrich Dinkel  2      23         5.79
Mengyue Wu       3      0          4.73
Kai Yu           4      1082       90.58