Title
Cascaded Multilingual Audio-Visual Learning from Videos
Abstract
In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training solely on the Japanese videos. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.
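The cascaded setup described in the abstract reduces, at evaluation time, to cross-modal retrieval in a shared embedding space: audio segments and video clips in a new language (e.g., Japanese) are embedded by the English-pretrained model, and recall@k is computed over the similarity matrix. The sketch below, assuming only NumPy, shows how such retrieval is typically scored; the encoders here are stand-ins (a random linear projection, with simulated paired features), not the authors' model, and all variable names are hypothetical.

```python
# Minimal sketch of audio-to-video retrieval evaluation in a shared
# embedding space. Real audio/visual encoders are replaced by a random
# linear projection over simulated paired features; this illustrates the
# scoring procedure only, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Unit-normalize rows so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose true match ranks in the top k.
    sim[i, j] = similarity of audio i to video j; correct pairs lie
    on the diagonal."""
    ranks = (-sim).argsort(axis=1)            # columns sorted by descending similarity
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Simulated paired clips: each audio segment matches one video clip.
n, d_in, d_emb = 1000, 512, 256
audio_feats = rng.normal(size=(n, d_in))
video_feats = audio_feats + 0.5 * rng.normal(size=(n, d_in))  # correlated pair

# Stand-in for the pretrained model's shared projection into the
# audio-visual embedding space.
proj = rng.normal(size=(d_in, d_emb))
A = l2_normalize(audio_feats @ proj)
V = l2_normalize(video_feats @ proj)
sim = A @ V.T

for k in (1, 5, 10):
    print(f"R@{k}: {recall_at_k(sim, k):.3f}")
```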
Year
2021
DOI
10.21437/Interspeech.2021-1352
Venue
Interspeech
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
11
Name                | Order | Citations | PageRank
Andrew Rouditchenko |     1 |         0 |     2.03
Angie Boggust       |     2 |         0 |     0.68
David F. Harwath    |     3 |        63 |     8.34
Samuel Thomas       |     4 |       536 |    46.88
Hilde Kuehne        |     5 |         0 |     1.01
Brian Chen          |     6 |         3 |     1.39
Rameswar Panda      |     7 |        85 |    14.02
Rogério Feris       |     8 |      1529 |    89.95
B. Kingsbury        |     9 |      4175 |   335.43
Michael Picheny     |    10 |      1461 |   920.15
James Glass         |    11 |      3123 |   413.63