Title
Cascaded Multilingual Audio-Visual Learning from Videos
Abstract
In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training solely on the Japanese videos. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.
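The cascaded setup described in the abstract reduces, at evaluation time, to cross-modal retrieval in a shared embedding space: audio segments and video clips in a new language (e.g., Japanese) are embedded by the English-pretrained model, and recall@k is computed over the similarity matrix. The sketch below, assuming only NumPy, shows how such retrieval is typically scored; the encoders here are stand-ins (a random linear projection, with simulated paired features), not the authors' model, and all variable names are hypothetical.

```python
# Minimal sketch of audio-to-video retrieval evaluation in a shared
# embedding space. Real audio/visual encoders are replaced by a random
# linear projection over simulated paired features; this illustrates the
# scoring procedure only, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Unit-normalize rows so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose true match ranks in the top k.
    sim[i, j] = similarity of audio i to video j; correct pairs lie
    on the diagonal."""
    ranks = (-sim).argsort(axis=1)            # columns sorted by descending similarity
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Simulated paired clips: each audio segment matches one video clip.
n, d_in, d_emb = 1000, 512, 256
audio_feats = rng.normal(size=(n, d_in))
video_feats = audio_feats + 0.5 * rng.normal(size=(n, d_in))  # correlated pair

# Stand-in for the pretrained model's shared projection into the
# audio-visual embedding space.
proj = rng.normal(size=(d_in, d_emb))
A = l2_normalize(audio_feats @ proj)
V = l2_normalize(video_feats @ proj)
sim = A @ V.T

for k in (1, 5, 10):
    print(f"R@{k}: {recall_at_k(sim, k):.3f}")
```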
Year
2021
DOI
10.21437/Interspeech.2021-1352
Venue
Interspeech
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
11
Name                | Order | Citations | PageRank
Andrew Rouditchenko |     1 |         0 |     2.03
Angie Boggust       |     2 |         0 |     0.68
David F. Harwath    |     3 |        63 |     8.34
Samuel Thomas       |     4 |       536 |    46.88
Hilde Kuehne        |     5 |         0 |     1.01
Brian Chen          |     6 |         3 |     1.39
Rameswar Panda      |     7 |        85 |    14.02
Rogério Feris       |     8 |      1529 |    89.95
B. Kingsbury        |     9 |      4175 |   335.43
Michael Picheny     |    10 |      1461 |   920.15
James Glass         |    11 |      3123 |   413.63