Abstract
---

Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is trained using only video-level event labels without any timing information. An important feature of our method is its capacity to learn from unsynchronized audio-visual events. We achieve state-of-the-art results on a large-scale dataset of weakly-labeled audio event videos. Visualizations of localized visual regions and audio segments substantiate our system's efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
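The paper's exact architecture is not given on this page; as a rough illustration of the multiple-instance-learning setup the abstract describes (per-segment instance scores pooled into a single video-level prediction, trained with only video-level labels), here is a minimal PyTorch sketch. All feature dimensions, the max-pooling choice, and the late-fusion step are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn as nn

class AudioVisualMIL(nn.Module):
    """Sketch of weakly-supervised multiple instance learning over
    audio segments and visual regions: each segment ("instance") is
    scored per class, and scores are pooled into one video-level
    ("bag") prediction, so only video-level labels are needed."""

    def __init__(self, audio_dim=128, visual_dim=512, num_classes=10):
        super().__init__()
        # Per-instance classifiers (dimensions are assumptions).
        self.audio_head = nn.Linear(audio_dim, num_classes)
        self.visual_head = nn.Linear(visual_dim, num_classes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, n_audio_segments, audio_dim)
        # visual_feats: (batch, n_visual_regions, visual_dim)
        audio_scores = self.audio_head(audio_feats)     # instance scores
        visual_scores = self.visual_head(visual_feats)
        # Max-pool over instances: a bag is positive if at least one
        # instance is. Pooling each stream independently means audio
        # and visual evidence need not be synchronized in time.
        audio_bag = audio_scores.max(dim=1).values
        visual_bag = visual_scores.max(dim=1).values
        video_logits = audio_bag + visual_bag           # assumed late fusion
        # Instance scores double as localization cues at test time.
        return video_logits, audio_scores, visual_scores

# Training uses only video-level event labels, no timing information:
model = AudioVisualMIL()
audio = torch.randn(2, 20, 128)   # e.g., 20 one-second audio segments
visual = torch.randn(2, 36, 512)  # e.g., 36 spatial region proposals
logits, a_scores, v_scores = model(audio, visual)
loss = nn.functional.cross_entropy(logits, torch.tensor([3, 7]))
loss.backward()
```

Because each stream is pooled over its own instances before fusion, a positive audio segment and a positive visual region need not co-occur, which mirrors the abstract's point about learning from unsynchronized audio-visual events.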
Year | Venue | DocType
---|---|---
2018 | CVPR Workshops | Conference

Volume | Citations | PageRank
---|---|---
abs/1804.07345 | 0 | 0.34

References | Authors
---|---
0 | 6

Name | Order | Citations | PageRank |
---|---|---|---
Sanjeel Parekh | 1 | 3 | 2.48 |
Slim Essid | 2 | 212 | 32.00 |
Alexey Ozerov | 3 | 637 | 37.14 |
Ngoc Q. K. Duong | 4 | 288 | 21.11 |
Patrick Pérez | 5 | 6529 | 391.34 |
Gaël Richard | 6 | 1220 | 110.40 |