Abstract | ||
---|---|---|
The sound of crashing waves, the roar of fast-moving cars: sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. |
Year | DOI | Venue |
---|---|---|
2016 | 10.1007/978-3-319-46448-0_48 | COMPUTER VISION - ECCV 2016, PT I |
Keywords | DocType | Volume |
---|---|---|
Sound, Convolutional networks, Unsupervised learning | Conference | 9905 |
ISSN | Citations | PageRank |
---|---|---|
0302-9743 | 49 | 1.42 |
References | Authors |
---|---|
22 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Andrew Owens | 1 | 74 | 5.13 |
Jiajun Wu | 2 | 814 | 47.77 |
Josh H. McDermott | 3 | 69 | 8.28 |
William T. Freeman | 4 | 17382 | 1968.76 |
Antonio Torralba | 5 | 14607 | 956.27 |