Abstract |
---|
Audio-visual event localization requires identifying the event that is both visible and audible in a video (at either the frame or video level). To address this task, we propose a deep neural network named Audio-Visual sequence-to-sequence dual network (AVSDN). By jointly taking both audio and visual features at each time segment as inputs, our proposed model learns global and local event information in a sequence-to-sequence manner, which can be realized in either fully supervised or weakly supervised settings. Empirical results confirm that our proposed method performs favorably against recent deep learning approaches in both settings. |
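The abstract describes fusing audio and visual features at each time segment and modeling the fused sequence recurrently to produce per-segment event labels. The toy sketch below illustrates that idea only in outline; the concatenation fusion, the scalar tanh recurrence, and all function names are illustrative assumptions, not the actual AVSDN architecture:

```python
import math

def fuse(audio_feat, visual_feat):
    # Per-segment fusion by concatenation (an assumed, simplified fusion).
    return audio_feat + visual_feat

def rnn_encode(segments, w=0.5, u=0.3):
    # Toy recurrent encoder: each hidden state summarizes segments seen so far.
    h = [0.0] * len(segments[0])
    states = []
    for x in segments:
        h = [math.tanh(w * xi + u * hi) for xi, hi in zip(x, h)]
        states.append(h)
    return states

def localize(states, threshold=0.0):
    # Per-segment event decision: 1 if the mean activation exceeds the threshold.
    return [1 if sum(h) / len(h) > threshold else 0 for h in states]

# Two time segments, each with 2-dim audio and 2-dim visual features.
audio = [[0.9, 0.8], [-0.5, -0.6]]
visual = [[0.7, 0.6], [-0.4, -0.3]]
fused = [fuse(a, v) for a, v in zip(audio, visual)]
labels = localize(rnn_encode(fused))  # one 0/1 event label per segment
```

In the fully supervised setting such per-segment labels would be trained against frame-level annotations; in the weakly supervised setting only a video-level label is available.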
Year | DOI | Venue
---|---|---
2019 | 10.1109/icassp.2019.8683226 | 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Keywords | Field | DocType
---|---|---
Audio-Video Features, Dual Modality, Event Localization, Deep Learning | Pattern recognition, Computer science, Artificial intelligence, Deep learning, Artificial neural network | Conference

Volume | ISSN | Citations
---|---|---
abs/1902.07473 | 1520-6149 | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 3
Name | Order | Citations | PageRank
---|---|---|---
Yan-Bo Lin | 1 | 0 | 1.01
Yu-Jhe Li | 2 | 4 | 1.05
Yu-Chiang Frank Wang | 3 | 914 | 61.63