Title | ||
---|---|---|
Not All Frames Are Equal: Weakly-Supervised Video Grounding With Contextual Similarity And Visual Clustering Losses |
Abstract | ||
---|---|---|
We investigate the problem of weakly-supervised video grounding, where only video-level sentences are provided. This is a challenging task, and previous Multi-Instance Learning (MIL) based image grounding methods tend to fail in the video domain. Recent work attempts to decompose the video-level MIL into frame-level MIL by applying a weighted sentence-frame ranking loss over frames, but it is not robust and does not exploit the rich temporal information in videos. In this work, we address these issues by extending frame-level MIL with a false positive frame-bag constraint and modeling the visual feature consistency in the video. Specifically, we design a contextual similarity between semantic and visual features to deal with sparse object association across frames. Furthermore, we leverage temporal coherence by strengthening the clustering effect of similar features in the visual space. We conduct an extensive evaluation on the YouCookII and RoboWatch datasets and demonstrate that our method significantly outperforms prior state-of-the-art methods. |
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/CVPR.2019.01069 | 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) |
Field | DocType | ISSN |
Computer vision, Pattern recognition, Computer science, Ground, Artificial intelligence, Cluster analysis | Conference | 1063-6919 |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jing Shi | 1 | 0 | 2.37 |
Jia Xu | 2 | 146 | 7.45 |
Boqing Gong | 3 | 685 | 33.29 |
Chenliang Xu | 4 | 434 | 28.73 |