Title |
---|
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions |
Abstract |
---|
We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ICCV48922.2021.00186 | ICCV |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Shuang Li | 1 | 10 | 6.45 |
Yilun Du | 2 | 0 | 0.34 |
Antonio Torralba | 3 | 14607 | 956.27 |
Josef Sivic | 4 | 9653 | 513.44 |
Bryan C. Russell | 5 | 2570 | 217.78 |