Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions.

Paper Info

Title
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions.

Abstract
We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.

Year	DOI	Venue
2021	10.1109/ICCV48922.2021.00186	ICCV

DocType	Citations	PageRank
Conference	0	0.34

References	Authors
0	5

Authors (5 rows)

Name	Order	Citations	PageRank
Shuang Li	1	10	6.45
Yilun Du	2	0	0.34
Antonio Torralba	3	14607	956.27
Josef Sivic	4	9653	513.44
Bryan C. Russell	5	2570	217.78
