Title
Human Object Interaction Detection via Multi-level Conditioned Network
Abstract
As one of the essential problems in scene understanding, human object interaction detection (HOID) aims to recognize fine-grained, object-specific human actions, which demands capabilities in both visual perception and reasoning. Existing methods based on convolutional neural networks (CNNs) utilize diverse visual features for HOID, but these features alone are insufficient for understanding complex human object interactions. To enhance the reasoning capability of CNNs, we propose a novel multi-level conditioned network that fuses extra spatial-semantic knowledge with visual features. Specifically, we construct a multi-branch CNN as the backbone for multi-level visual representation. We then encode extra knowledge, including human body structure and object context, as conditions that dynamically influence the feature extraction of the CNN through affine transformation and an attention mechanism. Finally, we fuse the modulated multimodal features to distinguish the interactions. The proposed method is evaluated on the two most frequently used benchmarks, HICO-DET and V-COCO. The experimental results show that our method outperforms state-of-the-art methods.
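The abstract names two conditioning mechanisms, affine transformation and attention, without giving their exact form. The following is a minimal sketch of what such conditioning can look like in general, not the authors' implementation. It assumes a per-channel scale and shift predicted linearly from a condition vector, and dot-product attention over region features. All function names, weight matrices, and shapes here are hypothetical.

```python
import math


def affine_modulate(features, condition, w_gamma, w_beta):
    """Scale and shift each feature channel using a condition vector.

    Assumption: the scale (gamma) and shift (beta) of each channel are
    linear functions of the condition; the paper may use another form.
    """
    gammas = [sum(w * c for w, c in zip(row, condition)) for row in w_gamma]
    betas = [sum(w * c for w, c in zip(row, condition)) for row in w_beta]
    return [g * f + b for f, g, b in zip(features, gammas, betas)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(region_features, condition):
    """Pool region features, weighting each region by its dot-product
    similarity to the condition vector (hypothetical attention form).
    Assumes the condition has the same length as each region vector."""
    scores = [
        sum(f * c for f, c in zip(region, condition))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    return [
        sum(w * region[i] for w, region in zip(weights, region_features))
        for i in range(dim)
    ]


# Toy usage: two feature channels, a two-dimensional condition vector.
modulated = affine_modulate(
    features=[1.0, 2.0],
    condition=[1.0, 0.0],
    w_gamma=[[2.0, 0.0], [0.5, 0.0]],
    w_beta=[[0.0, 1.0], [1.0, 0.0]],
)
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], condition=[0.0, 0.0])
```

In this sketch the condition vector stands in for the encoded body-structure or object-context knowledge; in a real network the weights would be learned and the features would be spatial maps rather than flat vectors.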
Year: 2020
DOI: 10.1145/3372278.3390671
Venue: ICMR '20: International Conference on Multimedia Retrieval, Dublin, Ireland, June 2020
DocType: Conference
ISBN: 978-1-4503-7087-5
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name          Order  Citations  PageRank
Xu Sun        1      0          0.34
Xinwen Hu     2      2          1.77
Tongwei Ren   3      328        30.22
Gang-Shan Wu  4      27         6.75