Title
Mask and Predict: Multi-step Reasoning for Scene Graph Generation
Abstract
Scene Graph Generation (SGG) aims to parse an image into a set of semantics comprising objects and their relations. Current SGG methods stop at intuitive detections, such as the triplet "logo on board". Humans, however, can refine such intuitive detections into rational descriptions like "flower painted on surfboard". Most existing methods formulate SGG as a single-pass task that predicts all semantics in one shot, which confines them to this intuitive level. To address this problem, we propose a novel multi-step reasoning scheme for SGG. Concretely, we break SGG into two explicit learning stages: an intuitive training stage (ITS) and a rational training stage (RTS). In the first stage, we follow the traditional SGG pipeline to detect objects and relationships, yielding an intuitive scene graph. In the second stage, we perform multi-step reasoning to refine the intuitive scene graph. Each reasoning step consists of two operations: mask and predict. Guided by the current predictions and their confidences, we repeatedly select and mask the low-confidence predictions, whose features are then re-optimized and predicted again. After several iterations, the intuitive semantics are gradually revised into high-confidence predictions, yielding a rational scene graph. Extensive experiments on Visual Genome demonstrate the superiority of the proposed method, and additional ablation studies and visualization cases further validate its effectiveness.
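A minimal sketch of the mask-and-predict refinement loop described in the abstract, written in PyTorch. The function and module names (predictor, context_encoder), the confidence threshold, and the number of refinement steps are assumptions for illustration, not the authors' implementation.

```python
import torch

def mask_and_predict(relation_feats, predictor, context_encoder,
                     num_steps=3, conf_threshold=0.5):
    """Hypothetical sketch of iterative mask-and-predict refinement.

    relation_feats:  [N, D] features for N subject-object pairs
    predictor:       maps pair features -> predicate logits
    context_encoder: re-optimizes features of masked (low-confidence)
                     pairs conditioned on the high-confidence context
    """
    logits = predictor(relation_feats)          # intuitive (first-pass) predictions
    for _ in range(num_steps):
        probs = logits.softmax(dim=-1)
        conf, _ = probs.max(dim=-1)             # confidence of each prediction
        low = conf < conf_threshold             # select low-confidence pairs to mask
        if not low.any():
            break                               # all predictions are already confident
        # Re-optimize the masked pair features using the confident context,
        # then predict the masked pairs again.
        refined = context_encoder(relation_feats, mask=low)
        logits[low] = predictor(refined[low])
        relation_feats = refined
    return logits                               # rational (refined) predictions
```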
Year: 2021
DOI: 10.1145/3474085.3475545
Venue: International Multimedia Conference
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 7
Name | Order | Citations | PageRank
Hongshuo Tian | 1 | 0 | 1.35
Ning Xu | 2 | 59 | 6.63
Anan Liu | 3 | 823 | 62.46
Chenggang Yan | 4 | 410 | 32.87
Zhendong Mao | 5 | 307 | 25.18
Quan Zhang | 6 | 0 | 0.68
Yongdong Zhang | 7 | 263 | 27.77