Title
Multimodal graph inference network for scene graph generation
Abstract
A scene graph can describe an image concisely and structurally. However, existing scene graph generation methods are poor at inferring certain relationships because they lack semantic information and depend heavily on the statistical distribution of the training set. To alleviate these problems, this study proposes a Multimodal Graph Inference Network (MGIN) comprising two modules: Multimodal Information Extraction (MIE) and Target with Multimodal Feature Inference (TMFI). MGIN improves the inference of relationship triplets, especially for uncommon samples. In the MIE module, prior statistical knowledge of the training set is incorporated into the network through a reprocessing step to relieve overfitting to the training set. Visual and semantic features are extracted by the MIE module and fused into unified multimodal features in the TMFI module. These features enable the inference module to raise the prediction capability of MGIN, particularly for uncommon samples. The proposed method achieves 27.0% average mean recall and 55.9% average recall, improvements of 0.48% and 0.50%, respectively, over state-of-the-art methods. It also increases the average recall of the 20 lowest-probability relationships by 4.91%.
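The abstract describes fusing per-object visual and semantic features into unified multimodal features before relationship inference. The following is a minimal sketch of that kind of visual-semantic fusion, not the authors' TMFI implementation; the layer sizes, the concatenation-plus-MLP fusion scheme, and all names are assumptions for illustration only.

```python
# Hypothetical sketch of visual-semantic feature fusion (assumed design,
# not the MGIN/TMFI architecture from the paper).
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, visual_dim=2048, semantic_dim=300, fused_dim=512):
        super().__init__()
        # Project both modalities into a shared space (dimensions assumed).
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.semantic_proj = nn.Linear(semantic_dim, fused_dim)
        # Combine the projected features into one multimodal vector.
        self.fuse = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(),
        )

    def forward(self, visual_feat, semantic_feat):
        v = torch.relu(self.visual_proj(visual_feat))
        s = torch.relu(self.semantic_proj(semantic_feat))
        return self.fuse(torch.cat([v, s], dim=-1))

# Example: 16 detected objects, each with a 2048-d visual feature from a
# detector backbone and a 300-d word embedding of its predicted label.
fusion = MultimodalFusion()
visual = torch.randn(16, 2048)
semantic = torch.randn(16, 300)
multimodal = fusion(visual, semantic)  # shape: (16, 512)
```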
Year
2021
DOI
10.1007/s10489-021-02304-7
Venue
APPLIED INTELLIGENCE
Keywords
Scene graph generation, Visual relationship detection, Image understanding, Semantic analysis
DocType
Journal
Volume
51
Issue
12
ISSN
0924-669X
Citations
0
PageRank
0.34
References
0
Authors
5
Name            Order   Citations   PageRank
Jingwen Duan    1       0           0.34
Weidong Min     2       40          9.44
Deyu Lin        3       0           0.68
Jianfeng Xu     4       0           1.35
Xin Xiong       5       6           2.18