Title
Multimodal graph inference network for scene graph generation
Abstract
A scene graph can describe an image concisely and structurally. However, existing scene graph generation methods are poor at inferring certain relationships because they lack semantic information and depend heavily on the statistical distribution of the training set. To alleviate these problems, this study proposes a Multimodal Graph Inference Network (MGIN) comprising two modules: Multimodal Information Extraction (MIE) and Target with Multimodal Feature Inference (TMFI). MGIN improves the inference of relationship triplets, especially for uncommon samples. In the MIE module, prior statistical knowledge of the training set is incorporated into the network through a reprocessing step to relieve overfitting to the training set. Visual and semantic features are extracted by the MIE module and fused into unified multimodal features in the TMFI module. These features enable the inference module to raise the prediction capability of MGIN, particularly for uncommon samples. The proposed method achieves 27.0% average mean recall and 55.9% average recall, improvements of 0.48% and 0.50%, respectively, over state-of-the-art methods. It also increases the average recall of the 20 lowest-probability relationships by 4.91%.
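The abstract describes fusing per-object visual and semantic features into unified multimodal features before relationship inference. The following is a minimal sketch of that kind of visual-semantic fusion, not the authors' TMFI implementation; the layer sizes, the concatenation-plus-MLP fusion scheme, and all names are assumptions for illustration only.

```python
# Hypothetical sketch of visual-semantic feature fusion (assumed design,
# not the MGIN/TMFI architecture from the paper).
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, visual_dim=2048, semantic_dim=300, fused_dim=512):
        super().__init__()
        # Project both modalities into a shared space (dimensions assumed).
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.semantic_proj = nn.Linear(semantic_dim, fused_dim)
        # Combine the projected features into one multimodal vector.
        self.fuse = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(),
        )

    def forward(self, visual_feat, semantic_feat):
        v = torch.relu(self.visual_proj(visual_feat))
        s = torch.relu(self.semantic_proj(semantic_feat))
        return self.fuse(torch.cat([v, s], dim=-1))

# Example: 16 detected objects, each with a 2048-d visual feature from a
# detector backbone and a 300-d word embedding of its predicted label.
fusion = MultimodalFusion()
visual = torch.randn(16, 2048)
semantic = torch.randn(16, 300)
multimodal = fusion(visual, semantic)  # shape: (16, 512)
```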
Year
2021
DOI
10.1007/s10489-021-02304-7
Venue
APPLIED INTELLIGENCE
Keywords
Scene graph generation, Visual relationship detection, Image understanding, Semantic analysis
DocType
Journal
Volume
51
Issue
12
ISSN
0924-669X
Citations
0
PageRank
0.34
References
0
Authors
5
Name            Order   Citations   PageRank
Jingwen Duan    1       0           0.34
Weidong Min     2       40          9.44
Deyu Lin        3       0           0.68
Jianfeng Xu     4       0           1.35
Xin Xiong       5       6           2.18