DIFNet: Boosting Visual Information Flow for Image Captioning - Citegraph

Paper Info

Title
DIFNet: Boosting Visual Information Flow for Image Captioning

Abstract
Current Image Captioning (IC) methods predict textual words sequentially based on the input visual information from the visual feature extractor and the partially generated sentence information. However, in most cases, the partially generated sentence may dominate the target word prediction due to the insufficiency of visual information, making the generated descriptions irrelevant to the content of the given image. In this paper, we propose a Dual Information Flow Network (DIFNet; source code is available at https://github.com/mrwu-mac/DIFNet) to address this issue, which takes segmentation features as another visual information source to enhance the contribution of visual information to prediction. To maximize the use of the two information flows, we also propose an effective feature fusion module termed Iterative Independent Layer Normalization (IILN), which can condense the most relevant inputs while retaining modality-specific information in each flow. Experiments show that our method enhances the dependence of prediction on visual information, making word prediction more focused on the visual content, and thus achieves new state-of-the-art performance on the MSCOCO dataset, e.g., 136.2 CIDEr on the COCO Karpathy test split.

Field	Value
Year	2022
DOI	10.1109/CVPR52688.2022.01749
Venue	IEEE Conference on Computer Vision and Pattern Recognition
Keywords	Vision + language
DocType	Conference
Volume	2022
Issue	1
Citations	0
PageRank	0.34
References	0
Authors	8

Authors (8 rows)

Name	Order	Citations	PageRank
Mingrui Wu	1	0	0.34
Xuying Zhang	2	0	1.01
Xiaoshuai Sun	3	623	58.76
Yiyi Zhou	4	0	0.68
Chao Chen	5	0	0.68
Jiaxin Gu	6	0	0.34
Sun Xing	7	33	10.94
Rongrong Ji	8	0	0.34
