Exploring Pairwise Relationships Adaptively From Linguistic Context in Image Captioning - Citegraph

Paper Info

Title
Exploring Pairwise Relationships Adaptively From Linguistic Context in Image Captioning

Abstract
In image captioning, recent work has begun to explore visual relationships in order to generate high-quality interactive words (i.e., verbs and prepositions). However, many existing methods operate only at the semantic level, analysing feature similarity between objects in the visual domain while ignoring the linguistic context held in the caption decoder. During captioning, entity words can be inferred from the visual information of objects, whereas the interactive words that represent relationships between entity words can only be inferred from the high-level language meaning generated during caption decoding. This high-level language meaning is called linguistic context: the relational context between words or phrases in the caption sentence. Linguistic context can serve as strong guidance for exploring the relevant visual relationships between objects. To this end, we propose a novel context-adaptive attention module that is strongly driven by the linguistic context from the caption decoder. Within this module, a novel visual relationship attention, based on a bilinear self-attention model, explores related visual relationships and encodes more discriminative features under the linguistic context. To adaptively attend either to related visual relationships (for generating interactive words) or to related visual objects (for entity words), an attention modulator is integrated as an attention channel controller that responds dynamically to the changing linguistic context of the caption decoder. In experiments on the MSCOCO dataset, our model achieves promising performance compared with all counterpart models that explore visual relationships.

Field	Value
Year	2022
DOI	10.1109/TMM.2021.3093725
Venue	IEEE TRANSACTIONS ON MULTIMEDIA
Keywords	Visualization, Linguistics, Decoding, Modulation, Context modeling, Adaptation models, Semantics, Bilinear attention, bilinear self-attention, context-adaptive attention, dynamic linguistic context, image captioning, visual relationship attention
DocType	Journal
Volume	24
ISSN	1520-9210
Citations	0
PageRank	0.34
References	0
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Zongjian Zhang	1	0	0.34
Qiang Wu	2	304	40.42
Yang Wang	3	9	6.83
Fang Chen	4	0	0.34
