Title
Cross-Domain Modality Fusion for Dense Video Captioning
Abstract
Dense video captioning requires localizing and describing multiple events in long videos. Prior works detect events by relying solely on the visual content, completely ignoring the semantics (captions) associated with the events. This is undesirable because human-provided captions often describe events that are not visually present or are too subtle to detect. In this research, we propose to capitalize on this natural kinship between events and their human-provided descriptions. We propose a semantic contextualization network that encodes the visual content of videos by representing it in a semantic space. The representation is further refined to incorporate temporal information and transformed into event descriptors using a hierarchical application of the short Fourier transform. Our proposal network exploits the fusion of semantic and visual content, enabling it to generate semantically meaningful event proposals. For each proposed event, we attentively fuse its hidden state and descriptors to compute a discriminative representation for the subsequent captioning network. Thorough experiments on the standard large-scale ActivityNet Captions dataset, and additionally on the YouCook-II dataset, show that our method achieves competitive or better performance on multiple popular metrics for the problem.
Year
DOI
Venue
2022
10.1109/TAI.2021.3134190
IEEE Transactions on Artificial Intelligence
Keywords
Context modeling, dense video captioning (DVC), event localization, language and vision, video captioning
DocType
Journal
Volume
3
Issue
5
ISSN
2691-4581
Citations
0
PageRank
0.34
References
29
Authors
5
Name            Order  Citations  PageRank
Nayyer Aafaq    1      0          0.34
A. Mian         2      1679       84.89
Wei Liu         3      14         4.57
Naveed Akhtar   4      184        12.43
Mubarak Shah    5      16522      943.74