Title
Video Captioning Via A Symmetric Bidirectional Decoder
Abstract
The dominant video captioning methods employ an attentional encoder-decoder architecture in which the decoder is an autoregressive structure that generates sentences from left to right. However, these methods generally suffer from the exposure bias issue and neglect the guidance of future output contexts obtained from right-to-left decoding. Here, the authors propose a new symmetric bidirectional decoder for video captioning. The authors first integrate self-attentive multi-head attention with a bidirectional gated recurrent unit to capture long-term semantic dependencies in videos. The authors then apply a single decoder to generate accurate descriptions from left to right and right to left simultaneously. In each decoding direction, the decoder performs two cross-attentive multi-head attention modules to consider both the past hidden states from the same decoding direction and the future hidden states from the reverse decoding direction at each time step. A symmetric semantic-guided gated attention module is specially devised to adaptively suppress irrelevant or misleading content in the past or future output contexts and retain the useful content, thereby avoiding under-description. Experimental evaluations on two widely used benchmark datasets, Microsoft Research Video to Text (MSR-VTT) and the Microsoft Video Description Corpus (MSVD), demonstrate that the proposed method achieves state-of-the-art performance, which validates the superiority of the bidirectional decoder.
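Below is a minimal, hypothetical sketch (in PyTorch) of the gated fusion mechanism the abstract describes: a decoder state cross-attends to the past hidden states from its own decoding direction and to the future hidden states from the reverse direction, and a semantic-guided sigmoid gate blends the two contexts. The module, parameter, and dimension names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the past/future gated context fusion described in the abstract.
# All names (SymmetricGatedFusion, d_model, etc.) are assumptions for illustration.
import torch
import torch.nn as nn


class SymmetricGatedFusion(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # Cross-attention over states already produced in the same decoding direction.
        self.past_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention over states produced by the reverse-direction decoder.
        self.future_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Semantic-guided gate deciding how much of each context to keep.
        self.gate = nn.Linear(3 * d_model, d_model)

    def forward(self, query, past_states, future_states):
        # query:         (B, 1, D)  current decoder state
        # past_states:   (B, Tp, D) past hidden states, same decoding direction
        # future_states: (B, Tf, D) future hidden states, reverse decoding direction
        past_ctx, _ = self.past_attn(query, past_states, past_states)
        future_ctx, _ = self.future_attn(query, future_states, future_states)
        g = torch.sigmoid(self.gate(torch.cat([query, past_ctx, future_ctx], dim=-1)))
        # The gate suppresses misleading content from one context and keeps the useful part.
        return g * past_ctx + (1.0 - g) * future_ctx


if __name__ == "__main__":
    fuse = SymmetricGatedFusion(d_model=512)
    q = torch.randn(2, 1, 512)
    out = fuse(q, torch.randn(2, 5, 512), torch.randn(2, 7, 512))
    print(out.shape)  # torch.Size([2, 1, 512])
```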
Year
2021
DOI
10.1049/cvi2.12043
Venue
IET COMPUTER VISION
DocType
Journal
Volume
15
Issue
4
ISSN
1751-9632
Citations
0
PageRank
0.34
References
0
Authors
2
Name         Order  Citations  PageRank
Shanshan Qi  1      0          0.34
Luxi Yang    2      1180       118.08