Abstract |
---|
The dominant video captioning methods employ an attentional encoder-decoder architecture in which the decoder is autoregressive, generating sentences from left to right. However, these methods generally suffer from the exposure bias issue and neglect the guidance of future output contexts obtained from right-to-left decoding. Here, the authors propose a new symmetric bidirectional decoder for video captioning. They first integrate self-attentive multi-head attention with a bidirectional gated recurrent unit to capture long-term semantic dependencies in videos. They then apply a single decoder to generate accurate descriptions from left to right and from right to left simultaneously. In each decoding direction, the decoder performs two cross-attentive multi-head attention modules to consider, at each time step, both the past hidden states from the same decoding direction and the future hidden states from the reverse decoding direction. A symmetric semantic-guided gated attention module is specially devised to adaptively suppress irrelevant or misleading content in the past or future output contexts while retaining the useful parts, thereby avoiding under-description. Experimental evaluations on two widely used benchmark datasets, Microsoft Research Video to Text and the Microsoft Video Description corpus, demonstrate that the proposed method achieves state-of-the-art performance, validating the superiority of the bidirectional decoder. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1049/cvi2.12043 | IET COMPUTER VISION |

DocType | Volume | Issue |
---|---|---|
Journal | 15 | 4 |

ISSN | Citations | PageRank |
---|---|---|
1751-9632 | 0 | 0.34 |

References | Authors |
---|---|
0 | 2 |

Name | Order | Citations | PageRank |
---|---|---|---|
Shanshan Qi | 1 | 0 | 0.34 |
Luxi Yang | 2 | 1180 | 118.08 |