Title |
---|
Enhancing the alignment between target words and corresponding frames for video captioning |
Abstract (highlights) |
---|
Visual tags are introduced to bridge the gap between vision and language. A textual-temporal attention model is devised and incorporated into the decoder to build exact alignment between target words and corresponding frames. Extensive experiments on two well-known datasets, i.e., MSVD and MSR-VTT, demonstrate that the proposed approach achieves remarkable improvements over state-of-the-art methods. |
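The highlights mention a textual-temporal attention model in the decoder but give no formulation. As a rough, illustrative sketch only, standard temporal attention over per-frame features conditioned on the decoder state (which the paper extends with textual conditioning on target words) can be written as follows; all names (`temporal_attention`, `W_f`, `W_h`, `w`) are hypothetical, not taken from the paper:

```python
import numpy as np

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Illustrative temporal attention (not the paper's exact model):
    score each frame against the current decoder state, softmax the
    scores over time, and return the weighted context vector."""
    # frame_feats: (T, d_f) per-frame features; decoder_state: (d_h,)
    scores = np.tanh(frame_feats @ W_f + decoder_state @ W_h) @ w  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over the T frames
    context = weights @ frame_feats     # (d_f,) attention-weighted sum
    return context, weights
```

At each decoding step the context vector would be fed to the word predictor, so different target words can attend to different frames; the paper's textual-temporal variant additionally uses textual cues to sharpen this word-to-frame alignment.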
Year | DOI | Venue
---|---|---
2021 | 10.1016/j.patcog.2020.107702 | Pattern Recognition

Keywords | DocType | Volume
---|---|---
Video captioning, Alignment, Visual tags, Textual-temporal attention | Journal | 111

Issue | ISSN | Citations
---|---|---
1 | 0031-3203 | 4

PageRank | References | Authors
---|---|---
0.39 | 0 | 5
Name | Order | Citations | PageRank
---|---|---|---
Yunbin Tu | 1 | 30 | 2.85 |
Chang Zhou | 2 | 4 | 0.39 |
Junjun Guo | 3 | 5 | 4.47 |
Shengxiang Gao | 4 | 5 | 5.17 |
Zhengtao Yu | 5 | 460 | 69.08 |