Abstract |
---|
In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs a boundary-aware pooling operation on the scores from multihead attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss brought by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics. |
Year | DOI | Venue |
---|---|---|
2020 | 10.24963/ijcai.2020/88 | IJCAI 2020 |
DocType | Citations | PageRank |
---|---|---|
Conference | 4 | 0.39 |
References | Authors |
---|---|
0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Tao Jin | 1 | 17 | 6.96 |
Siyu Huang | 2 | 41 | 7.23 |
Ming Chen | 3 | 581 | 85.60 |
Yingming Li | 4 | 57 | 14.82 |
Zhongfei (Mark) Zhang | 5 | 2451 | 164.30 |