SBAT: Video Captioning with Sparse Boundary-Aware Transformer - Citegraph

Paper Info

Title
SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Abstract
In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features are highly redundant across time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT applies a boundary-aware pooling operation to the scores from multi-head attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss caused by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods on most of the metrics.

Year	DOI	Venue	DocType	Citations	PageRank	References	Authors
2020	10.24963/ijcai.2020/88	IJCAI 2020	Conference	4	0.39	0	5

Authors (5 rows)

Name	Order	Citations	PageRank
Tao Jin	1	17	6.96
Siyu Huang	2	41	7.23
Ming Chen	3	581	85.60
Yingming Li	4	57	14.82
Zhongfei (Mark) Zhang	5	2451	164.30
