Title
Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos.
Abstract
The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in video understanding. Existing studies either slide a window over the entire video or exhaustively rank all possible clip-sentence pairs in a pre-segmented video, and thus inevitably suffer from enumerating a large number of candidates. To alleviate this problem, we formulate the task as a sequential decision-making problem and learn an agent that progressively adjusts the temporal grounding boundaries according to its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, which shows steady performance gains when additional supervised boundary information is incorporated during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.
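To make the sequential decision process concrete, below is a minimal, hypothetical sketch of the kind of boundary-adjustment loop the abstract describes. The action set, step size, and all names are illustrative assumptions, not the authors' implementation, and the random stub stands in for a learned policy network that would be trained with policy gradients plus the supervised boundary-regression loss mentioned above.

import random

# Assumed action set: the agent can move or rescale its current
# grounding window, or stop when it believes the window matches.
ACTIONS = ["shift_left", "shift_right", "expand", "shrink", "stop"]

def policy(state):
    """Stand-in for the learned policy: maps the current
    (window, query) observation to an action. Random here."""
    return random.choice(ACTIONS)

def ground(video_len, query, max_steps=10, delta=0.1):
    # Start from an initial window covering the middle of the video,
    # then adjust it for at most max_steps steps, so the agent only
    # ever observes a handful of clips rather than all candidates.
    start, end = 0.25 * video_len, 0.75 * video_len
    for _ in range(max_steps):
        action = policy((start, end, query))
        if action == "stop":
            break
        step = delta * video_len
        if action == "shift_left":
            start, end = start - step, end - step
        elif action == "shift_right":
            start, end = start + step, end + step
        elif action == "expand":
            start, end = start - step, end + step
        elif action == "shrink":
            start, end = start + step, end - step
        # Keep the window inside the video and non-degenerate.
        start = max(0.0, start)
        end = min(float(video_len), max(end, start + step))
    return start, end

print(ground(120, "the person opens the door"))

At training time, a natural reward for such an agent is the temporal IoU between the predicted window and the ground-truth segment, with the multi-task variant adding a supervised regression loss on the boundaries.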
Year
2019
Venue
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE
Field
Sliding window protocol, Ranking, Computer science, Natural language, Ground, Artificial intelligence, Machine learning, Reinforcement learning, CLIPS
DocType
Journal
Volume
abs/1901.06829
Citations
1
PageRank
0.35
References
0
Authors
6
Name          Order  Citations  PageRank
He, D.        1      331        3.67
Xiang Zhao    2      1          4.40
Jizhou Huang  3      58         7.65
Fu Li         4      25         8.88
Xiao Liu      5      2844       1.90
Shilei Wen    6      791        3.59