Title
Learning frame-level affinity with video-level labels for weakly supervised temporal action detection
Abstract
Weakly supervised temporal action detection aims to localize actions using only video-level labels rather than large numbers of frame-level labels. To this end, previous methods train a classification network to mine discernible action frames as detection results. However, a classification network is known to concentrate on locally discernible frames rather than the entire action instance. As a result, many indiscernible action frames go undetected and the detection results are incomplete. To alleviate this issue, we propose a novel method that facilitates the detection of indiscernible frames by learning frame-level affinities. In the proposed method, we design a network (named Affinity Network) that predicts affinities between pairs of adjacent frames. The affinities are then used as transition probabilities to propagate local responses to indiscernible frames, which enhances the responses of those frames and makes them easier to detect. To train the network, we propose strategies for synthesizing frame-pair and video-pair training samples, which are conducive to learning frame-level affinities with only video-level labels. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that our framework outperforms most previous weakly supervised action detection methods and is competitive with some fully supervised action detection methods.
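The propagation step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the self-affinity of 1.0, the row normalization, the number of steps, and the function names build_transition and propagate are all assumptions made for illustration. It only shows how adjacent-frame affinities, treated as transition probabilities, can spread a strong response to neighbouring frames.

```python
def build_transition(affinities):
    """Build a row-normalized tridiagonal transition matrix.

    affinities[t] is the (assumed) affinity between frame t and frame t + 1,
    so a video with n frames has n - 1 affinities.
    """
    n = len(affinities) + 1
    matrix = [[0.0] * n for _ in range(n)]
    for t in range(n):
        matrix[t][t] = 1.0  # self-affinity: an assumption of this sketch
        if t > 0:
            matrix[t][t - 1] = affinities[t - 1]
        if t < n - 1:
            matrix[t][t + 1] = affinities[t]
        row_sum = sum(matrix[t])
        matrix[t] = [value / row_sum for value in matrix[t]]
    return matrix


def propagate(responses, affinities, steps=1):
    """Propagate per-frame responses along the frame axis for `steps` rounds."""
    if len(responses) != len(affinities) + 1:
        raise ValueError("expected one affinity per pair of adjacent frames")
    transition = build_transition(affinities)
    current = list(responses)
    for _ in range(steps):
        current = [
            sum(p * r for p, r in zip(row, current)) for row in transition
        ]
    return current


# Frame 1 is discernible; frames 0 and 2 share high affinity with it,
# while frame 3 has zero affinity with frame 2.
scores = propagate([0.0, 0.9, 0.0, 0.0], [0.8, 0.8, 0.0], steps=1)
print([round(s, 3) for s in scores])
```

In this toy input, the frames with high affinity to the discernible frame receive a non-zero response after one step, while the frame separated by a zero affinity stays at zero.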
Year
2021
DOI
10.1016/j.neucom.2021.07.059
Venue
Neurocomputing
Keywords
Video understanding, Temporal action detection, Weakly supervised learning
DocType
Journal
Volume
463
ISSN
0925-2312
Citations
0
PageRank
0.34
References
0
Authors
4
Name           Order   Citations   PageRank
Bairong Li     1       2           1.38
Zhu Yuesheng   2       112         39.21
Ruixin Liu     3       2           3.41
Zhenyu Weng    4       0           0.68