Abstract |
---|
Many human activities take minutes to unfold. To represent them, some related works opt for statistical pooling, which neglects the temporal structure. Others opt for convolutional methods, such as CNN and Non-Local. While successful in learning temporal concepts, these methods fall short of modeling minutes-long temporal dependencies. We propose VideoGraph, a method that achieves the best of both worlds: it represents minutes-long human activities and learns their underlying temporal structure. VideoGraph learns a graph-based representation for human activities. The graph, its nodes, and its edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation. The result is improvements over related works on two benchmarks: Epic-Kitchen and Breakfast. In addition, we demonstrate that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos. |

Year | Venue | DocType |
---|---|---|
2019 | arXiv: Computer Vision and Pattern Recognition | Journal |

Volume | Citations | PageRank |
---|---|---|
abs/1905.05143 | 2 | 0.35 |

References | Authors |
---|---|
0 | 3 |

Name | Order | Citations | PageRank |
---|---|---|---|
Noureldien Hussein | 1 | 12 | 1.50 |
Efstratios Gavves | 2 | 655 | 33.41 |
Arnold W. M. Smeulders | 3 | 6000 | 453.43 |