Abstract
---
This paper presents a novel framework to combine multiple layers and modalities of deep neural networks for video classification. We first propose a multilayer strategy to simultaneously capture a variety of levels of abstraction and invariance in a network, where the convolutional and fully connected layers are effectively represented by our proposed feature aggregation methods. We further introduce a multimodal scheme that includes four highly complementary modalities to extract diverse static and dynamic cues at multiple temporal scales. In particular, for modeling the long-term temporal information, we propose a new structure, FC-RNN, to effectively transform pre-trained fully connected layers into recurrent layers. A robust boosting model is then introduced to optimize the fusion of multiple layers and modalities in a unified way. In extensive experiments, we achieve state-of-the-art results on two public benchmark datasets: UCF101 and HMDB51.
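The abstract's FC-RNN idea, turning a pre-trained fully connected layer into a recurrent layer, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`fc_to_rnn`, `fc_rnn_step`), the ReLU nonlinearity, and the small-random initialization of the new recurrent matrix are assumptions for the sketch; the core idea shown is that the pre-trained weights `W_fc`, `b_fc` become the input-to-hidden parameters and only the hidden-to-hidden matrix `U_r` is introduced fresh.

```python
import numpy as np

def fc_to_rnn(W_fc, b_fc, seed=0):
    """Create the only new parameter needed to recurrentize an FC layer:
    a hidden-to-hidden matrix U_r (here: small random init, an assumption).
    W_fc (n_out x n_in) and b_fc (n_out) are reused unchanged."""
    rng = np.random.default_rng(seed)
    n_out = W_fc.shape[0]
    return 0.01 * rng.standard_normal((n_out, n_out))

def fc_rnn_step(x_t, h_prev, W_fc, b_fc, U_r):
    """One FC-RNN step: h_t = f(W_fc @ x_t + U_r @ h_prev + b_fc).
    With U_r = 0 and h_prev = 0 this reduces to the original FC layer."""
    return np.maximum(0.0, W_fc @ x_t + U_r @ h_prev + b_fc)

# Example: run a sequence of per-frame features through the layer.
W = np.ones((4, 3))          # stand-in for pre-trained FC weights
b = np.zeros(4)
U = fc_to_rnn(W, b)
h = np.zeros(4)              # initial hidden state
for x in [np.ones(3), 2 * np.ones(3)]:   # two "frames" of features
    h = fc_rnn_step(x, h, W, b, U)
```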
Year | DOI | Venue |
---|---|---
2016 | 10.1145/2964284.2964297 | ACM Multimedia |
Keywords | Field | DocType
---|---|---
Video Classification, Deep Neural Networks, Boosting, Fusion, CNN, RNN | Modalities, Temporal scales, Abstraction, Invariant (physics), Computer science, Fusion, Boosting (machine learning), Artificial intelligence, Feature aggregation, Deep neural networks, Machine learning | Conference
Citations | PageRank | References
---|---|---
24 | 0.78 | 30
Authors
---
3
Name | Order | Citations | PageRank |
---|---|---|---
Xiaodong Yang | 1 | 1094 | 41.92 |
Pavlo O. Molchanov | 2 | 198 | 11.96 |
Jan Kautz | 3 | 3615 | 198.77 |