Abstract |
---|
Human action recognition is an important task in computer vision, and deep learning methods for video action recognition have developed rapidly in recent years. A popular approach is the family of two-stream methods, which take both spatial and temporal modalities into consideration. These methods typically take sparsely sampled frames as input and use video-level labels as supervision. Because of this sampling strategy, they are usually limited to processing short sequences and can be confused by partial observations. In this paper we propose a novel video feature representation method, called Deep Temporal Feature Encoding (DTE), which aggregates frame-level features into a robust, global video-level representation. First, we sample a sufficient number of RGB frames and optical-flow stacks across the whole video. Then we use a deep temporal feature encoding layer to construct a strong video feature. Finally, we apply end-to-end training so that the video representation is global and sequence-aware. Comprehensive experiments are conducted on two public datasets, HMDB51 and UCF101. Experimental results demonstrate that DTE achieves performance competitive with the state of the art on both datasets. |
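The abstract does not specify the internal form of the temporal feature encoding layer. As a rough illustration only, the following minimal NumPy sketch shows one common way to aggregate frame-level features into a single video-level vector: softmax-weighted temporal pooling. The weight vector `w` and the pooling form are assumptions for illustration, not the paper's actual DTE layer.

```python
import numpy as np

def encode_video(frame_features, w):
    """Aggregate frame-level features (T x D) into one video-level
    vector via softmax-weighted pooling over time. This is an
    illustrative stand-in for a learned temporal encoding layer;
    the paper's DTE layer is not detailed in the abstract."""
    scores = frame_features @ w                 # (T,) one score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over the time axis
    return weights @ frame_features             # (D,) global video feature

# Toy example: 8 sampled frames, 4-dimensional per-frame features
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4))
w = rng.standard_normal(4)                      # assumed learned projection
video_feat = encode_video(feats, w)
print(video_feat.shape)  # (4,)
```

In an end-to-end setting, `w` (and the frame-feature extractor) would be trained jointly against video-level labels, so the pooled representation becomes sequence-aware rather than a fixed average.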
Year | DOI | Venue
---|---|---
2018 | 10.1109/ICPR.2018.8546263 | 2018 24th International Conference on Pattern Recognition (ICPR)
Field | DocType | ISSN
---|---|---
Modalities, Computer vision, Task analysis, Pattern recognition, Computer science, Feature extraction, Sampling (statistics), RGB color model, Artificial intelligence, Deep learning, Optical flow, Encoding (memory) | Conference | 1051-4651
Citations | PageRank | References
---|---|---
0 | 0.34 | 0
Authors (4)
Name | Order | Citations | PageRank
---|---|---|---
Lin Li | 1 | 4 | 1.06 |
Zhaoxiang Zhang | 2 | 1022 | 99.76 |
Yan Huang | 3 | 226 | 27.65 |
Liang Wang | 4 | 4317 | 243.28 |