Abstract | ||
---|---|---|
Semantic video segmentation is a challenging task that requires fine-grained semantic understanding of video data. In this paper, we present a jointly trained deep learning framework that makes the best use of spatial and temporal information for semantic video segmentation. Along the spatial dimension, a hierarchically supervised deconvolutional neural network (HDCNN) is proposed to conduct pixel-wise semantic interpretation of single video frames. HDCNN is constructed from the convolutional layers of VGG-net and their mirrored deconvolutional structure, with all fully connected layers removed. Hierarchical classification layers are added to the multi-scale deconvolutional features to introduce more contextual information for pixel-wise semantic interpretation. In addition, a coarse-to-fine training strategy is adopted to enhance the performance of foreground object segmentation in videos. Along the temporal dimension, we introduce Transition Layers on top of the HDCNN structure to make the pixel-wise label predictions consistent with adjacent pixels across the space and time domains. The learning process of the Transition Layers can be implemented as a set of extra convolutional calculations connected with HDCNN. These two parts are jointly trained as a unified deep network in our approach. Thorough evaluations are performed on two challenging video datasets, CamVid and GATECH. Our approach achieves state-of-the-art performance on both datasets.
A unified deep learning framework is proposed to employ spatio-temporal information for semantic video segmentation. A hierarchically supervised deconvolutional network is proposed to conduct semantic segmentation of single video frames. A coarse-to-fine training strategy is adopted to improve foreground object segmentation. Transition Layers are introduced to make the label predictions consistent with adjacent pixels across the space and time domains. State-of-the-art performance is achieved on two datasets, CamVid and GATECH. |
Year | DOI | Venue |
---|---|---|
2017 | 10.1016/j.patcog.2016.09.046 | Pattern Recognition |
Keywords | Field | DocType |
Semantic video segmentation, Deconvolutional neural network, Coarse-to-fine training, Spatio-temporal consistence | Contextual information, Pattern recognition, Computer science, Segmentation, Semantic interpretation, Artificial intelligence, Pixel, Deep learning, Artificial neural network, Machine learning | Journal |
Volume | Issue | ISSN |
64 | C | 0031-3203 |
Citations | PageRank | References |
10 | 0.50 | 24 |
Authors | ||
6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yuhang Wang | 1 | 204 | 14.84 |
Jing Liu | 2 | 1781 | 88.09 |
Yong Li | 3 | 254 | 28.66 |
Jun Fu | 4 | 157 | 7.24 |
Min Xu | 5 | 398 | 36.60 |
Hanqing Lu | 6 | 4620 | 291.38 |