Abstract |
---|
We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline. In contrast to prior works, we strive towards a simple, fast, and general module that can be integrated into virtually any single-frame architecture. Our approach aggregates a rich representation of the semantic information in past frames into a memory module. Information stored in the memory is then accessed through an attention mechanism. In contrast to previous memory-based approaches, we propose a fast local attention layer, providing temporal appearance cues in the local region of prior frames. We further fuse these cues with an encoding of the current frame through a second attention-based module. The segmentation decoder processes the fused representation to predict the final semantic segmentation. We integrate our approach into two popular semantic segmentation networks: ERFNet and PSPNet. We observe improvements in segmentation performance on Cityscapes of 1.7% and 2.1% mIoU, respectively, while increasing the inference time of ERFNet by only 1.5 ms. Source code is available at https://github.com/mattpfr/lmanet |
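
The local attention idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes scaled dot-product similarity, a square window of radius `radius`, and plain Python lists for feature maps; the function name `local_attention` and all parameter names are hypothetical.

```python
import math


def local_attention(query, mem_keys, mem_values, radius=1):
    """Attend from each query position to a local window of one memory frame.

    query, mem_keys: H x W grids of feature vectors (lists of floats).
    mem_values:      H x W grid of value vectors stored in the memory.
    Returns an H x W grid of attention-weighted value vectors.

    Assumption: similarity is a scaled dot product; the paper may use
    a different similarity measure.
    """
    height, width = len(query), len(query[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            q = query[y][x]
            scores, values = [], []
            # Compare only against positions inside the local window,
            # clipped at the frame borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    k = mem_keys[ny][nx]
                    dot = sum(a * b for a, b in zip(q, k))
                    scores.append(dot / math.sqrt(len(q)))
                    values.append(mem_values[ny][nx])
            # Numerically stable softmax over the window.
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            dim = len(values[0])
            row.append(
                [sum(w * v[d] for w, v in zip(weights, values)) / total
                 for d in range(dim)]
            )
        out.append(row)
    return out
```

Restricting the comparison to a small window keeps the cost per pixel proportional to the window size rather than to the full frame, which is one plausible reading of why a local layer is faster than attention over the whole memory.
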
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/IROS51168.2021.9636192 | 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) |
DocType | ISSN | Citations |
Conference | 2153-0858 | 0 |
PageRank | References | Authors |
0.34 | 0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Matthieu Paul | 1 | 0 | 1.01 |
Martin Danelljan | 2 | 1344 | 49.35 |
Luc Van Gool | 3 | 27566 | 1819.51 |
Radu Timofte | 4 | 1880 | 118.45 |