Abstract | ||
---|---|---|
We consider the task of 3D pose estimation and tracking of multiple people seen in an arbitrary number of camera feeds. We propose TesseTrack1, a novel top-down approach that simultaneously reasons about multiple individuals' 3D body joint reconstructions and associations in space and time in a single end-to-end learnable framework. At the core of our approach is a novel spatio-temporal formulation that operates in a common voxelized feature space aggregated from single- or multiple camera views. After a person detection step, a 4D CNN produces short-term person-specific representations which are then linked across time by a differentiable matcher. The linked descriptions are then merged and deconvolved into 3D poses. This joint spatio-temporal formulation contrasts with previous piecewise strategies that treat 2D pose estimation, 2D-to-3D lifting, and 3D pose tracking as independent sub-problems that are error-prone when solved in isolation. Furthermore, unlike previous methods, TesseTrack is robust to changes in the number of camera views and achieves very good results even if a single view is available at inference time. Quantitative evaluation of 3D pose reconstruction accuracy on standard benchmarks shows significant improvements over the state of the art. Evaluation of multi-person articulated 3D pose tracking in our novel evaluation framework demonstrates the superiority of TesseTrack over strong baselines. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/CVPR46437.2021.01494 | 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 |
DocType | ISSN | Citations |
Conference | 1063-6919 | 1 |
PageRank | References | Authors |
0.34 | 0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
N Dinesh Reddy | 1 | 1 | 0.34 |
Laurent Guigues | 2 | 1 | 0.34 |
Leonid Pishchulin | 3 | 1 | 0.34 |
Jayan Eledath | 4 | 1 | 1.02 |
Narasimhan, S.G. | 5 | 2348 | 169.35 |