Abstract |
---|
3D human pose estimation has made much progress with the development of convolutional neural networks. However, accurately estimating 3D joint locations from single-view images or videos remains challenging due to depth ambiguity and severe occlusion. Motivated by the effectiveness of vision transformers in computer vision tasks, we present a novel U-shaped spatial–temporal transformer-based network (U-STN) for 3D human pose estimation. The core idea of the proposed method is to process the human joints with a multi-scale and multi-level U-shaped transformer model. We construct a multi-scale architecture with three different scales based on the human skeletal topology, in which local and global features are processed across the three scales under kinematic constraints. Furthermore, multi-level feature representations are introduced by fusing intermediate features from different depths of the U-shaped network. With skeletal-constrained pooling and unpooling operations devised for U-STN, the network can transform features across scales and extract meaningful semantic features at all levels. Experiments on two challenging benchmark datasets show that the proposed method achieves good performance on 2D-to-3D pose estimation. The code is available at https://github.com/l-fay/Pose3D. |
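The skeletal-constrained pooling and unpooling described in the abstract could be sketched as follows: joints are grouped by skeletal topology into coarser "part" nodes, features are averaged within each group, and unpooling broadcasts part features back to their joints. The specific joint grouping, number of joints, and feature dimension below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical grouping of 17 joints into 5 body parts (torso/head,
# arms, legs); the actual grouping used by U-STN is defined in the
# paper and may differ.
JOINT_GROUPS = [
    [0, 7, 8, 9, 10],  # torso and head
    [11, 12, 13],      # left arm
    [14, 15, 16],      # right arm
    [1, 2, 3],         # right leg
    [4, 5, 6],         # left leg
]

def skeletal_pool(features):
    """Pool per-joint features of shape (J, C) into per-part
    features of shape (P, C) by averaging within each group."""
    return np.stack([features[g].mean(axis=0) for g in JOINT_GROUPS])

def skeletal_unpool(part_features):
    """Broadcast per-part features (P, C) back to joints (J, C)."""
    n_joints = sum(len(g) for g in JOINT_GROUPS)
    out = np.zeros((n_joints, part_features.shape[1]))
    for part, group in zip(part_features, JOINT_GROUPS):
        out[group] = part
    return out

feats = np.random.randn(17, 64)   # 17 joints, 64-dim features
parts = skeletal_pool(feats)      # coarser scale: (5, 64)
restored = skeletal_unpool(parts) # back to joint scale: (17, 64)
```

A full U-shaped model would interleave such pooling/unpooling with transformer blocks at each scale and fuse the intermediate features via skip connections, as the abstract describes.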
Year | DOI | Venue |
---|---|---|
2022 | 10.1007/s00138-022-01334-6 | Machine Vision and Applications |
Keywords | DocType | Volume |
---|---|---|
Human pose estimation, Spatial–temporal transformer network, Multi-scale and multi-level feature representations | Journal | 33 |

Issue | ISSN | Citations |
---|---|---|
6 | 0932-8092 | 0 |

PageRank | References | Authors |
---|---|---|
0.34 | 5 | 4 |

Name | Order | Citations | PageRank |
---|---|---|---|
Yang Honghong | 1 | 0 | 0.34 |
Guo Longfei | 2 | 0 | 0.34 |
Yumei Zhang | 3 | 10 | 7.91 |
Xiaojun Wu | 4 | 356 | 52.89 |