Title
U-shaped spatial–temporal transformer network for 3D human pose estimation
Abstract
3D human pose estimation has achieved much progress with the development of convolutional neural networks. However, accurately estimating 3D joint locations from single-view images or videos remains challenging due to depth ambiguity and severe occlusion. Motivated by the effectiveness of vision transformers in computer vision tasks, we present a novel U-shaped spatial–temporal transformer-based network (U-STN) for 3D human pose estimation. The core idea of the proposed method is to process human joints with a multi-scale and multi-level U-shaped transformer model. We construct a multi-scale architecture with three different scales based on the human skeletal topology, in which local and global features are processed across the three scales under kinematic constraints. Furthermore, a multi-level feature representation is introduced by fusing intermediate features from different depths of the U-shaped network. With skeleton-constrained pooling and unpooling operations devised for U-STN, the network can transform features across different scales and extract meaningful semantic features at all levels. Experiments on two challenging benchmark datasets show that the proposed method achieves strong performance on 2D-to-3D pose estimation. The code is available at https://github.com/l-fay/Pose3D.
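To make the abstract's idea concrete, below is a minimal, illustrative PyTorch sketch of a U-shaped transformer over three skeletal scales with pooling/unpooling and skip-style multi-level fusion. The joint groupings, layer sizes, and all names (SkeletalPool, SkeletalUnpool, USTNSketch) are assumptions for illustration, and the temporal dimension is omitted for brevity; this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class SkeletalPool(nn.Module):
    """Average joint features within each (assumed) body-part group."""

    def __init__(self, groups):
        super().__init__()
        self.groups = groups  # list of joint-index lists, one per part

    def forward(self, x):  # x: (batch, joints, channels)
        return torch.stack([x[:, g, :].mean(dim=1) for g in self.groups], dim=1)


class SkeletalUnpool(nn.Module):
    """Broadcast each part feature back to its member joints."""

    def __init__(self, groups, num_joints):
        super().__init__()
        self.groups, self.num_joints = groups, num_joints

    def forward(self, x):  # x: (batch, parts, channels)
        out = x.new_zeros(x.size(0), self.num_joints, x.size(2))
        for p, g in enumerate(self.groups):
            out[:, g, :] = x[:, p : p + 1, :]
        return out


class USTNSketch(nn.Module):
    """U-shaped transformer over three skeletal scales (spatial part only)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        # Assumed groupings: 17 joints -> 5 parts -> 3 body regions.
        self.g1 = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10], [11, 12, 13], [14, 15, 16]]
        self.g2 = [[0, 1], [2], [3, 4]]
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.embed = nn.Linear(2, dim)          # lift 2D joint coordinates
        self.enc0, self.enc1, self.enc2 = layer(), layer(), layer()
        self.dec1, self.dec0 = layer(), layer()
        self.pool1, self.pool2 = SkeletalPool(self.g1), SkeletalPool(self.g2)
        self.unpool2 = SkeletalUnpool(self.g2, num_joints=5)
        self.unpool1 = SkeletalUnpool(self.g1, num_joints=17)
        self.head = nn.Linear(dim, 3)           # regress 3D coordinates

    def forward(self, joints_2d):  # (batch, 17, 2)
        f0 = self.enc0(self.embed(joints_2d))   # fine scale: individual joints
        f1 = self.enc1(self.pool1(f0))          # mid scale: body parts
        f2 = self.enc2(self.pool2(f1))          # coarse scale: body regions
        d1 = self.dec1(self.unpool2(f2) + f1)   # multi-level fusion via skips
        d0 = self.dec0(self.unpool1(d1) + f0)
        return self.head(d0)                    # (batch, 17, 3)


# Toy usage: a batch of two 17-joint 2D poses -> 3D joint locations.
pose_3d = USTNSketch()(torch.randn(2, 17, 2))
print(pose_3d.shape)  # torch.Size([2, 17, 3])
```

The additive skip connections stand in for the paper's multi-level feature fusion; a full spatial-temporal variant would additionally attend over a window of video frames.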
Year: 2022
DOI: 10.1007/s00138-022-01334-6
Venue: Machine Vision and Applications
Keywords: Human pose estimation, Spatial–temporal transformer network, Multi-scale and multi-level feature representations
DocType: Journal
Volume: 33
Issue: 6
ISSN: 0932-8092
Citations: 0
PageRank: 0.34
References: 5
Authors: 4
Name            Order   Citations   PageRank
Yang Honghong   1       0           0.34
Guo Longfei     2       0           0.34
Yumei Zhang     3       10          7.91
Xiaojun Wu      4       356         52.89