Title
Hierarchically Self-supervised Transformer for Human Skeleton Representation Learning.
Abstract
Despite the success of fully-supervised human skeleton sequence modeling, utilizing self-supervised pre-training for skeleton sequence representation learning has been an active research field because acquiring task-specific skeleton annotations at large scales is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Different from such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS), to explicitly capture spatial, short-term, and long-term temporal dependencies at frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability for different downstream tasks. The source code can be found at https://github.com/yuxiaochen1103/Hi-TRS.
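The abstract describes Hi-TRS as a hierarchy of Transformer encoders operating at the frame, clip, and video levels. Below is a minimal PyTorch sketch of that hierarchical structure for intuition only; it is not the authors' implementation (see the linked repository for that), and the class names, mean pooling, non-overlapping clip partitioning, and dimensions are illustrative assumptions.

# Minimal sketch (not the authors' code) of a hierarchical skeleton encoder
# in the spirit of Hi-TRS: a spatial Transformer over joints per frame,
# a short-term Transformer over frames within a clip, and a long-term
# Transformer over clip tokens. All names and hyperparameters are assumed.
import torch
import torch.nn as nn


class HierarchicalSkeletonEncoder(nn.Module):
    def __init__(self, num_joints=25, joint_dim=3, d_model=64,
                 clip_len=10, nhead=4, num_layers=2):
        super().__init__()
        self.clip_len = clip_len
        self.joint_embed = nn.Linear(joint_dim, d_model)

        def make_encoder():
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=num_layers)

        self.spatial_encoder = make_encoder()   # joints within a frame
        self.clip_encoder = make_encoder()      # frames within a clip
        self.video_encoder = make_encoder()     # clips within a video

    def forward(self, x):
        # x: (batch, frames, joints, joint_dim); in this simplified sketch
        # the number of frames must be divisible by clip_len.
        B, T, J, _ = x.shape
        x = self.joint_embed(x)                              # (B, T, J, d)

        # Frame level: attend over joints, pool to one token per frame.
        frames = self.spatial_encoder(x.reshape(B * T, J, -1)).mean(dim=1)
        frames = frames.reshape(B, T, -1)                    # (B, T, d)

        # Clip level: attend over frames inside each non-overlapping clip.
        n_clips = T // self.clip_len
        clips = frames.reshape(B * n_clips, self.clip_len, -1)
        clips = self.clip_encoder(clips).mean(dim=1)
        clips = clips.reshape(B, n_clips, -1)                # (B, n_clips, d)

        # Video level: attend over clip tokens, pool to a video embedding.
        return self.video_encoder(clips).mean(dim=1)         # (B, d)


if __name__ == "__main__":
    model = HierarchicalSkeletonEncoder()
    seq = torch.randn(2, 40, 25, 3)   # 2 sequences, 40 frames, 25 joints
    print(model(seq).shape)           # torch.Size([2, 64])

Note that this sketch covers only the hierarchical encoder; the paper's self-supervised pre-training objectives applied at each level are not reproduced here.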
Year
2022
DOI
10.1007/978-3-031-19809-0_11
Venue
European Conference on Computer Vision
Keywords
Skeleton representation learning, Self-supervised learning, Action recognition, Action detection, Motion prediction
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
8
Name | Order | Citations | PageRank
Yuxiao Chen | 1 | 0 | 0.34
Long Zhao | 2 | 30 | 6.23
Jianbo Yuan | 3 | 0 | 0.34
Yu Tian | 4 | 49 | 19.62
Zhaoyang Xia | 5 | 0 | 0.68
Shijie Geng | 6 | 28 | 6.62
Ligong Han | 7 | 5 | 2.44
Dimitris N. Metaxas | 8 | 8834 | 952.25