Abstract | ||
---|---|---|
Action Quality Assessment (AQA) is important for action understanding, and the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits their ability to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer that decomposes the holistic feature into temporal part-level representations. Specifically, we utilize a set of learnable queries to represent the atomic temporal patterns for a specific action. Our decoding process converts the frame representations to a fixed number of temporally ordered part representations. To obtain the quality score, we adopt state-of-the-art contrastive regression based on the part representations. Since existing AQA datasets do not provide temporal part-level labels or partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss that ensures the learnable queries satisfy the temporal order in cross attention, and a sparsity loss that encourages the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin. |
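
The two attention losses described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the function names are hypothetical, the "attention center" (expected normalized frame position of each query's attention) is one plausible way to compare queries in time, the ranking loss is written as a hinge on consecutive centers, and the sparsity loss as the attention-weighted distance from each center. It is not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, frames):
    """Scaled dot-product attention of each query over the frames (softmax along time)."""
    scale = 1.0 / math.sqrt(len(frames[0]))
    return [
        softmax([scale * sum(qi * fi for qi, fi in zip(q, f)) for f in frames])
        for q in queries
    ]


def frame_positions(num_frames):
    """Frame indices normalized to [0, 1]; requires at least two frames."""
    return [t / (num_frames - 1) for t in range(num_frames)]


def attention_centers(attn):
    """Assumed definition: expected normalized frame position under each query's attention."""
    positions = frame_positions(len(attn[0]))
    return [sum(a * p for a, p in zip(row, positions)) for row in attn]


def ranking_loss(centers, margin=0.0):
    """Hinge penalty whenever query k is centered later than query k + 1 (assumed form)."""
    return sum(
        max(0.0, centers[k] - centers[k + 1] + margin)
        for k in range(len(centers) - 1)
    )


def sparsity_loss(attn, centers):
    """Mean attention-weighted distance to each query's center; zero for one-hot attention."""
    positions = frame_positions(len(attn[0]))
    total = sum(
        sum(a * abs(p - c) for a, p in zip(row, positions))
        for row, c in zip(attn, centers)
    )
    return total / len(attn)


# Toy example: three queries whose attention is already ordered and one-hot.
ordered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centers = attention_centers(ordered)
print(ranking_loss(centers), sparsity_loss(ordered, centers))
```

Under these assumptions, attention maps that follow the query order and concentrate on a few frames incur no penalty, while out-of-order or diffuse maps are penalized, which matches the stated purpose of the two losses.
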
Year | DOI | Venue |
---|---|---|
2022 | 10.1007/978-3-031-19772-7_25 | European Conference on Computer Vision |
Keywords | DocType | Citations |
---|---|---|
Action quality assessment, Temporal parsing transformer, Temporal patterns, Contrastive regression | Conference | 0 |
PageRank | References | Authors |
---|---|---|
0.34 | 0 | 8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yang Bai | 1 | 68 | 24.51 |
Desen Zhou | 2 | 0 | 0.34 |
Songyang Zhang | 3 | 0 | 0.34 |
Jian Wang | 4 | 7 | 6.40 |
Er-rui Ding | 5 | 142 | 29.31 |
Yu Guan | 6 | 195 | 22.59 |
Long Yang | 7 | 114 | 15.79 |
Jingdong Wang | 8 | 0 | 1.35 |