Abstract
Temporal action localization, detecting actions in untrimmed videos, is widely studied by anchor-based approaches that first generate excessive action proposals, *i.e.*, temporal windows, and then evaluate and classify these proposals. To reduce the number of action proposals, recent studies adopt an anchor-free approach that leverages each time point, rather than a temporal window, to represent an action instance. However, this point representation, usually modeled by temporal convolutions, may have a fixed and limited receptive field and thus fail to cover an entire action. We therefore propose an Actionness-guided Transformer (Ag-Trans) model to learn representations for each point proposal. Ag-Trans first predicts the actionness, *i.e.*, time sequences of the action starting, continuing, and ending phases; the corresponding action phase can then be embedded to model the point representation. Experimental results show that the Ag-Trans model outperforms the CNN-based model under the same experimental settings, especially for long-duration actions.
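The abstract outlines the pipeline at a high level: predict per-time-step actionness (starting/continuing/ending phases), embed the predicted phase, and let a transformer give each point a global receptive field. Below is a minimal sketch of that idea, assuming a PyTorch-style setup; all names (`AgTransSketch`, `actionness_head`, `phase_embed`, the soft phase-mixing step, and the hyperparameters) are illustrative assumptions, not the authors' published implementation.

```python
# Hypothetical sketch of an actionness-guided point representation.
# Assumption: the phase embedding is mixed softly by predicted phase
# probabilities; the paper may combine these components differently.
import torch
import torch.nn as nn

class AgTransSketch(nn.Module):
    def __init__(self, feat_dim=256, num_phases=3, num_classes=20):
        super().__init__()
        # Per-time-step actionness logits: starting / continuing / ending.
        self.actionness_head = nn.Linear(feat_dim, num_phases)
        # One learnable embedding vector per action phase.
        self.phase_embed = nn.Embedding(num_phases, feat_dim)
        # A transformer encoder gives every point a global receptive
        # field, unlike fixed-kernel temporal convolutions.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Point-level classifier over action categories (+1 background).
        self.cls_head = nn.Linear(feat_dim, num_classes + 1)

    def forward(self, feats):  # feats: (B, T, feat_dim) snippet features
        phase_logits = self.actionness_head(feats)          # (B, T, 3)
        phase_probs = phase_logits.softmax(dim=-1)
        # Soft phase embedding: mix phase vectors by predicted probability.
        phase_vec = phase_probs @ self.phase_embed.weight   # (B, T, feat_dim)
        points = self.encoder(feats + phase_vec)            # (B, T, feat_dim)
        return self.cls_head(points), phase_logits

# Usage: 128 snippet features from one untrimmed video.
model = AgTransSketch()
cls_logits, phase_logits = model(torch.randn(1, 128, 256))
print(cls_logits.shape, phase_logits.shape)  # (1, 128, 21) (1, 128, 3)
```

The design choice the abstract motivates is visible in the last two lines of `forward`: each time point attends over the whole sequence, so long-duration actions are not bounded by a convolutional kernel size.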
Year | DOI | Venue
---|---|---
2022 | 10.1109/LSP.2021.3132287 | IEEE Signal Processing Letters

Keywords | DocType | Volume
---|---|---
Temporal action localization, anchor-free, transformer | Journal | 29

ISSN | Citations | PageRank
---|---|---
1070-9908 | 0 | 0.34

References | Authors
---|---
8 | 4
Name | Order | Citations | PageRank |
---|---|---|---
Peisen Zhao | 1 | 0 | 2.03 |
Ling-Xi Xie | 2 | 429 | 37.79 |
Ya Zhang | 3 | 1340 | 91.72 |
Qi Tian | 4 | 6443 | 331.75 |