Abstract | ||
---|---|---|
Recent progress in multiple object tracking (MOT) has shown that a robust similarity score is key to the success of trackers. A good similarity score is expected to reflect multiple cues, e.g., appearance, location, and topology, over a long period of time. However, these cues are heterogeneous, making them hard to combine in a unified network. As a result, existing methods usually encode them in separate networks or require a complex training approach. In this paper, we present a unified framework for similarity measurement between a tracklet and an object, which simultaneously encodes various cues across time. We show that a crucial principle for achieving this unified framework is the design of compatible feature representations for different cues and different sources (tracklet and object). A key technique behind this principle is a spatial-temporal relation module, which jointly models appearance and topology and makes tracklet and object features compatible. The resulting method, named spatial-temporal relation networks (STRN), runs in a feed-forward way and can be trained in an end-to-end manner. State-of-the-art accuracy is achieved on all of the MOT15-17 benchmarks using public detection and online settings. |
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/ICCV.2019.00409 | 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) |
Field | DocType | Volume |
---|---|---|
BitTorrent tracker, ENCODE, Pattern recognition, Computer science, Video tracking, Artificial intelligence | Journal | abs/1904.11489 |
Issue | ISSN | Citations |
---|---|---|
1 | 1550-5499 | 8 |
PageRank | References | Authors |
---|---|---|
0.44 | 0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jiarui Xu | 1 | 9 | 2.48 |
Yue Cao | 2 | 574 | 21.49 |
Zheng Zhang | 3 | 436 | 15.48 |
Han Hu | 4 | 179 | 5.53 |