Authors: Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, Yichen Wei

**Affiliations:** Megvii Technology, Shanghai Jiao Tong University

Published: 2021

Venue: ECCV 2022

Publisher:

**Paper title:** MOTR: End-to-End Multiple-Object Tracking with Transformer

Paper link:


Code:

https://github.com/megvii-research/MOTR

Significance:

Personal notes

1. Abstract

Temporal modeling of objects is a key challenge in multiple-object tracking (MOT). Existing methods track by associating detections through motion-based and appearance-based similarity heuristics. The post-processing nature of association prevents end-to-end exploitation of temporal variations in the video sequence.

In this paper, we propose MOTR, which extends DETR [6] and introduces "track query" to model the tracked instances in the entire video. Track query is transferred and updated frame-by-frame to perform iterative prediction over time. We propose tracklet-aware label assignment to train track queries and newborn object queries. We further propose temporal aggregation network and collective average loss to enhance temporal relation modeling. Experimental results on DanceTrack show that MOTR significantly outperforms the state-of-the-art method, ByteTrack [42], by 6.5% on the HOTA metric. On MOT17, MOTR outperforms our concurrent works, TrackFormer [18] and TransTrack [29], on association performance. MOTR can serve as a stronger baseline for future research on temporal modeling and Transformer-based trackers.
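To make the frame-by-frame track-query idea concrete, here is a minimal sketch of how such a loop could look: track queries carried over from the previous frame are concatenated with learned detect queries for newborn objects, decoded against the current frame's features in a DETR-style decoder, and the surviving queries become the track queries for the next frame. This is my own paraphrase, not the official MOTR implementation; all names (`TrackQueryLoop`, `detect_queries`, `keep_thresh`, etc.) are hypothetical placeholders, and components such as tracklet-aware label assignment, the temporal aggregation network, and collective average loss are omitted.

```python
# Minimal sketch of the frame-by-frame track-query loop (my paraphrase,
# NOT the official MOTR code). All names are hypothetical placeholders.
import torch
import torch.nn as nn

class TrackQueryLoop(nn.Module):
    def __init__(self, dim=256, num_detect_queries=300):
        super().__init__()
        # A plain Transformer decoder stands in for MOTR's DETR-style decoder.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Learned queries responsible for detecting newborn objects in each frame.
        self.detect_queries = nn.Parameter(torch.randn(num_detect_queries, dim))
        self.cls_head = nn.Linear(dim, 1)   # objectness score per query
        self.box_head = nn.Linear(dim, 4)   # box regression per query

    def forward(self, frame_features_list, keep_thresh=0.5):
        """frame_features_list: list of (num_tokens, dim) encoder outputs, one per frame."""
        dim = self.detect_queries.shape[1]
        # No track queries exist before the first frame.
        track_queries = torch.zeros(0, dim, device=self.detect_queries.device)
        outputs = []
        for feats in frame_features_list:
            # Concatenate carried-over track queries with newborn detect queries.
            queries = torch.cat([track_queries, self.detect_queries], dim=0)
            hidden = self.decoder(queries.unsqueeze(0), feats.unsqueeze(0)).squeeze(0)
            scores = self.cls_head(hidden).sigmoid().squeeze(-1)
            boxes = self.box_head(hidden).sigmoid()
            outputs.append((scores, boxes))
            # Queries whose score exceeds a threshold survive as track queries
            # for the next frame, giving iterative prediction over time.
            track_queries = hidden[scores > keep_thresh].detach()
        return outputs
```

The point of the sketch is only the data flow: identity is preserved implicitly because each tracked object keeps its own query across frames, so no post-hoc association step is needed.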

2. Inspired by DETR
