个人理解

**创新点：**提供了一种新的轨迹生成方式：输入短序列帧，得到所有目标对应的的全局轨迹。
**为什么：**已有的MOT算法大致根据 data association 的方式可分为 Local Tracker 和 Global Tracker
- Local Tracker 存在的缺点：缺乏整体轨迹的明确模型，当遇到严重的遮挡或强烈的外观变化时，该类算法的性能会大幅度下降。
- Global Tracker 存在的缺点：速度较慢，通常与检测器分离
本文算法的优缺点：
- 优点：
  - 当出现遮挡时，该类算法的鲁棒性较好
- 缺点
  - 当不同目标的外观非常的相似时，该类方法的性能应该会较的差

一、摘要

We present a novel transformer-based architecture for global multi-object tracking. Our network takes a short sequence of frames as input and produces global trajectories for all objects. The core component is a global tracking transformer that operates on objects from all frames in the sequence. The transformer encodes object features from all frames, and uses trajectory queries to group them into trajectories. The trajectory queries are object features from a single frame and naturally produce unique trajectories. Our global tracking transformer does not require intermediate pairwise grouping or combinatorial association, and can be jointly trained with an object detector. It achieves competitive performance on the popular MOT17 benchmark, with 75.3 MOTA and 59.1 HOTA. More importantly, our framework seamlessly integrates into stateof-the-art large-vocabulary detectors to track any objects. Experiments on the challenging TAO dataset show that our framework consistently improves upon baselines that are based on pairwise association, outperforming published work by a significant 7.7 tracking mAP.

提出了一种基于 Transfomer 的全局多目标跟踪框架
- **输入：**短时间的帧序列（短视频）
- **输出：**视频中所有对象的全局轨迹
该框架的核心组件是 **global tracking transformer：**它对输入的视频序列中所有帧的对象进行操作
- 对所有帧中的对象特征进行编码，并使用轨迹查询（trajectory queries）将它们分组，归入到每个轨迹中
- 轨迹查询（trajectory queries）是来自单个帧的对象特征，并且每个轨迹查询都自然地产生各自的轨迹
- 注意 global tracking transformer 不需要中间的成对分组或组合关联，并且可以与对象检测器联合训练