Personal Understanding

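**Why:** Despite their success, existing MOT methods share common problems: global or local inconsistency, a poor trade-off between robustness and model complexity, and a lack of flexibility across different scenes within the same video.
**Innovation:** The paper designs a consistent tracker that performs tracking implicitly by predicting and associating the same objects across two adjacent frames of a video sequence.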

I. Abstract

Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, poor trade-off between robustness and model complexity, and lack of flexibility in different scenes within the same video. In this paper, we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker’s effectiveness, enabling it to discriminate between various objects. During the training stage, paired object boxes diffuse from paired ground-truth boxes to random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods.

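Despite their success, existing MOT methods suffer from common problems, such as harmful global or local inconsistency, a poor trade-off between robustness and model complexity, and a lack of flexibility across different scenes within the same video.
This paper proposes a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes.
- During training, paired object boxes diffuse from the paired ground-truth boxes to a random distribution, and the model learns detection and tracking simultaneously by reversing this noising process.
- During inference, the model refines a set of paired, randomly generated boxes into detection and tracking results through a flexible one-step or multi-step denoising diffusion process.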
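
The training-time noising of paired boxes can be illustrated with a minimal sketch. It assumes the standard DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, applied to boxes stored as an array of shape (N, 2, 4): N objects, two adjacent frames, and normalized (cx, cy, w, h) coordinates. The cosine schedule, the box layout, and the names `cosine_alpha_bar` and `diffuse_paired_boxes` are illustrative assumptions, not details taken from the paper, and the denoising network that reverses the process is not shown.

```python
import numpy as np

def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal-retention schedule alpha_bar[0..num_steps].

    The cosine form is an assumption made for this sketch.
    """
    t = np.linspace(0, num_steps, num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # normalized so that alpha_bar[0] == 1

def diffuse_paired_boxes(paired_gt, t, alpha_bar, rng):
    """Add Gaussian noise to paired ground-truth boxes at timestep t.

    paired_gt: array of shape (N, 2, 4), one box per object in each of
    two adjacent frames (hypothetical layout for illustration).
    Returns the noisy paired boxes and the noise that was added.
    """
    noise = rng.standard_normal(paired_gt.shape)
    a = alpha_bar[t]
    noisy = np.sqrt(a) * paired_gt + np.sqrt(1.0 - a) * noise
    return noisy, noise

# Toy example: 3 objects, each with a box in frame t-1 and frame t.
rng = np.random.default_rng(0)
num_steps = 1000
alpha_bar = cosine_alpha_bar(num_steps)
paired_gt = rng.uniform(0.2, 0.8, size=(3, 2, 4))

# t = 0 leaves the boxes untouched; t = num_steps yields (almost) pure noise.
clean, _ = diffuse_paired_boxes(paired_gt, 0, alpha_bar, rng)
noisy, eps = diffuse_paired_boxes(paired_gt, num_steps, alpha_bar, rng)
```

Because both boxes of a pair share one array and one timestep, they are noised together, which is one simple way to keep the pair tied to the same object throughout the process. At inference, a trained model would start from boxes shaped like `noisy` and refine them in one or several steps.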

II. Method