Mixing Tokens from Target and Search Regions for Visual Object Tracking

薛万利; 张智彬; 裴生雷; 张开华; 陈胜勇

doi:10.7544/issn1000-1239.202220698

Abstract

Mainstream Transformer-based tracking frameworks have three problems in feature extraction and fusion: (1) feature extraction and fusion are performed as separate modules, which tends to yield sub-optimal training results; (2) self-attention, with computational complexity O\left(N^2\right), reduces tracking efficiency; (3) simple target template selection strategies struggle to adapt to drastic changes in target appearance during tracking. To address these problems, we propose a novel Transformer-based visual object tracking framework that uses the fast Fourier transform to mix target tokens and search region tokens. For problem 1, an efficient end-to-end approach unifies the learning of feature extraction and fusion to obtain an optimal model. For problem 2, the fast Fourier transform provides complete information interaction between target tokens and search region tokens; its computational complexity is O\left(N\log N\right), which improves tracking efficiency. For problem 3, a target template memory storage mechanism based on tracking quality assessment is proposed to adapt quickly to drastic changes in target appearance. On three benchmark datasets, LaSOT, OTB100, and UAV123, the proposed tracker outperforms current state-of-the-art methods in both efficiency and accuracy.
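To make the idea of Fourier-based token mixing concrete, the following is a minimal sketch, not the authors' implementation. It assumes an FNet-style mixing step: target and search region tokens are concatenated into one sequence, a 2D FFT is applied over the token and channel axes, and only the real part is kept. The function name `fourier_mix`, the token counts, and the channel width are all hypothetical choices made for illustration.

```python
import numpy as np


def fourier_mix(target_tokens, search_tokens):
    """Mix target and search tokens with a 2D FFT (FNet-style assumption).

    Both inputs have shape (num_tokens, channels). Every output token
    depends on every input token from both regions.
    """
    # Concatenate along the token axis so both regions share one sequence.
    tokens = np.concatenate([target_tokens, search_tokens], axis=0)
    # FFT over the token axis and the channel axis; keep the real part.
    mixed = np.fft.fft2(tokens).real
    # Split the mixed sequence back into target and search parts.
    n_t = target_tokens.shape[0]
    return mixed[:n_t], mixed[n_t:]


rng = np.random.default_rng(0)
z = rng.standard_normal((64, 32))   # hypothetical: 64 target tokens, 32 channels
x = rng.standard_normal((256, 32))  # hypothetical: 256 search region tokens
z_mixed, x_mixed = fourier_mix(z, x)
```

Because the transform over the token axis is computed with an FFT, the mixing cost along a sequence of N tokens grows as O\left(N\log N\right) rather than the O\left(N^2\right) of pairwise self-attention, and the step has no learned parameters.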
