Citation: Xue Wanli, Zhang Zhibin, Pei Shenglei, Zhang Kaihua, Chen Shengyong. Mixing Tokens from Target and Search Regions for Visual Object Tracking[J]. Journal of Computer Research and Development, 2024, 61(2): 460-469. DOI: 10.7544/issn1000-1239.202220698
Current mainstream Transformer-based tracking frameworks have three problems in feature extraction and fusion: (1) feature extraction and feature fusion are performed by two separate modules, which tends to produce sub-optimal training results; (2) self-attention has a computational complexity of O(N^2), which reduces tracking efficiency; (3) the target template selection strategy is simple and adapts poorly to drastic changes in target appearance during tracking. We propose a novel Transformer tracking framework that uses the fast Fourier transform (FFT) to mix target tokens and search region tokens. For problem 1, an efficient end-to-end approach extracts and fuses features in a unified learning process to obtain an optimal model. For problem 2, the FFT is used to achieve complete information interaction between the target tokens and the search region tokens; this operation has a computational complexity of O(N log N), which greatly improves tracking efficiency. For problem 3, a template memory storage mechanism based on quality assessment is proposed, which can quickly adapt to drastic changes in target appearance. Compared with current state-of-the-art algorithms on three datasets, LaSOT, OTB100, and UAV123, our tracker achieves better performance in both efficiency and accuracy.
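The token-mixing idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name fft_mix_tokens, the optional freq_filter argument, and the token shapes are all assumptions made for illustration. The sketch concatenates target and search region tokens, applies an FFT along the token axis, optionally scales the spectrum element-wise, and transforms back, so every output token depends on every input token at O(N log N) cost per channel.

```python
import numpy as np


def fft_mix_tokens(target_tokens, search_tokens, freq_filter=None):
    """Globally mix target and search tokens via an FFT along the token axis.

    target_tokens: array of shape (N_t, C)
    search_tokens: array of shape (N_s, C)
    freq_filter:   optional array broadcastable to (N_t + N_s, C); a
                   hypothetical stand-in for a learned frequency-domain filter.
    Returns the mixed target tokens and mixed search tokens.
    """
    n_t = target_tokens.shape[0]
    # Treat both token sets as one sequence of length N = N_t + N_s.
    tokens = np.concatenate([target_tokens, search_tokens], axis=0)
    # The FFT over the token axis costs O(N log N) per channel.
    spectrum = np.fft.fft(tokens, axis=0)
    if freq_filter is not None:
        # Element-wise scaling in the frequency domain corresponds to a
        # circular convolution over all tokens in the token domain.
        spectrum = spectrum * freq_filter
    # Keep the real part so the output has the same dtype family as the input.
    mixed = np.fft.ifft(spectrum, axis=0).real
    # Split the sequence back into target and search parts.
    return mixed[:n_t], mixed[n_t:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal((4, 8))    # 4 target tokens, 8 channels
    search = rng.standard_normal((12, 8))   # 12 search region tokens
    # A filter that keeps only the zero-frequency component replaces every
    # token with the mean over all target and search tokens.
    dc_only = np.zeros((16, 1))
    dc_only[0] = 1.0
    mixed_target, mixed_search = fft_mix_tokens(target, search, dc_only)
    print(mixed_target.shape, mixed_search.shape)
```

With no filter the function returns the inputs unchanged, which is a convenient sanity check. The zero-frequency-only filter is the extreme case of global mixing: each output token becomes the average of all target and search region tokens.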