Mixing Tokens from Target and Search Regions for Visual Object Tracking

Xue Wanli; Zhang Zhibin; Pei Shenglei; Zhang Kaihua; Chen Shengyong

doi:10.7544/issn1000-1239.202220698

Journal of Computer Research and Development, 2024, 61(2): 460-469. CSTR: 32373.14.issn1000-1239.202220698

Citation: Xue Wanli, Zhang Zhibin, Pei Shenglei, Zhang Kaihua, Chen Shengyong. Mixing Tokens from Target and Search Regions for Visual Object Tracking[J]. Journal of Computer Research and Development, 2024, 61(2): 460-469. DOI: 10.7544/issn1000-1239.202220698


1. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384
2. College of Physics and Electronic Information Engineering, Qinghai Minzu University, Xining 810007
3. School of Computer Science, Nanjing University of Information Science & Technology, Nanjing 130012

Funds: This work was supported by the National Natural Science Foundation of China (62376197, 61906135, 61876088, 92048301, 62020106004) and the 333 High-level Talents Cultivation of Jiangsu Province (BRA2020291).


Author Bio:
Xue Wanli: born in 1986. PhD, associate professor, master's supervisor. Member of CCF and CSIG. His main research interests include visual tracking, sign language recognition, and image stitching.

Zhang Zhibin: born in 1996. PhD candidate. His main research interests include visual object tracking and deep learning.

Pei Shenglei: born in 1980. PhD, professor, master's supervisor. His main research interests include machine learning, data mining, and intelligent decision systems.

Zhang Kaihua: born in 1983. PhD, professor. His main research interests include video object segmentation and visual object tracking.

Chen Shengyong: born in 1973. PhD, professor. His main research interests include computer vision and machine learning.
Received Date: August 07, 2022
Revised Date: March 12, 2023
Available Online: November 09, 2023

Abstract

There are three problems with feature extraction and fusion in current mainstream Transformer-based tracking frameworks: (1) Feature extraction and feature fusion are performed by two separate modules, which tends to produce sub-optimal training results. (2) Self-attention has a computational complexity of $O(N^2)$, which reduces tracking efficiency. (3) The target template selection strategy is simple and struggles to adapt to drastic changes in target appearance during tracking. We propose a novel Transformer tracking framework that uses the fast Fourier transform to mix target tokens and search region tokens. For problem 1, an efficient end-to-end approach is proposed that extracts and fuses features in a unified learning process to obtain an optimal model. For problem 2, the fast Fourier transform is used to achieve complete information interaction between the target tokens and the search region tokens; the computational complexity of this operation is $O(N \log N)$, which greatly improves tracking efficiency. For problem 3, a template memory storage mechanism based on quality assessment is proposed, which can quickly adapt to drastic changes in target appearance. Compared with current state-of-the-art algorithms on three datasets, LaSOT, OTB100, and UAV123, our tracker achieves better performance in both efficiency and accuracy.
Keywords: Transformer, fast Fourier transform, feature extraction, feature fusion, object tracking
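
The abstract does not give the layer design, so the following is only a minimal sketch of how FFT-based token mixing between target and search region tokens might look, written in the spirit of global filter layers. The function name `fft_mix_tokens`, the argument `global_filter` (standing in for learnable frequency-domain weights), and the choice to concatenate the two token sets along the token axis are all assumptions for illustration, not the authors' implementation. The point it shows is that one FFT along the token axis lets every output token depend on every input token at $O(N \log N)$ cost, instead of the $O(N^2)$ pairwise comparisons of self-attention.

```python
import numpy as np


def fft_mix_tokens(target_tokens, search_tokens, global_filter):
    """Mix target and search tokens in the frequency domain (illustrative sketch).

    target_tokens: real array of shape (N_t, C)
    search_tokens: real array of shape (N_s, C)
    global_filter: complex array of shape ((N_t + N_s) // 2 + 1, C);
                   a hypothetical stand-in for learnable weights.
    Returns the mixed target tokens (N_t, C) and mixed search tokens (N_s, C).
    """
    n_target = target_tokens.shape[0]

    # Assumption: both token sets are joined into one sequence of N tokens.
    tokens = np.concatenate([target_tokens, search_tokens], axis=0)
    n_total = tokens.shape[0]

    # Real FFT along the token axis: O(N log N) per channel.
    spectrum = np.fft.rfft(tokens, axis=0)

    # Element-wise filtering in the frequency domain corresponds to a
    # circular convolution over all tokens in the token domain.
    spectrum = spectrum * global_filter

    # Back to the token domain; n is given so the original length is restored.
    mixed = np.fft.irfft(spectrum, n=n_total, axis=0)

    return mixed[:n_target], mixed[n_target:]
```

With an all-ones filter the layer is the identity, and with a filter that keeps only the zero-frequency component every output token becomes the mean of all target and search tokens, which makes the global interaction easy to check by hand.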

