Xue Wanli, Zhang Zhibin, Pei Shenglei, Zhang Kaihua, Chen Shengyong. Mixing Tokens from Target and Search Regions for Visual Object Tracking[J]. Journal of Computer Research and Development, 2024, 61(2): 460-469. DOI: 10.7544/issn1000-1239.202220698

Mixing Tokens from Target and Search Regions for Visual Object Tracking

Funds: This work was supported by the National Natural Science Foundation of China (62376197, 61906135, 61876088, 92048301, 62020106004) and the 333 High-level Talents Cultivation of Jiangsu Province (BRA2020291).
More Information
  • Author Bio:

    Xue Wanli: born in 1986. PhD, associate professor, master supervisor. Member of CCF and CSIG. His main research interests include visual tracking, sign language recognition, and image stitching

    Zhang Zhibin: born in 1996. PhD candidate. His main research interests include visual object tracking and deep learning

    Pei Shenglei: born in 1980. PhD, professor, master supervisor. His main research interests include machine learning, data mining, and intelligent decision systems

    Zhang Kaihua: born in 1983. PhD, professor. His main research interests include video object segmentation and visual object tracking

    Chen Shengyong: born in 1973. PhD, professor. His main research interests include computer vision and machine learning

  • Received Date: August 07, 2022
  • Revised Date: March 12, 2023
  • Available Online: November 09, 2023
  • Abstract: Current mainstream Transformer-based tracking frameworks suffer from three problems in feature extraction and fusion: 1) the feature-extraction and feature-fusion modules are trained separately, which easily yields sub-optimal models; 2) the $O(N^2)$ computational complexity of self-attention reduces tracking efficiency; 3) the target-template selection strategy is simplistic and can hardly adapt to drastic changes in target appearance during tracking. We propose a novel Transformer tracking framework that uses the fast Fourier transform (FFT) to mix target tokens and search-region tokens. For problem 1, an efficient end-to-end approach is proposed that extracts and fuses features in a unified learning process to obtain an optimal model. For problem 2, the FFT is used to achieve complete information interaction between the target tokens and the search-region tokens; the computational complexity of this operation is $O(N\log N)$, which greatly improves tracking efficiency. For problem 3, a template memory storage mechanism based on quality assessment is proposed, which quickly adapts to drastic changes in target appearance. Compared with current state-of-the-art algorithms on the LaSOT, OTB100, and UAV123 datasets, our tracker achieves better performance in both efficiency and accuracy.
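To make the $O(N\log N)$ token-mixing idea concrete, the following is a minimal PyTorch sketch of FFT-based mixing in the spirit of global-filter-style networks: the target tokens and search-region tokens are concatenated, transformed with a real FFT along the token axis, multiplied by a learnable complex filter, and transformed back. The class name, tensor shapes, and filter parameterization are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class FFTTokenMixer(nn.Module):
    """Frequency-domain token mixing over the concatenated target and
    search-region tokens (illustrative sketch). A learnable filter is
    applied to the spectrum, so every token interacts with every other
    token in O(N log N) rather than the O(N^2) of self-attention."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # Learnable complex-valued filter; rfft keeps N//2 + 1 frequency bins.
        self.filter = nn.Parameter(torch.randn(num_tokens // 2 + 1, dim, 2) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, N, dim), where N = target tokens + search-region tokens.
        spec = torch.fft.rfft(tokens, dim=1, norm="ortho")   # to frequency domain
        spec = spec * torch.view_as_complex(self.filter)     # element-wise filtering
        return torch.fft.irfft(spec, n=tokens.size(1), dim=1, norm="ortho")


# Example: 64 target tokens concatenated with 256 search-region tokens.
mixer = FFTTokenMixer(num_tokens=320, dim=256)
tokens = torch.randn(2, 320, 256)
mixed = mixer(tokens)   # (2, 320, 256): both regions now share information
```

Because the filtering is a single element-wise product in the frequency domain, the cost of the interaction is dominated by the FFT and inverse FFT, which is how the framework avoids the quadratic cost of pairwise attention between all target and search-region tokens.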

  • [1]
    李玺,查宇飞,张天柱,等. 深度学习的目标跟踪算法综述[J]. 中国图象图形学报,2019,24(12):2057−2080

    Li Xi, Cha Yufei, Zhang Tianzhu, et al. Survey of visual object tracking algorithms based on deep learning[J]. Journal of Image and Graphics, 2019, 24(12): 2057−2080 (in Chinese)
    [2]
    柳培忠,汪鸿翔,骆炎民,等. 一种结合时空上下文的在线卷积网络跟踪算法[J]. 计算机研究与发展,2018,55(12):2785−2793

    Liu Peizhong, Wang Hongxiang, Luo Yanmin, et al. Visual tracking algorithm based on adaptive spatial regularization[J]. Journal of Computer Research and Development, 2018, 55(12): 2785−2793 (in Chinese)
    [3]
    Rao Yongming, Zhao Wenliang, Zhu Zheng, et al. Global filter networks for image classification[C] //Proc of the 35th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2021: 980−993
    [4]
    Zhang Zhipeng, Peng Houwen. Deeper and wider Siamese networks for real-time visual tracking[C] //Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 4591−4600
    [5]
    Li Bo, Wu Wei, Wang Qiang, et al. Evolution of Siamese visual tracking with very deep networks[C] //Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 16−20
    [6]
    Li Bo, Yan Junjie, Wu Wei, et al. High performance visual tracking with Siamese region proposal network[C] //Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 8971−8980
    [7]
    Bertinetto L, Valmadre J, Henriques J F, et al. Fully-convolutional siamese networks for object tracking[C] //Proc of the 14th European Conf on Computer Vision. Berlin: Springer, 2016: 850−865
    [8]
    Bhat G, Danelljan M, Gool L V, et al. Learning discriminative model prediction for tracking[C] //Proc of the 17th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 6182−6191
    [9]
    Danelljan M, Gool L V, Timofte R. Probabilistic regression for visual tracking[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 7183−7192
    [10]
    Danelljan M, Bhat G, Khan F S, et al. Atom: Accurate tracking by overlap maximization[C] //Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 4660−4669
    [11]
    Yan Bin, Peng Houwen, Fu Jianlong, et al. Learning spatio-temporal transformer for visual tracking[C] //Proc of the 18th IEEE/CVF Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 10448-10457
    [12]
    Chen Xin, Yan Bin, Zhu Jiawen, et al. Transformer tracking[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 8126−8135
    [13]
    Wang Ning, Zhou Wengang, Wang Jie, et al. Transformer meets tracker: Exploiting temporal context for robust visual tracking[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 1571−1580
    [14]
    Lin Liting, Fan Heng, Xu Yong, et al. SwinTrack: A simple and strong baseline for transformer tracking[J]. arXiv preprint, arXiv: 2112. 00995, 2021
    [15]
    Cui Yutao, Cheng Jiang, Wang Liming, et al. Mixformer: End-to-end tracking with iterative mixed attention[C] //Proc of the 35th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 13608−13618
    [16]
    Tolstikhin I, Houlsby N, Kolesnikov A, et al. MLP-mixer: An all-MLP architecture for vision[J]. arXiv preprint, arXiv: 2105. 01601, 2021
    [17]
    Touvron H, Bojanowski P, Caron M, et al. ResMLP: Feedforward networks for image classification with data-efficient training[J]. arXiv preprint, arXiv: 2105. 03404, 2021
    [18]
    Fan Heng, Lin Liting, Yang Fan, et al. LaSOT: A high-quality benchmark for large-scale single object tracking[C] //Proc of the 32nd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 5374−5383
    [19]
    Wu Yi, Lim J, Yang Ming-Hsuan. Online object tracking: A benchmark[C] //Proc of the 26th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2013: 2411−2418
    [20]
    Mueller M, Smith N, Ghanem B. A benchmark and simulator for UAV tracking[C] //Proc of the 14th European Conf on Computer Vision. Berlin: Springer, 2016: 445−461
    [21]
    Guo Qing, Feng Wei, Zhou Ce, et al. Learning dynamic siamese network for visual object tracking[C] //Proc of the 16th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 1763−1771
    [22]
    Liu Ze, Lin Yutong, Cao Yue, et al. Swin Transformer: hierarchical vision transformer using shifted windows[C] // Proc of the 18th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 10012−10022
    [23]
    Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint, arXiv: 2010. 11929, 2020
    [24]
    Wu Haiping, Xiao Bin, Codella N, et al. CVT: Introducing convolutions to vision transformers[C] // Proc of the 18th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2021: 22-31
    [25]
    He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep residual learning for image recognition[C] //Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 770−778
    [26]
    Mayer C, Danelljan M, Bhat G, et al. Transforming model prediction for tracking[J]. arXiv preprint, arXiv: 2203. 11192, 2022
    [27]
    Liu Hanxiao, Dai Zihang, So D, et al. Pay attention to MLPs[C] //Proc of the 35th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2021: 9204−9215
    [28]
    Bolme D S, Beveridge J R, Draper B A, et al. Visual object tracking using adaptive correlation filters[C] //Proc of the 23rd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2010: 2544−2550
    [29]
    Wang Mengmeng, Liu Yong, Huang Zeyi. Large margin object tracking with circulant feature maps[C] //Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 4021−4029
    [30]
    Huang Lianghua, Zhao Xin, Huang Kaiqi. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(5): 1562−1577
    [31]
    Muller M, Bibi A, Giancola S, et al. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild[C] //Proc of the 15th European Conf on Computer Vision. Berlin: Springer, 2018: 300−317
    [32]
    Kingma D P, Jimmy B. Adam: A method for stochastic optimization[J]. arXiv preprint, arXiv: 1412. 6980, 2014
    [33]
    Yan Bin, Zhang Xinyu, Wang Dong, et al. Alpha-Refine: Boosting tracking performance by precise bounding box estimation[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 5289−5298
    [34]
    Voigtlaender P, Luiten J, Torr P H S, et al. SiamR-CNN: Visual tracking by re-detection[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 6578−6588
    [35]
    Guo Dongyan, Shao Yanyan, Cui Ying, et al. Graph attention tracking[C] //Proc of the 34th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 9543−9552
    [36]
    Chen Zedu, Zhong Bineng, Li Guorong, et al. Siamese box adaptive network for visual tracking[C] //Proc of the 33rd IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2020: 6668−6677
    [37]
    Danelljan M, Bhat G, Shahbaz K F, et al. ECO: Efficient convolution operators for tracking[C] //Proc of the 30th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2017: 6638−6646
    [38]
    Nam H, Han B. Learning multi-domain convolutional neural networks for visual tracking[C] //Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 4293−4302