Abstract:
Given the risk that adversarial attacks pose to tracking models and the lack of dedicated adversarial detection methods, this paper addresses the problem from the perspective of the frequency domain. Building on the visual imperceptibility of perturbation noise, we first prove theoretically that such noise is concentrated in the mid-to-high frequency bands of images. We then show quantitatively that the low-frequency components of a video sequence contribute the most to tracking performance and are the least affected by adversarial attacks. Based on this theoretical proof and quantitative analysis, we propose a detection framework built on the difference in tracking performance across frequency bands. In this framework, a frequency domain decomposition module extracts the low-frequency components of the video sequence. The target tracker and its mirror tracker, which share the same structure and parameters, take the full-frequency and low-frequency components of the video sequence as input, respectively. A discriminator module determines whether the input video sequence is adversarial by comparing the outputs of the two trackers. The framework uses the tracker itself as a carrier and requires no adversarial training: detection relies only on comparing tracking performance across frequency bands. Extensive experiments show that the framework effectively detects current mainstream adversarial attacks, such as CSA, TTP, and Spark, with a detection precision of 97.55%, while having little negative impact on the original tracking performance of the tracker. In addition, the framework generalizes and can be flexibly integrated into multiple trackers, such as SiamRPNpp, SiamMask, SiamCAR, and SiamBAN.
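The two-tracker detection pipeline summarized in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the low-frequency components are extracted with an ideal circular low-pass mask in the 2D FFT domain, the discriminator compares the two trackers by the mean IoU of their predicted boxes, and the names `low_pass`, `iou`, `is_adversarial`, `radius`, and `iou_threshold` are hypothetical. The trackers are passed in as plain callables that map a grayscale frame to a box `(x1, y1, x2, y2)`.

```python
import numpy as np


def low_pass(frame, radius):
    """Keep only spatial frequencies within `radius` of DC.

    Assumption: an ideal circular mask in the shifted FFT domain stands in
    for the frequency domain decomposition module.
    """
    h, w = frame.shape
    spectrum = np.fft.fftshift(np.fft.fft2(frame))  # DC moves to (h//2, w//2)
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_adversarial(frames, tracker, mirror_tracker, radius=16, iou_threshold=0.5):
    """Flag a sequence when full-frequency and low-frequency tracks diverge.

    Assumption: the discriminator is a simple threshold on the mean IoU
    between the two trackers' outputs. `radius` and `iou_threshold` are
    illustrative values, not tuned ones.
    """
    overlaps = []
    for frame in frames:
        box_full = tracker(frame)                         # full-frequency input
        box_low = mirror_tracker(low_pass(frame, radius))  # low-frequency input
        overlaps.append(iou(box_full, box_low))
    return float(np.mean(overlaps)) < iou_threshold
```

On a clean sequence the two trackers should largely agree, so the mean IoU stays high and the sequence is not flagged. If perturbation noise in the mid-to-high frequency bands misleads the full-frequency tracker while the low-frequency track is less affected, the overlap drops and the sequence is flagged.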