Abstract:
Existing attacks on speaker recognition systems need to inject a perturbation of long duration, which makes them easy for machines or administrators to detect. We propose a novel attack on speaker recognition that perturbs only one “audio pixel”. The attack exploits the black-box nature and search strategy of the differential evolution algorithm, which relies on neither the model nor gradient information. It thereby overcomes a limitation of previous work, in which the duration of the perturbation could not be constrained, and effectively spoofs speaker recognition with a one-“audio pixel” perturbation. In particular, we design a candidate point construction model based on an audio-point-disturbance tuple for time-series audio data, which solves the problem that candidate points of the differential evolution algorithm are otherwise difficult to describe for our attack. Our attack achieves a 100% success rate against 60 speakers in the LibriSpeech dataset. We also conduct extensive experiments to explore how different conditions (e.g., gender, dataset, and speaker recognition method) affect the performance of our stealthy attack; the results provide guidance for mounting effective attacks. Finally, we propose defense ideas against our stealthy attack based on denoising, reconstruction algorithms, and speech compression, respectively.
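
To make the idea concrete, the following is a minimal sketch of a one-sample, black-box search driven by differential evolution. It is an illustration under stated assumptions, not the construction used in the paper: the candidate encoding `(position, amplitude change)` is only one plausible reading of the audio-point-disturbance tuple, `toy_target_score` is a hypothetical stand-in for querying a real speaker recognition model, and the parameters `eps`, `pop_size`, `generations`, and `F` are arbitrary illustrative values. The mutation rule is the common rand/1 variant, with crossover omitted for brevity.

```python
import math
import random


def toy_target_score(audio):
    """Hypothetical black-box scorer (assumption): higher = closer to the target speaker."""
    return sum(math.sin(0.37 * i) * x for i, x in enumerate(audio))


def apply_candidate(audio, cand):
    """Return a copy of the audio with a single sample changed by cand = (position, delta)."""
    pos, delta = cand
    out = list(audio)
    out[int(round(pos))] += delta
    return out


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def one_point_attack(audio, score_fn, eps=0.5, pop_size=20,
                     generations=30, F=0.5, seed=0):
    """Search for one (position, delta) pair using only score queries, no gradients."""
    rng = random.Random(seed)
    n = len(audio)
    # Slot 0 holds the null perturbation, so the result is never worse than the clean audio.
    pop = [(0.0, 0.0)]
    pop += [(rng.uniform(0, n - 1), rng.uniform(-eps, eps))
            for _ in range(pop_size - 1)]
    fit = [score_fn(apply_candidate(audio, c)) for c in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # rand/1 mutation: combine three other members of the population.
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            trial = (clip(a[0] + F * (b[0] - c[0]), 0, n - 1),
                     clip(a[1] + F * (b[1] - c[1]), -eps, eps))
            f = score_fn(apply_candidate(audio, trial))
            # Greedy selection: keep the trial only if it is at least as good.
            if f >= fit[i]:
                pop[i], fit[i] = trial, f
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]


# Usage on a synthetic 200-sample signal (illustrative data, not real speech).
clean = [0.1 * math.cos(0.05 * i) for i in range(200)]
candidate, score = one_point_attack(clean, toy_target_score)
adversarial = apply_candidate(clean, candidate)
```

Because each candidate touches a single sample, the perturbation duration is bounded by construction, and the search needs only the scores returned by the model. In a real setting `score_fn` would wrap queries to the target system, and the clipping bound `eps` would be chosen to respect the valid amplitude range of the audio.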