Abstract:
Adversarial attacks on speech recognition models typically add noise to the entire speech signal, producing a wide perturbation range and introducing high-frequency noise. Existing work has attempted to narrow the perturbation range by designing optimization objectives; however, controlling the transcription result requires perturbing every frame, which limits further reduction. To address this issue, we propose a novel approach that examines the feature extraction process of speech recognition systems from a frame-structure perspective. We find that framing and windowing determine the distribution of critical regions within a frame: the weight a perturbation carries at each sampling point depends on that point's location within the frame. Based on a perturbation analysis of the input features, we partition the frame into regions with shared attributes. We then propose an adversarial example space measurement method and an evaluation index to quantify the weight of sampling points for adversarial example generation. Cross-experiments that add perturbations at different intervals within the frame allow us to identify the key regions for perturbation. Experiments on multiple models demonstrate that adding adversarial perturbations only to these vital regions narrows the perturbation range and provides a new perspective for generating high-quality audio adversarial examples.
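The intuition that a sampling point's in-frame position affects its perturbation weight can be illustrated with a minimal sketch (not the paper's method): in standard speech feature extraction, each frame is multiplied by a window function (e.g. a Hamming window) before the spectral transform, so the window coefficient at a position scales how strongly a perturbation at that point influences the extracted features. The frame length of 400 samples (25 ms at 16 kHz) is an assumed, common STFT setting.

```python
import numpy as np

# Assumed common setting: 25 ms frames at 16 kHz sampling rate.
frame_len = 400
window = np.hamming(frame_len)

# The window coefficient at each position acts as a per-sample weight:
# a unit perturbation at that position is scaled by this coefficient
# before the frame enters the spectral transform.
center_weight = window[frame_len // 2]  # near the frame center
edge_weight = window[0]                 # at the frame boundary

print(f"center weight: {center_weight:.3f}")  # close to 1.0
print(f"edge weight:   {edge_weight:.3f}")    # 0.08 for a Hamming window
```

Under this view, samples near the frame center carry far more weight than boundary samples, which matches the paper's observation that the effective regions for perturbation are unevenly distributed within the frame (overlapping frames complicate the picture, since a boundary sample of one frame may lie near the center of another).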