Abstract:
Audio recognition is widely applied in scenarios such as autonomous driving and the Internet of Things. In recent years, research on adversarial attacks against audio recognition has attracted extensive attention. However, most existing studies rely on coarse-grained audio features at the instance level, which leads to high generation time costs and weak universal attacking ability in the real world. To address this problem, this paper proposes a phonemic adversarial noise (PAN) generation paradigm, which exploits audio features at the phoneme level to perform fast and universal adversarial attacks. Experiments on a variety of datasets commonly used in speech recognition tasks, such as LibriSpeech, validate the effectiveness of the proposed PAN, its ability to generalize across datasets, and its ability to transfer attacks across models and across tasks. They further validate the effectiveness of the attack against civilian-oriented Internet of Things audio recognition applications on physical-world devices. Extensive experiments demonstrate that the proposed PAN outperforms the compared baselines by large margins (about 24× speedup and 38% attacking ability improvement on average), and that the proposed sampling strategy and learning method contribute significantly to reducing training time and improving attack capability.