Citation: | Wang Jiakai, Kong Yusheng, Chen Zhendong, Hu Jin, Yin Zixin, Ma Yuqing, Yang Qinghong, Liu Xianglong. Phonemic Adversarial Attack Against Audio Recognition in Physical World[J]. Journal of Computer Research and Development, 2025, 62(3): 751-764. DOI: 10.7544/issn1000-1239.202330445 |
Audio recognition has been widely applied in typical scenarios such as autonomous driving and the Internet of Things. In recent years, adversarial attacks against audio recognition have attracted extensive attention. However, most existing studies rely on coarse-grained, instance-level audio features, which leads to expensive generation time and weak universal attacking ability in the real world. To address this problem, we propose a phonemic adversarial noise (PAN) generation paradigm, which exploits phoneme-level audio features to perform fast and universal adversarial attacks. Experiments are conducted on a variety of datasets commonly used in speech recognition, such as LibriSpeech, to validate the effectiveness of the proposed PAN, its generalization across datasets, and its attack transferability across models and across tasks, as well as its effectiveness against consumer-oriented Internet of Things audio recognition applications on physical-world devices. Extensive experiments demonstrate that the proposed PAN outperforms the comparative baselines by large margins (about 24 times speedup and 38% improvement in attacking ability on average), and that the proposed sampling strategy and learning method are significant in reducing training time and improving attack capability.
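The abstract does not spell out the training procedure, but the core idea of optimizing one short universal noise clip over sampled phoneme-level audio segments can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration only, not the authors' implementation: the `segments` list of pre-extracted phoneme-level waveforms, the user-supplied `loss_fn` standing in for an ASR objective such as CTC, and all hyperparameter values.

```python
# Minimal sketch (not the authors' code): learn one short universal noise clip
# by randomly sampling phoneme-level audio segments at each step and pushing
# the victim model's recognition loss upward (untargeted attack) under an
# L-infinity perturbation budget.
import torch

def train_phonemic_noise(segments, loss_fn, noise_len=8000,
                         epsilon=0.05, steps=1000, batch_size=32, lr=1e-3):
    # segments: list of 1-D waveform tensors, padded to a common length
    # loss_fn:  callable(adv_batch, indices) -> scalar ASR loss (e.g. CTC)
    noise = torch.zeros(noise_len, requires_grad=True)
    optimizer = torch.optim.Adam([noise], lr=lr)
    seg_len = segments[0].shape[0]

    for _ in range(steps):
        # Phoneme-level sampling: draw a random mini-batch of segments.
        idx = torch.randint(len(segments), (batch_size,)).tolist()
        batch = torch.stack([segments[i] for i in idx])        # (B, T)

        # Tile the short noise so it covers segments of arbitrary length.
        reps = (seg_len + noise_len - 1) // noise_len
        tiled = noise.repeat(reps)[:seg_len]
        adv = torch.clamp(batch + tiled, -1.0, 1.0)

        # Untargeted objective: maximize the loss on the clean transcripts.
        loss = -loss_fn(adv, idx)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Project back into the imperceptibility budget.
        with torch.no_grad():
            noise.clamp_(-epsilon, epsilon)

    return noise.detach()
```

Because the noise is optimized only over short phoneme-level segments rather than whole utterances, each training step is cheap, and the same clip can be tiled over audio of any length at attack time.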