Citation: | Wang Jiakai, Kong Yusheng, Chen Zhendong, Hu Jin, Yin Zixin, Ma Yuqing, Yang Qinghong, Liu Xianglong. Phonemic Adversarial Attack Against Audio Recognition in Physical World[J]. Journal of Computer Research and Development, 2025, 62(3): 751-764. DOI: 10.7544/issn1000-1239.202330445 |
Audio recognition has been widely applied in typical scenarios such as autonomous driving and the Internet of Things. In recent years, adversarial attacks against audio recognition have attracted extensive attention. However, most existing studies rely on coarse-grained, instance-level audio features, which leads to expensive generation time and weak universal attacking ability in the real world. To address this problem, we propose a phonemic adversarial noise (PAN) generation paradigm, which exploits phoneme-level audio features to perform fast and universal adversarial attacks. Experiments are conducted on a variety of datasets commonly used in speech recognition, such as LibriSpeech, to validate the effectiveness of the proposed PAN, its generalization across datasets, and its attack transferability across models and across tasks, as well as its effectiveness against consumer-oriented Internet of Things audio recognition applications on physical-world devices. Extensive experiments demonstrate that the proposed PAN outperforms the comparative baselines by large margins (about 24 times speedup and 38% improvement in attacking ability on average), and that the proposed sampling strategy and learning method are significant in reducing training time and improving attack capability.
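The abstract does not spell out the training procedure, but the core idea of optimizing one short universal noise clip over sampled phoneme-level audio segments can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration only, not the authors' implementation: the `segments` list of pre-extracted phoneme-level waveforms, the user-supplied `loss_fn` standing in for an ASR objective such as CTC, and all hyperparameter values.

```python
# Minimal sketch (not the authors' code): learn one short universal noise clip
# by randomly sampling phoneme-level audio segments at each step and pushing
# the victim model's recognition loss upward (untargeted attack) under an
# L-infinity perturbation budget.
import torch

def train_phonemic_noise(segments, loss_fn, noise_len=8000,
                         epsilon=0.05, steps=1000, batch_size=32, lr=1e-3):
    # segments: list of 1-D waveform tensors, padded to a common length
    # loss_fn:  callable(adv_batch, indices) -> scalar ASR loss (e.g. CTC)
    noise = torch.zeros(noise_len, requires_grad=True)
    optimizer = torch.optim.Adam([noise], lr=lr)
    seg_len = segments[0].shape[0]

    for _ in range(steps):
        # Phoneme-level sampling: draw a random mini-batch of segments.
        idx = torch.randint(len(segments), (batch_size,)).tolist()
        batch = torch.stack([segments[i] for i in idx])        # (B, T)

        # Tile the short noise so it covers segments of arbitrary length.
        reps = (seg_len + noise_len - 1) // noise_len
        tiled = noise.repeat(reps)[:seg_len]
        adv = torch.clamp(batch + tiled, -1.0, 1.0)

        # Untargeted objective: maximize the loss on the clean transcripts.
        loss = -loss_fn(adv, idx)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Project back into the imperceptibility budget.
        with torch.no_grad():
            noise.clamp_(-epsilon, epsilon)

    return noise.detach()
```

Because the noise is optimized only over short phoneme-level segments rather than whole utterances, each training step is cheap, and the same clip can be tiled over audio of any length at attack time.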