    Citation: Wang Jiakai, Kong Yusheng, Chen Zhendong, Hu Jin, Yin Zixin, Ma Yuqing, Yang Qinghong, Liu Xianglong. Phonemic Adversarial Attack Against Audio Recognition in Real World[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202330445

    Phonemic Adversarial Attack Against Audio Recognition in Real World

    Abstract: Speech recognition and related intelligent technologies are widely deployed in scenarios such as autonomous driving and the Internet of Things. In recent years, adversarial attacks against speech recognition have attracted growing attention. However, most existing studies rely on coarse-grained audio features to generate adversarial noise at the instance level, which makes generation expensive in time and yields weak universal attack ability in the real world. Observing that all speech can be viewed as different combinations of basic phonemes, this paper proposes a phoneme-based universal adversarial attack method, phonemic adversarial noise (PAN). PAN attacks the fine-grained, phoneme-level audio features that are ubiquitous in audio data to generate phoneme-level adversarial noise, which gives faster noise generation and stronger universal attack ability. To evaluate the PAN framework comprehensively, experiments on LibriSpeech and other public datasets widely used in speech recognition tasks validate the attack effectiveness of the phonemic adversarial noise, its generalization across datasets, and its transferability across models and across tasks. Further experiments on physical-world devices confirm the attack's effect on civilian intelligent audio recognition applications. The results show that the proposed method outperforms the compared baselines by large margins, improving attack ability by 38% on average and generating noise more than 24 times faster, and that the proposed sampling strategy and learning method play an important role in reducing training time and improving attack ability.

       
