Abstract:
AI Computility integrates geographically distributed intelligent computing resources via high-speed networks, providing efficient and elastic infrastructure for large-scale intelligent applications. Multi-Agent Debate (MAD), a distributed multi-expert collaborative paradigm, connects multiple large language model (LLM) agents through computility networks to engage in iterative discussions in distinct roles, effectively enhancing collaborative reasoning quality. However, the security of MAD's collaborative reasoning paradigm against jailbreak attacks—a typical class of attacks targeting LLMs—remains unexplored. To address this gap, a jailbreak attack method targeting MAD is investigated under a semi-black-box threat model. A structured prompt-rewriting template is designed that systematically integrates four strategies—narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation—to exploit the interactive mechanisms of multi-agent debate and induce harmful outputs. Experiments on four mainstream MAD frameworks with GPT-4o, GPT-4, GPT-3.5-turbo, and DeepSeek demonstrate that the proposed method increases the average harmfulness of generated content by 28.14% to 80.34%, with attack success rates reaching 80% in certain scenarios. Building on these findings, preliminary defense experiments are conducted across the input, generation, and propagation stages, providing initial evidence for the feasibility of lightweight multi-stage protection in MAD systems. These results expose potential security vulnerabilities in multi-agent debate and underscore the necessity of developing robust, targeted defense strategies before the widespread deployment of multi-agent collaborative systems under AI Computility.