Abstract:
AI Computility integrates geographically distributed intelligent computing resources via high-speed networks, providing efficient and elastic infrastructure for large-scale intelligent applications. Multi-Agent Debate (MAD), a distributed multi-expert collaborative paradigm, connects multiple large language model (LLM) agents through computility networks to engage in iterative discussions in distinct roles, effectively enhancing collaborative reasoning quality. However, the security of MAD's collaborative reasoning paradigm against jailbreak attacks—a typical class of attacks targeting LLMs—remains unexplored. To address this gap, a jailbreak attack method targeting MAD is investigated under a semi-black-box threat model. A structured prompt-rewriting template is designed that systematically integrates four strategies—narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation—to exploit the interactive mechanisms of multi-agent debate and induce harmful outputs. Experiments on four mainstream MAD frameworks with GPT-4o, GPT-4, GPT-3.5-turbo, and DeepSeek demonstrate that the proposed method increases the average harmfulness of generated content by 28.14% to 80.34%, with attack success rates reaching 80% in certain scenarios. Building on these findings, preliminary defense experiments are conducted across the input, generation, and propagation stages, providing initial evidence for the feasibility of lightweight multi-stage protection in MAD systems. These results expose potential security vulnerabilities in multi-agent debate and underscore the necessity of developing robust, targeted defense strategies before the widespread deployment of multi-agent collaborative systems under AI Computility.