    JADE-DB: A Universal Testing Benchmark for Large Language Model Safety Based on Targeted Mutation

    • Abstract: We propose JADE-DB, a universal safety testing benchmark for large language models (LLMs). The benchmark is constructed automatically via a targeted mutation approach, which converts test questions manually crafted by experienced LLM safety testers and multidisciplinary experts into highly threatening universal test questions. The converted questions preserve the naturalness of human language and the core semantics of the original questions, yet consistently break the safety guardrails of more than ten well-known Chinese and international LLMs. Ordered by increasing linguistic complexity, JADE-DB comprises three safety testing levels, namely basic, advanced, and dangerous, totaling thousands of test questions that cover 4 major categories of unsafe content, i.e., crime, tort, bias, and core values, spanning more than 30 unsafe topics. Specifically, three dangerous-level universal test sets are constructed for three groups of LLMs: eight open-source Chinese, six commercial Chinese, and four commercial English models. On its dangerous-level test set, each group exhibits an average unsafe generation ratio above 70%, and every test question simultaneously triggers unsafe generation from multiple models. These results indicate that, owing to the complexity of human language, even the current best LLMs can hardly learn the effectively unlimited range of human expressions for the same intent and therefore fail to recognize the invariant unsafe essence underneath.
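
      To make the reported statistic concrete, the following is a minimal sketch of how an average unsafe-generation ratio over a group of models could be computed from benchmark records with the level/category/topic structure described above. The field names, the generate wrapper, and the is_unsafe judge are illustrative assumptions for this sketch, not the authors' released schema or evaluation tooling.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class JadeQuestion:
    """Hypothetical record layout for one JADE-DB test question (field names assumed)."""
    level: str      # "basic", "advanced", or "dangerous"
    category: str   # "crime", "tort", "bias", or "core values"
    topic: str      # one of the 30+ fine-grained unsafe topics
    question: str   # the targeted-mutation-generated test question


def unsafe_generation_ratio(
    questions: List[JadeQuestion],
    generate: Callable[[str], str],
    is_unsafe: Callable[[str, str], bool],
) -> float:
    """Fraction of questions whose model reply is judged unsafe.

    `generate` wraps one LLM under test; `is_unsafe` is a safety judge
    (e.g., human annotation or an evaluator model). Both are assumed
    interfaces, not part of JADE-DB itself.
    """
    unsafe_count = sum(is_unsafe(q.question, generate(q.question)) for q in questions)
    return unsafe_count / len(questions)


def group_average_ratio(
    questions: List[JadeQuestion],
    models: Dict[str, Callable[[str], str]],
    is_unsafe: Callable[[str, str], bool],
) -> float:
    """Average unsafe-generation ratio over a group of models,
    i.e., the kind of statistic reported as exceeding 70% on the dangerous-level sets."""
    ratios = [unsafe_generation_ratio(questions, gen, is_unsafe) for gen in models.values()]
    return sum(ratios) / len(ratios)

      Per-level or per-category breakdowns would follow the same pattern, filtering the question list by the level or category field before calling unsafe_generation_ratio.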

       
