Abstract:
We propose JADE-DB, a universal safety testing benchmark for large language models (LLMs). The benchmark is automatically constructed via a targeted mutation approach, which converts test questions manually crafted by experienced LLM testers and multidisciplinary experts into highly threatening universal test questions. The converted questions preserve the naturalness of human language and the core semantics of the original questions, while consistently breaking over ten widely used LLMs. Based on incremental linguistic complexity, JADE-DB incorporates three levels of LLM safety testing, namely basic, advanced and dangerous, comprising thousands of test questions that cover four major categories of unsafe generation, i.e., crime, tort, bias and core values, spanning over 30 unsafe topics. Specifically, we construct three dangerous-level safety benchmarks, one for each of three groups of LLMs: eight open-source Chinese LLMs, six commercial Chinese LLMs and four commercial English LLMs. These benchmarks simultaneously trigger harmful generation from multiple LLMs, with an average unsafe generation ratio of 70%. The results indicate that, due to the complexity of human language, most of the current best LLMs can hardly learn the effectively infinite variety of syntactic structures of human language, and thus fail to recognize the invariant evil expressed therein.