
Large Models for Good and Evil: An Introduction to the Special Topic on Large Models and Security (亦正亦邪大模型——大模型与安全专题导读)

虎嵩林, 李涓子, 秦兵, 邱锡鹏, 刘知远

Cite this article: 虎嵩林, 李涓子, 秦兵, 邱锡鹏, 刘知远. 亦正亦邪大模型——大模型与安全专题导读[J]. 计算机研究与发展, 2024, 61(5): 1085-1093. DOI: 10.7544/issn1000-1239.qy20240501. CSTR: 32373.14.issn1000-1239.qy20240501


  • https://www.humanornot.ai/
  • https://www.figure.ai/
  • https://www.safe.ai/work/statement-on-ai-risk
  • https://openai.com/blog/superalignment-fast-grants


Publication history
  • Published: 2024-05-13
