Large Models, Both Good and Evil: An Introduction to the Special Topic on Large Models and Security
- https://www.humanornot.ai/
- https://www.figure.ai/
- https://www.safe.ai/work/statement-on-ai-risk
- https://openai.com/blog/superalignment-fast-grants
[1] Wei J, Tay Y, Bommasani R, et al. Emergent abilities of large language models [J/OL]. Transactions on Machine Learning Research, 2022, 1. [2024-04-29]. https://openreview.net/forum?id=yzkSU5zdwD
[2] OpenAI. GPT-4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2023
[3] Biever C. ChatGPT broke the Turing test—the race is on for new ways to assess AI[J]. Nature, 2023, 619(7971): 686−689 doi: 10.1038/d41586-023-02361-7
[4] Yang Kaiyu, Swope A, Gu A, et al. LeanDojo: Theorem proving with retrieval-augmented language models [C]//Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Datasets and Benchmarks Track. New York: Curran Associates, Inc., 2023, 36: 21573−21612
[5] Boiko D A, MacKnight R, Kline B, et al. Autonomous chemical research with large language models[J]. Nature, 2023, 624(7992): 570−578 doi: 10.1038/s41586-023-06792-0
[6] Gervais D J, Nay J J. Artificial intelligence and interspecific law[J]. Science, 2023, 382(6669): 376−378 doi: 10.1126/science.adi8678
[7] Service R F. Could chatbots help devise the next pandemic virus?[J]. Science, 2023, 380(6651): 1211−1211 doi: 10.1126/science.adj3377
[8] Naddaf M. The science events to watch for in 2024[J]. Nature, 2024, 625(7994): 221−223 doi: 10.1038/d41586-023-04044-9
[9] Lu Yao, Bartolo M, Moore A, et al. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 8086−8098
[10] Wang Sirui, Wei Kaiwen, Zhang Hongzhi, et al. Let me check the examples: Enhancing demonstration learning via explicit imitation[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 1080−1088
[11] Perez E, Ringer S, Lukosiute K, et al. Discovering language model behaviors with model-written evaluations[C]//Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg, PA: ACL, 2023: 13387−13434
[12] Ouyang Long, Wu J, Jiang Xu, et al. Training language models to follow instructions with human feedback [C]//Advances in Neural Information Processing Systems 35 (NeurIPS 2022). New York: Curran Associates, Inc., 2022, 35: 27730−27744
[13] Langosco L, Koch J, Sharkey L, et al. Goal misgeneralization in deep reinforcement learning[C]//Proc of Int Conf on Machine Learning. New York: PMLR, 2022: 12004−12019
[14] Shah R, Varma V, Kumar R, et al. Goal misgeneralization: Why correct specifications aren't enough for correct goals[J]. arXiv preprint, arXiv: 2210.01790, 2022
[15] Pan A, Bhatia K, Steinhardt J. The effects of reward misspecification: Mapping and mitigating misaligned models[C]//Proc of the 10th Int Conf on Learning Representations. 2022 [2024-04-29]. https://openreview.net/forum?id=JYtwGwIL7ye
[16] Askell A, Bai Yuntao, Chen Anna, et al. A general language assistant as a laboratory for alignment[J]. arXiv preprint, arXiv: 2112.00861, 2021
[17] Wei A, Haghtalab N, Steinhardt J. Jailbroken: How does LLM safety training fail?[C]// Advances in Neural Information Processing Systems 36 (NeurIPS 2023). New York: Curran Associates, Inc., 2023, 36: 80079−80110
[18] Liang P, Bommasani R, Lee T, et al. Holistic evaluation of language models[J]. Annals of the New York Academy of Sciences, 2023, 1525(1): 140−146
[19] Zhang Zhexin, Lei Leqi, Wu Lindong, et al. SafetyBench: Evaluating the safety of large language models with multiple choice questions[J]. arXiv preprint, arXiv: 2309.07045, 2023
[20] Rudinger R, Naradowsky J, Leonard B, et al. Gender bias in coreference resolution[C]//Proc of the 2018 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2018: 8−14
[21] Lin S, Hilton J, Evans O, et al. TruthfulQA: Measuring how models mimic human falsehoods[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 3214−3252
[22] Li Junyi, Cheng Xiaoxue, Zhao W X, et al. HaluEval: A large-scale hallucination evaluation benchmark for large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 6449−6464
[23] Gehman S, Gururangan S, Sap M, et al. RealToxicityPrompts: Evaluating neural toxic degeneration in language models[C]//Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg, PA: ACL, 2020: 3356−3369
[24] Xu Jing, Ju Da, Li M, et al. Bot-adversarial dialogue for safe conversational agents[C]// Proc of the 2021 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 2950−2968
[25] Perez E, Huang S, Song F, et al. Red teaming language models with language models[C]// Proc of the 2022 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2022: 3419−3448
[26] Mehrotra A, Zampetakis M, Kassianik P, et al. Tree of attacks: Jailbreaking black-box LLMs automatically[J]. arXiv preprint, arXiv: 2312.02119, 2023
[27] Shayegani E, Mamun M A A, Fu Yu, et al. Survey of vulnerabilities in large language models revealed by adversarial attacks[J]. arXiv preprint, arXiv: 2310.10844, 2023
[28] Liu Yi, Deng Gelei, Xu Zhengzi, et al. Jailbreaking chatgpt via prompt engineering: An empirical study[J]. arXiv preprint, arXiv: 2305.13860, 2023
[29] Li Jie, Liu Yi, Liu Chongyang, et al. A cross-language investigation into jailbreak attacks in large language models[J]. arXiv preprint, arXiv: 2401.16765, 2024
[30] Lv Huijie, Wang Xiao, Zhang Yuansen, et al. CodeChameleon: Personalized encryption framework for jailbreaking large language models[J]. arXiv preprint, arXiv: 2402.16717, 2024
[31] Deng Gelei, Liu Yi, Li Yuekang, et al. MASTERKEY: Automated jailbreaking of large language model chatbots[C/OL]// Proc of 2024 ISOC NDSS (Network and Distributed System Security Symposium). [2024-04-29]. https://www.ndss-symposium.org/wp-content/uploads/2024-188-paper.pdf
[32] Chao P, Robey A, Dobriban E, et al. Jailbreaking black box large language models in twenty queries[C/OL]// Proc of Robustness of Few-shot and Zero-shot Learning in Large Foundation Models (R0-FoMo), NeurIPS 2023 Workshop. 2023 [2024-04-29]. https://openreview.net/forum?id=rYWD5TMaLj
[33] Zeng Yi, Lin Hongpeng, Zhang Jingwen, et al. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs[J]. arXiv preprint, arXiv: 2401.06373, 2024
[34] Zou A, Wang Zifan, Kolter J Z, et al. Universal and transferable adversarial attacks on aligned language models[J]. arXiv preprint, arXiv: 2307.15043, 2023
[35] Liu Xiaogeng, Xu Nan, Chen Muhao, et al. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models[J]. arXiv preprint, arXiv: 2310.04451, 2023
[36] Zhang Mi, Pan Xudong, Yang Min. JADE: A linguistics-based safety evaluation platform for large language models[J]. arXiv preprint, arXiv: 2311.00286, 2023
[37] Zhao Xuandong, Yang Xianjun, Pang Tianyu, et al. Weak-to-strong jailbreaking on large language models[J]. arXiv preprint, arXiv: 2401.17256, 2024
[38] Xu Zhihao, Huang Ruixuan, Wang Xiting, et al. Uncovering safety risks in open-source LLMs through concept activation vector[J]. arXiv preprint, arXiv: 2404.12038, 2024
[39] Li Tianlong, Dou Shihan, Liu Wenhao, et al. Open the Pandora’s Box of LLMs: Jailbreaking LLMs through representation engineering[J]. arXiv preprint, arXiv: 2401.06824, 2024
[40] Pasquini D, Strohmeier M, Troncoso C. Neural Exec: Learning (and learning from) execution triggers for prompt injection attacks[J]. arXiv preprint, arXiv: 2403.03792, 2024
[41] Shi Jiawen, Yuan Zenghui, Liu Yinuo, et al. Optimization-based prompt injection attack to LLM-as-a-judge[J]. arXiv preprint, arXiv: 2403.17710, 2024
[42] Zhang Yiming, Ippolito D. Prompts should not be seen as secrets: Systematically measuring prompt extraction attack success[J]. arXiv preprint, arXiv: 2307.06865, 2023
[43] OpenAI. Moderation [EB/OL]. [2024-04-22]. https://platform.openai.com/docs/guides/moderation/moderation
[44] Jigsaw. About the API [EB/OL]. [2024-04-22]. https://developers.perspectiveapi.com/s/about-the-api
[45] Sen I, Assenmacher D, Samory M, et al. People make better edits: Measuring the efficacy of LLM-generated counterfactually augmented data for harmful language detection[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 10480−10504
[46] Hu Beizhe, Sheng Qiang, Cao Juan, et al. Bad actor, good advisor: Exploring the role of large language models in fake news detection[C]//Proc of the AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2024, 38(20): 22105−22113
[47] Ji Ziwei, Lee N, Frieske R, et al. Survey of hallucination in natural language generation[J]. ACM Computing Surveys, 2023, 55(12): 1−38
[48] Muhlgay D, Ram O, Magar I, et al. Generating benchmarks for factuality evaluation of language models[C]// Proc of the 18th Conf of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2024: 49−66
[49] Yang Xianjun, Cheng Wei, Wu Yue, et al. DNA-GPT: Divergent n-gram analysis for training-free detection of GPT-generated text[C/OL]//Proc of the 12th Int Conf on Learning Representations. 2024 [2024-04-29]. https://openreview.net/forum?id=Xlayxj2fWp
[50] Karamolegkou A, Li Jiaang, Zhou Li, et al. Copyright violations and large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 7403−7412
[51] Bai Yuntao, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI feedback[J]. arXiv preprint, arXiv: 2212.08073, 2022
[52] Lightman H, Kosaraju V, Burda Y, et al. Let’s verify step by step[C/OL]//Proc of the 12th Int Conf on Learning Representations. 2024 [2024-04-29]. https://openreview.net/forum?id=v8L0pN6EOi
[53] Rafailov R, Sharma A, Mitchell E, et al. Direct preference optimization: Your language model is secretly a reward model[C]//Advances in Neural Information Processing Systems 36 (NeurIPS 2023). New York: Curran Associates, Inc., 2023, 36: 53728−53741
[54] Qian Jing, Dong Li, Shen Yelong, et al. Controllable natural language generation with contrastive prefixes[C]//Findings of the Association for Computational Linguistics: ACL 2022. Stroudsburg, PA: ACL, 2022: 2912−2924
[55] Dathathri S, Madotto A, Lan J, et al. Plug and play language models: A simple approach to controlled text generation[C/OL]//Proc of Int Conf on Learning Representations. 2020 [2024-04-29]. https://openreview.net/forum?id=H1edEyBKDS
[56] Li Xian, Yu Ping, Zhou Chunting, et al. Self-alignment with instruction backtranslation[C/OL]// Proc of the 12th Int Conf on Learning Representations. 2024 [2024-04-29]. https://openreview.net/forum?id=1oijHJBRsT
[57] Liu Ruibo, Yang Ruixin, Jia Chenyan, et al. Training socially aligned language models in simulated human society [C/OL]//Proc of the 12th Int Conf on Learning Representations. 2024 [2024-04-29]. https://openreview.net/forum?id=NddKiWtdUm
[58] Goyal S, Hira M, Mishra S, et al. LLMGuard: Guarding against unsafe LLM behavior[C]//Proc of the AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2024, 38(21): 23790−23792
[59] Mündler N, He J, Jenko S, et al. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation[C/OL]// Proc of the 12th Int Conf on Learning Representations. 2024 [2024-04-29]. https://openreview.net/forum?id=EmQSOi1X2f
[60] Caselli T, Basile V, Mitrović J, et al. HateBERT: Retraining BERT for abusive language detection in English[C]//Proc of the 5th Workshop on Online Abuse and Harms (WOAH 2021). Stroudsburg, PA: ACL, 2021: 17−25
[61] Gémes K, Kovács Á, Recski G. Offensive text detection across languages and datasets using rule-based and hybrid methods[C/OL]// Advances in Interpretable Machine Learning and Artificial Intelligence Workshop. 2022 [2024-04-29]. https://ceur-ws.org/Vol-3318/short22.pdf
[62] Longpre S, Yauney G, Reif E, et al. A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity[J]. arXiv preprint, arXiv: 2305.13169, 2023
[63] Kandpal N, Wallace E, Raffel C. Deduplicating training data mitigates privacy risks in language models[C]//Proc of Int Conf on Machine Learning. New York: PMLR, 2022: 10697−10707
[64] Lee K, Ippolito D, Nystrom A, et al. Deduplicating training data makes language models better[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2022: 8424−8445
[65] Lukas N, Salem A, Sim R, et al. Analyzing leakage of personally identifiable information in language models[C]//Proc of 2023 IEEE Symp on Security and Privacy (SP). Piscataway, NJ: IEEE, 2023: 346−363
[66] Carlini N, Tramer F, Wallace E, et al. Extracting training data from large language models[C]//Proc of the 30th USENIX Security Symp (USENIX Security 21). Berkeley, CA: USENIX Association, 2021: 2633−2650
[67] Mireshghallah F, Uniyal A, Wang Tianhao, et al. An empirical analysis of memorization in fine-tuned autoregressive language models[C]//Proc of the 2022 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2022: 1816−1826
[68] Su Zhenpeng, Wu Xing, Bai Xue, et al. MiLe loss: A new loss for mitigating the bias of learning difficulties in generative language models [C/OL]// Proc of 2024 Annual Conf of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2024 [2024-04-29]. https://arxiv.org/abs/2310.19531
[69] Li Yansong, Tan Zhixing, Liu Yang. Privacy-preserving prompt tuning for large language model services[J]. arXiv preprint, arXiv: 2305.06212, 2023
[70] Jiang Zhengbao, Xu F F, Gao Luyu, et al. Active retrieval augmented generation[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 7969−7992
[71] Zhang Yunxiang, Khalifa M, Logeswaran L, et al. Merging generated and retrieved knowledge for open-domain QA[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 4710−4728
[72] Xu Zhangchen, Jiang Fengqing, Niu Luyao, et al. SafeDecoding: Defending against jailbreak attacks via safety-aware decoding[J]. arXiv preprint, arXiv: 2402.08983, 2024
[73] Li K, Patel O, Viégas F, et al. Inference-time intervention: Eliciting truthful answers from a language model[C]// Advances in Neural Information Processing Systems 36 (NeurIPS 2023). New York: Curran Associates, Inc., 2023, 36: 41451−41530
[74] Varshney N, Yao Wenlin, Zhang Hongming, et al. A stitch in time saves nine: Detecting and mitigating hallucinations of LLMs by validating low-confidence generation[J]. arXiv preprint, arXiv: 2307.03987, 2023
[75] Mozes M, He Xuanli, Kleinberg B, et al. Use of LLMs for illicit purposes: Threats, prevention measures, and vulnerabilities[J]. arXiv preprint, arXiv: 2308.12833, 2023
[76] Yao Yunzhi, Wang Peng, Tian Bozhong, et al. Editing large language models: Problems, methods, and opportunities[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 10222−10240
[77] Ozdayi M, Peris C, FitzGerald J, et al. Controlling the extraction of memorized data from large language models via prompt-tuning[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg, PA: ACL, 2023: 1512−1521
[78] Brcic M, Yampolskiy R V. Impossibility results in AI: A survey[J]. ACM Computing Surveys, 2023, 56(1): 1−24
[79] Wolf Y, Wies N, Avnery O, et al. Fundamental limitations of alignment in large language models[J]. arXiv preprint, arXiv: 2304.11082, 2023