Robustness of GPT Large Language Models on Natural Language Processing Tasks

陈炫婷; 叶俊杰; 祖璨; 许诺; 桂韬; 张奇

doi:10.7544/issn1000-1239.202330801

Abstract: Large language models (LLMs) have attracted wide attention for their ability to handle a broad range of natural language processing (NLP) tasks. However, their robustness in the complex scenarios of the open world has not yet been well explored, even though it is crucial for assessing the stability and reliability of these models and is a key aspect of trustworthy AI. In this study, we analyze the GPT-3 and GPT-3.5 series models using 15 datasets covering 9 popular NLP tasks (about 147,000 original test samples) and 61 robust probing transformations from TextFlint. We measure their performance on the original datasets and their robustness across tasks and across transformation levels (character, word, and sentence). Our findings show that although GPT models perform well on classification tasks such as sentiment analysis and semantic matching, and on reading comprehension, their ability on information extraction tasks remains weak. In relation extraction, for instance, they severely confuse relation types and even exhibit “hallucination”. In the robustness evaluation, GPT models are weak at both the task level and the transformation level, and the lack of robustness is most pronounced on classification tasks and sentence-level transformations. We also examine how performance and robustness change across model iterations, and how the number and content of demonstrations in the prompt affect them. The results show that performance improves steadily with model iteration and with in-context learning, but robustness still needs substantial improvement. These findings show, from the perspectives of task type, transformation type, and prompt content, that GPT models are not yet fully proficient at common NLP tasks, and that their robustness problems are difficult to solve by improving model performance or changing prompt content. A comparison with the updated version of gpt-3.5-turbo, gpt-4, and the open-source models LLaMA2-7B and LLaMA2-13B further validates these conclusions. Future research on large language models should therefore strengthen their information extraction and semantic understanding abilities, and should consider improving robustness during training or fine-tuning.
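The robustness evaluation summarized in the abstract compares performance on original samples with performance on transformed samples. The following is a minimal sketch of that idea, not the authors' code and not the TextFlint API: the function names, the toy character-level transformation, the keyword classifier standing in for a model call, and the use of accuracy as the metric are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (input text, gold label)


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Toy character-level transformation (hypothetical): swap one adjacent pair."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def accuracy(predict: Callable[[str], str], samples: List[Sample]) -> float:
    """Fraction of samples whose predicted label equals the gold label."""
    if not samples:
        return 0.0
    correct = sum(1 for text, label in samples if predict(text) == label)
    return correct / len(samples)


def performance_drop(
    predict: Callable[[str], str],
    samples: List[Sample],
    transform: Callable[[str, random.Random], str],
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (original accuracy, transformed accuracy, drop).

    The transformation is assumed to preserve the gold label, so each
    transformed input keeps the label of its original sample.
    """
    rng = random.Random(seed)
    transformed = [(transform(text, rng), label) for text, label in samples]
    original_acc = accuracy(predict, samples)
    transformed_acc = accuracy(predict, transformed)
    return original_acc, transformed_acc, original_acc - transformed_acc


def keyword_classifier(text: str) -> str:
    """Hypothetical stand-in for a model call; a real study would query an LLM."""
    return "positive" if "good" in text.lower() else "negative"


if __name__ == "__main__":
    data = [("a good movie", "positive"), ("a dull movie", "negative")]
    print(performance_drop(keyword_classifier, data, swap_adjacent_chars))
```

A larger drop indicates lower robustness to the given transformation. Word-level and sentence-level transformations would plug into the same `transform` argument.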
