Citation: Chen Xuanting, Ye Junjie, Zu Can, Xu Nuo, Gui Tao, Zhang Qi. Robustness of GPT Large Language Models on Natural Language Processing Tasks[J]. Journal of Computer Research and Development, 2024, 61(5): 1128-1142. DOI: 10.7544/issn1000-1239.202330801
The GPT models have demonstrated impressive performance on various natural language processing (NLP) tasks. However, their robustness and their ability to handle the complexities of the open world have not yet been well explored; such robustness is crucial for assessing the stability of models and is a key aspect of trustworthy AI. In this study, we perform a comprehensive experimental analysis of the GPT-3 and GPT-3.5 series models, exploring their performance and robustness on 15 datasets (about 147,000 original test samples) with 61 robustness-probing transformations from TextFlint, covering 9 popular NLP tasks. We also analyze the models' robustness across transformation levels: character, word, and sentence. Our findings reveal that while GPT models achieve competitive performance on tasks such as sentiment analysis, semantic matching, and reading comprehension, they struggle with information extraction; for instance, they are severely confused by relation extraction and even exhibit "hallucination" phenomena. They also suffer significant robustness degradation across tasks and transformations, especially on classification tasks and sentence-level transformations. Furthermore, we validate the impact of the quantity and form of demonstrations on performance and robustness. These findings show that GPT models are still not fully proficient at handling common NLP tasks, and they highlight the difficulty of addressing robustness challenges by enhancing model performance or altering prompt content. By comparing the performance and robustness of the updated gpt-3.5-turbo, gpt-4, LLaMA2-7B, and LLaMA2-13B, we further validate these findings. Future studies on large language models should strive to enhance their capacities for information extraction and semantic understanding, while simultaneously bolstering overall robustness.
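To make the character-, word-, and sentence-level transformations and the few-shot demonstration setup concrete, the sketch below is a minimal Python illustration. It is not the authors' evaluation code and not TextFlint's actual API; the sample review, synonym table, distractor sentence, and prompt template are all hypothetical, chosen only to show label-preserving perturbations at each level and how k demonstrations are prepended to a query for in-context learning.

```python
# Hypothetical sketch of the robustness-probing idea: apply label-preserving
# perturbations at three levels (character, word, sentence) to a test sample,
# and build a k-shot prompt from demonstrations. Not the authors' code and
# not TextFlint's API.
import random

random.seed(0)  # keep the example deterministic


def char_level(text: str) -> str:
    """Character level: swap two adjacent characters in one random word."""
    words = text.split()
    i = random.randrange(len(words))
    w = words[i]
    if len(w) > 3:
        j = random.randrange(len(w) - 1)
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def word_level(text: str, synonyms: dict) -> str:
    """Word level: substitute words with meaning-preserving synonyms."""
    return " ".join(synonyms.get(w, w) for w in text.split())


def sentence_level(text: str, distractor: str) -> str:
    """Sentence level: append an irrelevant distractor sentence."""
    return f"{text} {distractor}"


def k_shot_prompt(demos: list, query: str, k: int) -> str:
    """Prepend k labeled demonstrations (in-context learning) to the query."""
    shots = "\n\n".join(f"Review: {x}\nSentiment: {y}" for x, y in demos[:k])
    return f"{shots}\n\nReview: {query}\nSentiment:"


sample = "The plot was engaging and the acting was excellent."
print(char_level(sample))
print(word_level(sample, {"excellent": "superb", "engaging": "absorbing"}))
print(sentence_level(sample, "The theater opened in 1994."))
print(k_shot_prompt([("A dull, lifeless film.", "negative")], sample, k=1))
```

Because each perturbation preserves the gold label, any change in the model's prediction on the transformed sample counts against its robustness.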
[1] Wei J, Bosma M, Zhao V Y, et al. Finetuned language models are zero-shot learners[J]. arXiv preprint, arXiv: 2109.01652, 2021
[2] Brown T B, Mann B, Ryder N, et al. Language models are few-shot learners[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
[3] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023
[4] Anil R, Dai A M, Firat O, et al. PaLM 2 technical report[J]. arXiv preprint, arXiv: 2305.10403, 2023
[5] Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models[J]. arXiv preprint, arXiv: 2001.08361, 2020
[6] Qin Chengwei, Zhang A, Zhang Zhuosheng, et al. Is ChatGPT a general-purpose natural language processing task solver?[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 1339−1384
[7] Bang Y, Cahyawijaya S, Lee N, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity[J]. arXiv preprint, arXiv: 2302.04023, 2023
[8] Kosinski M. Theory of mind may have spontaneously emerged in large language models[J]. arXiv preprint, arXiv: 2302.02083, 2023
[9] Frieder S, Pinchetti L, Griffiths R R, et al. Mathematical capabilities of ChatGPT[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://neurips.cc/virtual/2023/poster/73421
[10] Wang Jindong, Hu Xixu, Hou Wenxin, et al. On the robustness of ChatGPT: An adversarial and out-of-distribution perspective[J]. arXiv preprint, arXiv: 2302.12095, 2023
[11] Wang Boxin, Xu Chejian, Wang Shuohang, et al. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models[J]. arXiv preprint, arXiv: 2111.02840, 2021
[12] Nie Y, Williams A, Dinan E, et al. Adversarial NLI: A new benchmark for natural language understanding[C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 4885−4901
[13] Fansi T A, Goel R, Wen Zhi, et al. DDXPlus: A new dataset for automatic medical diagnosis[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper_files/paper/2022/hash/cae73a974390c0edd95ae7aeae09139c-Abstract-Datasets_and_Benchmarks.html
[14] Zhu Kaijie, Wang Jindong, Zhou Jiaheng, et al. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts[J]. arXiv preprint, arXiv: 2306.04528, 2023
[15] Wang Xiao, Liu Qin, Gui Tao, et al. TextFlint: Unified multilingual robustness evaluation toolkit for natural language processing[C]//Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing: System Demonstrations. Stroudsburg, PA: ACL, 2021: 347−355
[16] Dong Qingxiu, Li Lei, Dai Damai, et al. A survey for in-context learning[J]. arXiv preprint, arXiv: 2301.00234, 2022
[17] Zhuo T Y, Huang Yujin, Chen Chunyang, et al. Exploring AI ethics of ChatGPT: A diagnostic analysis[J]. arXiv preprint, arXiv: 2301.12867, 2023
[18] Choi J H, Hickman K E, Monahan A B, et al. ChatGPT goes to law school[J]. Journal of Legal Education, 2022, 71(3): 387
[19] Khalil M, Er E. Will ChatGPT get you caught? Rethinking of plagiarism detection[C]//Proc of Int Conf on Human-Computer Interaction. Berlin: Springer, 2023: 475−487
[20] Alshater M. Exploring the role of artificial intelligence in enhancing academic performance: A case study of ChatGPT[J/OL]. [2023-09-12]. http://dx.doi.org/10.2139/ssrn.4312358
[21] Tabone W, De Winter J. Using ChatGPT for human–computer interaction research: A primer[J]. Royal Society Open Science, 2023, 10(9): 21
[22] Jeblick K, Schachtner B, Dexl J, et al. ChatGPT makes medicine easy to swallow: An exploratory case study on simplified radiology reports[J]. European Radiology, 2023: 1−9
[23] Biswas S. ChatGPT and the future of medical writing[J]. Radiology, 2023, 307(2): e223312. DOI: 10.1148/radiol.223312
[24] Goodfellow I J, Shlens J, Szegedy C. Explaining and harnessing adversarial examples[J]. arXiv preprint, arXiv: 1412.6572, 2014
[25] Pontiki M, Galanis D, Pavlopoulos J, et al. SemEval-2014 Task 4: Aspect based sentiment analysis[C]//Proc of the 8th Int Workshop on Semantic Evaluation. Stroudsburg, PA: ACL, 2014: 27−35
[26] Maas A L, Daly R E, Pham P T, et al. Learning word vectors for sentiment analysis[C]//Proc of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2011: 142−150
[27] Williams A, Nangia N, Bowman S R. A broad-coverage challenge corpus for sentence understanding through inference[J]. arXiv preprint, arXiv: 1704.05426, 2017
[28] Dolan W B, Brockett C. Automatically constructing a corpus of sentential paraphrases[C]//Proc of the 3rd Int Workshop on Paraphrasing (IWP2005). Jeju Island: Asia Federation of Natural Language Processing, 2005: 9−16
[29] Wang Zhiguo, Hamza W, Florian R. Bilateral multi-perspective matching for natural language sentences[C]//Proc of the 26th Int Joint Conf on Artificial Intelligence. Australia: IJCAI.org, 2017: 4144−4150
[30] Levesque H, Davis E, Morgenstern L. The Winograd schema challenge[C]//Proc of the 13th Int Conf on the Principles of Knowledge Representation and Reasoning. Palo Alto, CA: AAAI, 2012: 552−561
[31] Rajpurkar P, Zhang Jian, Lopyrev K, et al. SQuAD: 100,000+ questions for machine comprehension of text[C]//Proc of the 2016 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2016: 2383−2392
[32] Rajpurkar P, Jia R, Liang P. Know what you don’t know: Unanswerable questions for SQuAD[C]//Proc of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg, PA: ACL, 2018: 784−789
[33] Marcus M, Santorini B, Marcinkiewicz M A. Building a large annotated corpus of English: The Penn Treebank[J]. Computational Linguistics, 1993, 19(2): 313−330
[34] Sang E T K, De Meulder F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition[C]//Proc of the 7th Conf on Natural Language Learning at HLT-NAACL 2003. Stroudsburg, PA: ACL, 2003: 142−147
[35] Weischedel R, Palmer M, Marcus M, et al. OntoNotes release 5.0 LDC2013T19[DB/OL]. Philadelphia, PA: Linguistic Data Consortium, 2013
[36] Zhang Yuhao, Zhong V, Chen Danqi, et al. Position-aware attention and supervised data improve slot filling[C]//Proc of Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2017: 35−45
[37] Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html
[38] Christiano P F, Leike J, Brown T, et al. Deep reinforcement learning from human preferences[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://papers.nips.cc/paper_files/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html
[39] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional Transformers for language understanding[C]//Proc of NAACL-HLT. Stroudsburg, PA: ACL, 2019: 4171−4186
[40] Chen Lingjiao, Zaharia M, Zou J. How is ChatGPT’s behavior changing over time?[J]. arXiv preprint, arXiv: 2307.09009, 2023
[41] Tu Shangqing, Li Chunyang, Yu Jifan, et al. ChatLog: Recording and analyzing ChatGPT across time[J]. arXiv preprint, arXiv: 2304.14106, 2023