Chen Xuanting, Ye Junjie, Zu Can, Xu Nuo, Gui Tao, Zhang Qi. Robustness of GPT Large Language Models on Natural Language Processing Tasks[J]. Journal of Computer Research and Development, 2024, 61(5): 1128-1142. DOI: 10.7544/issn1000-1239.202330801

Robustness of GPT Large Language Models on Natural Language Processing Tasks

More Information
  • Author Bio:

    Chen Xuanting: born in 1999. Master candidate. Her main research interests include natural language processing and robust models

    Ye Junjie: born in 2001. PhD candidate. His main research interests include natural language processing

    Zu Can: born in 2000. Master candidate. Her main research interests include large language models and information extraction

    Xu Nuo: born in 1998. Master candidate. Her main research interests include natural language processing and large language models

    Gui Tao: born in 1989. PhD, associate professor, master supervisor. His main research interests include pre-trained models, information extraction, and robust models

    Zhang Qi: born in 1981. PhD, professor, PhD supervisor. His main research interests include natural language processing and information retrieval

  • Received Date: October 09, 2023
  • Revised Date: March 11, 2024
  • Available Online: March 11, 2024
  • Abstract: GPT models have demonstrated impressive performance on various natural language processing (NLP) tasks. However, their robustness and their ability to handle the complexities of the open world have not yet been well explored; this is crucial for assessing the stability of models and is a key aspect of trustworthy AI. In this study, we perform a comprehensive experimental analysis of the GPT-3 and GPT-3.5 series models, exploring their performance and robustness on 15 datasets (about 147,000 original test samples) with 61 robustness-probing transformations from TextFlint, covering 9 popular NLP tasks. We also analyze the models’ robustness across transformation levels: character, word, and sentence. Our findings reveal that while GPT models achieve competitive performance on tasks such as sentiment analysis, semantic matching, and reading comprehension, they struggle considerably with information extraction; for instance, they are severely confused by relation extraction and even exhibit “hallucination” phenomena. Moreover, their robustness degrades significantly across tasks and transformations, especially on classification tasks and sentence-level transformations. Furthermore, we validate the impact of the quantity and form of demonstrations on performance and robustness. These findings show that GPT models are still not fully proficient at common NLP tasks, and highlight the difficulty of addressing robustness challenges by enhancing model performance or altering prompt content. By comparing the performance and robustness of the updated gpt-3.5-turbo, gpt-4, LLaMA2-7B, and LLaMA2-13B, we further validate the experimental findings. Future studies on large language models should strive to enhance their capacities for information extraction and semantic understanding, while simultaneously bolstering overall robustness.
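
    As a concrete illustration of the probing protocol described above, the minimal Python sketch below perturbs a test sample at the character level, queries a model on both the original and the perturbed input under a k-shot prompt, and measures how often a correct prediction flips. This is a sketch under stated assumptions, not the paper's released code: typo_transform is a simplified stand-in for TextFlint's transformations (not the toolkit's API), query_model is a hypothetical placeholder for a gpt-3.5-turbo or LLaMA2 call, and the sentiment prompt format is purely illustrative.

      import random

      def typo_transform(text: str, rate: float = 0.05, seed: int = 0) -> str:
          # Character-level perturbation: randomly swap adjacent letters inside
          # words (a simplified stand-in for TextFlint-style transformations).
          rng = random.Random(seed)
          chars = list(text)
          for i in range(len(chars) - 1):
              if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
                  chars[i], chars[i + 1] = chars[i + 1], chars[i]
          return "".join(chars)

      def build_prompt(demos: list[tuple[str, str]], query: str) -> str:
          # k-shot prompt: varying len(demos) probes how demonstration quantity
          # affects performance and robustness.
          blocks = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
          blocks.append(f"Review: {query}\nSentiment:")
          return "\n\n".join(blocks)

      def query_model(prompt: str) -> str:
          # Hypothetical placeholder: substitute a real GPT-3.5 / LLaMA2 API call.
          raise NotImplementedError

      def flip_rate(test_set: list[tuple[str, str]], demos: list[tuple[str, str]]) -> float:
          # A robustness failure: correct on the original input but wrong on its
          # perturbed copy. This isolates robustness from raw task accuracy.
          flips = 0
          for text, gold in test_set:
              original = query_model(build_prompt(demos, text))
              perturbed = query_model(build_prompt(demos, typo_transform(text)))
              if original.strip() == gold and perturbed.strip() != gold:
                  flips += 1
          return flips / len(test_set)

    Word- and sentence-level probes fit the same loop by swapping out the transformation function; the flip rate, rather than accuracy alone, is one natural quantity for a robustness analysis to track.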

  • [1] Wei J, Bosma M, Zhao V Y, et al. Finetuned language models are zero-shot learners[J]. arXiv preprint, arXiv: 2109.01652, 2021
    [2] Brown T B, Mann B, Ryder N, et al. Language models are few-shot learners[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
    [3] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023
    [4] Anil R, Dai A M, Firat O, et al. PaLM 2 technical report[J]. arXiv preprint, arXiv: 2305.10403, 2023
    [5] Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models[J]. arXiv preprint, arXiv: 2001.08361, 2020
    [6] Qin Chengwei, Zhang A, Zhang Zhuosheng, et al. Is ChatGPT a general-purpose natural language processing task solver?[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 1339−1384
    [7] Bang Y, Cahyawijaya S, Lee N, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity[J]. arXiv preprint, arXiv: 2302.04023, 2023
    [8] Kosinski M. Theory of mind may have spontaneously emerged in large language models[J]. arXiv preprint, arXiv: 2302.02083, 2023
    [9] Frieder S, Pinchetti L, Griffiths R R, et al. Mathematical capabilities of ChatGPT[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://neurips.cc/virtual/2023/poster/73421
    [10] Wang Jindong, Hu Xixu, Hou Wenxin, et al. On the robustness of ChatGPT: An adversarial and out-of-distribution perspective[J]. arXiv preprint, arXiv: 2302.12095, 2023
    [11] Wang Boxin, Xu Chejian, Wang Shuohang, et al. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models[J]. arXiv preprint, arXiv: 2111.02840, 2021
    [12] Nie Y, Williams A, Dinan E, et al. Adversarial NLI: A new benchmark for natural language understanding[C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 4885−4901
    [13] Fansi T A, Goel R, Wen Zhi, et al. DDXPlus: A new dataset for automatic medical diagnosis[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper_files/paper/2022/hash/cae73a974390c0edd95ae7aeae09139c-Abstract-Datasets_and_Benchmarks.html
    [14] Zhu Kaijie, Wang Jindong, Zhou Jiaheng, et al. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts[J]. arXiv preprint, arXiv: 2306.04528, 2023
    [15] Wang Xiao, Liu Qin, Gui Tao, et al. TextFlint: Unified multilingual robustness evaluation toolkit for natural language processing[C]//Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing: System Demonstrations. Stroudsburg, PA: ACL, 2021: 347−355
    [16] Dong Qingxiu, Li Lei, Dai Damai, et al. A survey for in-context learning[J]. arXiv preprint, arXiv: 2301.00234, 2022
    [17] Zhuo T Y, Huang Yujin, Chen Chunyang, et al. Exploring AI ethics of ChatGPT: A diagnostic analysis[J]. arXiv preprint, arXiv: 2301.12867, 2023
    [18] Choi J H, Hickman K E, Monahan A B, et al. ChatGPT goes to law school[J]. Journal of Legal Education, 2021, 71(3): 387
    [19] Khalil M, Er E. Will ChatGPT get you caught? Rethinking of plagiarism detection[C]//Proc of Int Conf on Human-Computer Interaction. Berlin: Springer, 2023: 475−487
    [20] Alshater M. Exploring the role of artificial intelligence in enhancing academic performance: A case study of ChatGPT[J/OL]. [2023-09-12]. http://dx.doi.org/10.2139/ssrn.4312358
    [21] Tabone W, De Winter J. Using ChatGPT for human–computer interaction research: A primer[J]. Royal Society Open Science, 2023, 10(9): 21
    [22] Jeblick K, Schachtner B, Dexl J, et al. ChatGPT makes medicine easy to swallow: An exploratory case study on simplified radiology reports[J]. European Radiology, 2023: 1−9
    [23] Biswas S. ChatGPT and the future of medical writing[J]. Radiology, 2023, 307(2): e223312. doi: 10.1148/radiol.223312
    [24] Goodfellow I J, Shlens J, Szegedy C. Explaining and harnessing adversarial examples[J]. arXiv preprint, arXiv: 1412.6572, 2014
    [25] Pontiki M, Galanis D, Pavlopoulos J, et al. SemEval-2014 Task 4: Aspect based sentiment analysis[C]//Proc of the 8th Int Workshop on Semantic Evaluation. Stroudsburg, PA: ACL, 2014: 27−35
    [26] Maas A L, Daly R E, Pham P T, et al. Learning word vectors for sentiment analysis[C]//Proc of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2011: 142−150
    [27] Williams A, Nangia N, Bowman S R. A broad-coverage challenge corpus for sentence understanding through inference[J]. arXiv preprint, arXiv: 1704.05426, 2017
    [28] Dolan W B, Brockett C. Automatically constructing a corpus of sentential paraphrases[C]//Proc of the 3rd Int Workshop on Paraphrasing (IWP2005). Jeju Island: Asia Federation of Natural Language Processing, 2005: 9−16
    [29] Wang Zhiguo, Hamza W, Florian R. Bilateral multi-perspective matching for natural language sentences[C]//Proc of the 26th Int Joint Conf on Artificial Intelligence. Australia: IJCAI.org, 2017: 4144−4150
    [30] Levesque H, Davis E, Morgenstern L. The Winograd schema challenge[C]//Proc of the 13th Int Conf on the Principles of Knowledge Representation and Reasoning. Palo Alto, CA: AAAI, 2012: 552−561
    [31] Rajpurkar P, Zhang Jian, Lopyrev K, et al. SQuAD: 100,000+ questions for machine comprehension of text[C]//Proc of the 2016 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2016: 2383−2392
    [32] Rajpurkar P, Jia R, Liang P. Know what you don’t know: Unanswerable questions for SQuAD[C]//Proc of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg, PA: ACL, 2018: 784−789
    [33] Marcus M, Santorini B, Marcinkiewicz M A. Building a large annotated corpus of English: The Penn Treebank[J]. Computational Linguistics, 1993, 19(2): 313−330
    [34] Sang E T K, De Meulder F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition[C]//Proc of the 7th Conf on Natural Language Learning at HLT-NAACL 2003. Stroudsburg, PA: ACL, 2003: 142−147
    [35] Weischedel R, Palmer M, Marcus M, et al. OntoNotes release 5.0 LDC2013T19[J]. Linguistic Data Consortium, 2013, 23(1): 170
    [36] Zhang Yuhao, Zhong V, Chen Danqi, et al. Position-aware attention and supervised data improve slot filling[C]//Proc of the 2017 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2017: 35−45
    [37] Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html
    [38] Christiano P F, Leike J, Brown T, et al. Deep reinforcement learning from human preferences[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://papers.nips.cc/paper_files/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html
    [39] Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding[C]//Proc of NAACL-HLT. Stroudsburg, PA: ACL, 2019: 4171−4186
    [40] Chen Lingjiao, Zaharia M, Zou J. How is ChatGPT’s behavior changing over time?[J]. arXiv preprint, arXiv: 2307.09009, 2023
    [41] Tu Shangqing, Li Chunyang, Yu Jifan, et al. ChatLog: Recording and analyzing ChatGPT across time[J]. arXiv preprint, arXiv: 2304.14106, 2023