• China Premium Science and Technology Journal
  • CCF-recommended Class A Chinese journal
  • T1-class high-quality science and technology journal in the computing field
Chen Xuanting, Ye Junjie, Zu Can, Xu Nuo, Gui Tao, Zhang Qi. Robustness of GPT Large Language Models on Natural Language Processing Tasks[J]. Journal of Computer Research and Development, 2024, 61(5): 1128-1142. DOI: 10.7544/issn1000-1239.202330801

Robustness of GPT Large Language Models on Natural Language Processing Tasks

More Information
  • Author Bio:

    Chen Xuanting: born in 1999. Master candidate. Her main research interests include natural language processing and robust models

    Ye Junjie: born in 2001. PhD candidate. His main research interest is natural language processing

    Zu Can: born in 2000. Master candidate. Her main research interests include large language models and information extraction

    Xu Nuo: born in 1998. Master candidate. Her main research interests include natural language processing and large language models

    Gui Tao: born in 1989. PhD, associate professor, master supervisor. His main research interests include pre-training models, information extraction, and robust models

    Zhang Qi: born in 1981. PhD, professor, PhD supervisor. His main research interests include natural language processing and information retrieval

  • Received Date: October 09, 2023
  • Revised Date: March 11, 2024
  • Available Online: March 11, 2024
  • Abstract: GPT models have demonstrated impressive performance on various natural language processing (NLP) tasks. However, their robustness and their ability to handle the complexities of the open world have yet to be well explored; this is crucial for assessing the stability of models and is a key aspect of trustworthy AI. In this study, we conduct a comprehensive experimental analysis of the GPT-3 and GPT-3.5 series models, exploring their performance and robustness on 15 datasets (about 147,000 original test samples) with 61 robustness-probing transformations from TextFlint, covering 9 popular NLP tasks. We also analyze the models’ robustness across transformation levels: character, word, and sentence. Our findings reveal that while GPT models achieve competitive performance on tasks such as sentiment analysis, semantic matching, and reading comprehension, they struggle with information extraction tasks. For instance, they exhibit severe confusion in relation extraction and even display “hallucination” phenomena. Moreover, their robustness degrades significantly across tasks and transformations, especially in classification tasks and under sentence-level transformations. Furthermore, we validate the impact of the quantity and form of demonstrations on performance and robustness. These findings indicate that GPT models are still not fully proficient at common NLP tasks, and highlight the difficulty of addressing robustness challenges by enhancing model performance or altering prompt content. By comparing the performance and robustness of the updated gpt-3.5-turbo, gpt-4, LLaMA2-7B, and LLaMA2-13B, we further validate these experimental findings. Future work on large language models should strive to enhance their capabilities in information extraction and semantic understanding while bolstering overall robustness.
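  • As the abstract describes, robustness is probed by applying input transformations at the character, word, and sentence levels and measuring the resulting performance drop. A minimal sketch of this idea in Python (not the paper’s actual TextFlint pipeline; the function names, swap rate, and accuracy numbers below are illustrative assumptions):

    ```python
    import random

    def char_swap(text: str, rate: float = 0.2, seed: int = 0) -> str:
        """Illustrative character-level perturbation: randomly swap
        adjacent alphabetic characters at the given rate, preserving
        length and the multiset of characters."""
        rng = random.Random(seed)
        chars = list(text)
        for i in range(len(chars) - 1):
            if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    def relative_drop(acc_original: float, acc_transformed: float) -> float:
        """Relative performance degradation on the transformed test set."""
        return (acc_original - acc_transformed) / acc_original

    print(char_swap("The movie was surprisingly good", seed=1))
    print(relative_drop(0.90, 0.72))  # ≈ 0.2, i.e. a 20% relative drop
    ```

    A model would be evaluated once on the original samples and once on the perturbed copies; a large relative drop signals poor robustness to that transformation level.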

