    Chen Xuanting, Ye Junjie, Zu Can, Xu Nuo, Gui Tao, Zhang Qi. Robustness of GPT Models on Natural Language Processing Tasks[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202330801

    Robustness of GPT Models on Natural Language Processing Tasks

    GPT models have demonstrated impressive performance on various natural language processing (NLP) tasks. However, their robustness and their ability to handle the complexities of the open world have yet to be well explored; this is especially crucial for assessing model stability and is a key aspect of trustworthy AI. In this study, we perform a comprehensive experimental analysis of the GPT-3 and GPT-3.5 series models, exploring their performance and robustness on 15 datasets (about 147K original test samples) with 61 robustness-probing transformations from TextFlint, covering 9 popular NLP tasks. We also analyze the models' robustness across transformation levels: character, word, and sentence. Our findings reveal that while GPT models achieve competitive performance on tasks such as sentiment analysis, semantic matching, and reading comprehension, they struggle with information extraction; for instance, they are severely confused by relation extraction and even exhibit "hallucination" phenomena. Moreover, their robustness degrades significantly across both tasks and transformations, especially on classification tasks and under sentence-level transformations. We further validate the impact of the quantity and form of demonstrations on performance and robustness. These findings show that GPT models are still not fully proficient at common NLP tasks, and they highlight the difficulty of addressing robustness challenges by improving model performance or altering prompt content. By comparing the performance and robustness of an updated gpt-3.5-turbo, gpt-4, Llama2-7b, and Llama2-13b, we further corroborate the experimental findings. Future studies on large language models should strive to enhance their capacities in information extraction and semantic understanding, while simultaneously bolstering overall robustness.
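    To make the character-level probing concrete, the sketch below shows one common transformation of that kind: swapping adjacent characters inside words to simulate typos while keeping the text human-readable. This is an illustrative stand-in written from scratch, not TextFlint's actual implementation or API; the function name, the `rate` parameter, and the keep-first-and-last-characters heuristic are all assumptions for the example.

    ```python
    import random

    def char_swap(text: str, rate: float = 0.1, seed: int = 0) -> str:
        """Swap one pair of adjacent interior characters in some words.

        A minimal character-level robustness probe (illustrative only):
        each word longer than 3 characters is perturbed with probability
        `rate`, leaving its first and last characters untouched so the
        word stays recognizable to a human reader.
        """
        rng = random.Random(seed)  # fixed seed for reproducible probes
        out = []
        for w in text.split(" "):
            if len(w) > 3 and rng.random() < rate:
                # pick an interior position and swap it with its neighbor
                i = rng.randrange(1, len(w) - 2)
                w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
            out.append(w)
        return " ".join(out)
    ```

    A robustness evaluation then compares task accuracy on the original test set against accuracy on the transformed copy; the drop between the two is the degradation the abstract reports.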