
Robustness of GPT Large Language Models on Natural Language Processing Tasks

Chen Xuanting, Ye Junjie, Zu Can, Xu Nuo, Gui Tao, Zhang Qi

Citation: Chen Xuanting, Ye Junjie, Zu Can, Xu Nuo, Gui Tao, Zhang Qi. Robustness of GPT Large Language Models on Natural Language Processing Tasks[J]. Journal of Computer Research and Development, 2024, 61(5): 1128-1142. DOI: 10.7544/issn1000-1239.202330801. CSTR: 32373.14.issn1000-1239.202330801


    Corresponding author: Gui Tao (tgui@fudan.edu.cn)

  • CLC number: TP391


More Information
    Author Bio:

    Chen Xuanting: born in 1999. Master candidate. Her main research interests include natural language processing and robust models

    Ye Junjie: born in 2001. PhD candidate. His main research interest is natural language processing

    Zu Can: born in 2000. Master candidate. Her main research interests include large language models and information extraction

    Xu Nuo: born in 1998. Master candidate. Her main research interests include natural language processing and large language models

    Gui Tao: born in 1989. PhD, associate professor, master supervisor. His main research interests include pre-training models, information extraction, and robust models

    Zhang Qi: born in 1981. PhD, professor, PhD supervisor. His main research interests include natural language processing and information retrieval

  • Abstract:

    The GPT models have demonstrated impressive performance on a variety of natural language processing (NLP) tasks. However, their robustness and their ability to handle the complexities of the open world have yet to be fully explored, which is crucial for assessing the stability of models and is a key aspect of trustworthy AI. In this study, we perform a comprehensive experimental analysis of the GPT-3 and GPT-3.5 series models, exploring their performance and robustness using 15 datasets (about 147,000 original test samples) with 61 robustness-probing transformations from TextFlint, covering 9 popular NLP tasks. Additionally, we analyze the models' robustness across different transformation levels: character, word, and sentence. Our findings reveal that while GPT models exhibit competitive performance on tasks such as sentiment analysis, semantic matching, and reading comprehension, they struggle considerably with information extraction; for instance, they severely confuse the relation types in relation extraction and even exhibit "hallucination" phenomena. The models also suffer significant robustness degradation across both tasks and transformations, most notably on classification tasks and under sentence-level transformations. Furthermore, we validate the impact of the quantity and form of in-context demonstrations on performance and robustness. These findings show that GPT models are still not fully proficient at common NLP tasks, and they highlight the difficulty of addressing robustness problems by improving model performance or altering prompt content. By comparing the performance and robustness of the updated gpt-3.5-turbo, gpt-4, and the open-source models LLaMA2-7B and LLaMA2-13B, we further validate these findings. Future studies on large language models should strive to enhance their capabilities in information extraction and semantic understanding, while simultaneously bolstering overall robustness.
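    The evaluation protocol described above reduces to a simple loop: score each model on an original test set, rescore it on each transformed copy, and average the relative drops. The snippet below is a minimal sketch of that loop, assuming a hypothetical `model_fn` callable and `(text, label)` sample format, with accuracy standing in for the task-specific metrics of Table 1.

```python
from statistics import mean

def evaluate(model_fn, samples):
    """Accuracy of model_fn over (text, label) pairs."""
    return mean(1.0 if model_fn(text) == label else 0.0
                for text, label in samples)

def apdr(model_fn, original, transformed_sets):
    """Average performance drop rate (APDR): the relative performance
    drop from the original test set to each transformed copy, averaged
    over all transformations applied to that dataset."""
    p_ori = evaluate(model_fn, original)
    return mean((p_ori - evaluate(model_fn, t)) / p_ori
                for t in transformed_sets)
```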

  • https://platform.openai.com
    https://platform.openai.com/docs
  • Figure 1.  The evolution of five GPT-3 and GPT-3.5 series models

    Figure 2.  Overview of the experimental evaluation process

    Figure 3.  Performance of GPT-3.5 models and BERT

    Note: “Laptop” and “Restaurant” denote the SemEval2014-Laptop and SemEval2014-Restaurant datasets, respectively.

    Figure 4.  Distribution of prediction errors on the CoNLL2003 dataset

    Figure 5.  Distribution of prediction errors on the TACRED dataset

    Figure 6.  Performance variations of GPT models

    Figure 7.  Variations of the average performance drop rate (APDR) of GPT models
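    Figure 7 and Tables 4–6 report the average performance drop rate (APDR). The formula below is a reconstruction consistent with the ori/trans/APDR columns of those tables, where P is the task metric of Table 1, f the model, D the original test set, and T the set of transformations applied to it; negative APDR values indicate that performance improved under transformation.

```latex
\mathrm{APDR}(f, D) = \frac{1}{|T|} \sum_{t \in T}
    \frac{P(f, D) - P\bigl(f, t(D)\bigr)}{P(f, D)}
```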

    Figure 8.  Performance drop of different models on three transformation categories

    Figure 9.  Original and transformed performance of GPT models in the 0-shot, 1-shot, and 3-shot settings

    Figure 10.  APDR with original and transformed demonstration data
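    Figures 9 and 10 vary the number and the form (original vs. transformed) of in-context demonstrations. As a rough illustration of how such k-shot prompts are assembled, the sketch below uses a hypothetical template; the function name and field labels are assumptions, not the paper's actual prompt format.

```python
def build_prompt(instruction, demos, query):
    """Assemble a k-shot prompt: an instruction, k (input, label)
    demonstrations (original or transformed, cf. Fig. 10), then the
    test input. Hypothetical template for illustration only."""
    parts = [instruction]
    for text, label in demos:
        parts.append(f"Input: {text}\nLabel: {label}")
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)

# 1-shot example:
# build_prompt("Classify the sentiment as positive or negative.",
#              [("The leading actor is good.", "positive")],
#              "The plot was dull.")
```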

    Figure 11.  Performance of GPT and LLaMA2 models

    Note: “Laptop” and “Restaurant” denote the SemEval2014-Laptop and SemEval2014-Restaurant datasets, respectively. Empty bars for the WSJ and TACRED datasets indicate that the model did not complete the specified task on that dataset.

    Table 1.  Information of the 15 Datasets Used in the Experiments

    Task type | Subtask | Dataset | Test size | Metric
    Classification | Aspect-based sentiment analysis (ABSA) | SemEval2014-Laptop[25] | 331 | Accuracy
    Classification | Aspect-based sentiment analysis (ABSA) | SemEval2014-Restaurant[25] | 492 | Accuracy
    Classification | Sentiment analysis (SA) | IMDB[26] | 25000 | Accuracy
    Classification | Natural language inference (NLI) | MNLI-m[27] | 9815 | Accuracy
    Classification | Natural language inference (NLI) | MNLI-mm[27] | 9832 | Accuracy
    Classification | Natural language inference (NLI) | SNLI[27] | 10000 | Accuracy
    Classification | Semantic matching (SM) | QQP[28] | 40430 | Accuracy
    Classification | Semantic matching (SM) | MRPC[29] | 1725 | Accuracy
    Classification | Winograd Schema Challenge (WSC) | WSC273[30] | 570 | Accuracy
    Reading comprehension | Machine reading comprehension (MRC) | SQuAD 1.1[31] | 9868 | F1
    Reading comprehension | Machine reading comprehension (MRC) | SQuAD 2.0[32] | 11491 | F1
    Information extraction | Part-of-speech tagging (POS) | WSJ[33] | 5461 | Accuracy
    Information extraction | Named entity recognition (NER) | CoNLL2003[34] | 3453 | F1
    Information extraction | Named entity recognition (NER) | OntoNotesv5[35] | 4019 | F1
    Information extraction | Relation extraction (RE) | TACRED[36] | 15509 | F1

    Table 2.  Information of the 61 Task-Specific Transformations

    Subtask | Level | Transformations
    Aspect-based sentiment analysis (ABSA) | Sentence | AddDiff, RevNon, RevTgt
    Sentiment analysis (SA) | Word | SwapSpecialEnt-Movie, SwapSpecialEnt-Person
    Sentiment analysis (SA) | Sentence | AddSum-Movie, AddSum-Person, DoubleDenial
    Natural language inference (NLI) | Character | NumWord
    Natural language inference (NLI) | Word | SwapAnt
    Natural language inference (NLI) | Sentence | AddSent
    Semantic matching (SM) | Character | NumWord
    Semantic matching (SM) | Word | SwapAnt
    Winograd Schema Challenge (WSC) | Character | SwapNames
    Winograd Schema Challenge (WSC) | Word | SwapGender
    Winograd Schema Challenge (WSC) | Sentence | AddSentences, InsertRelativeClause, SwitchVoice
    Machine reading comprehension (MRC) | Sentence | AddSentDiverse, ModifyPos, PerturbAnswer, PerturbQuestion-BackTranslation, PerturbQuestion-MLM
    Part-of-speech tagging (POS) | Character | SwapPrefix
    Part-of-speech tagging (POS) | Word | SwapMultiPOSJJ, SwapMultiPOSNN, SwapMultiPOSRB, SwapMultiPOSVB
    Named entity recognition (NER) | Character | EntTypos, OOV
    Named entity recognition (NER) | Word | CrossCategory, SwapLonger
    Named entity recognition (NER) | Sentence | ConcatSent
    Relation extraction (RE) | Word | SwapEnt-LowFreq, SwapEnt-SamEtype
    Relation extraction (RE) | Sentence | InsertClause, SwapTriplePos-Age, SwapTriplePos-Birth, SwapTriplePos-Employee
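    To make the transformation levels concrete, the toy function below perturbs an entity mention in the spirit of the character-level EntTypos transformation listed above. It is an illustrative sketch only, not TextFlint's implementation; the function name and behavior are assumptions.

```python
import random

def ent_typos(text, entity, seed=0):
    """Toy character-level transformation: swap two adjacent characters
    inside an entity mention (cf. EntTypos in Table 2). Illustrative
    only -- not the TextFlint implementation."""
    rng = random.Random(seed)
    if entity not in text or len(entity) < 2:
        return text  # nothing to perturb
    i = rng.randrange(len(entity) - 1)
    typo = entity[:i] + entity[i + 1] + entity[i] + entity[i + 2:]
    return text.replace(entity, typo, 1)

# Example: ent_typos("Shanghai is in the east of China.", "Shanghai")
# might yield "Shangahi is in the east of China."
```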

    Table 3.  Examples of Transformations in Different Categories

    Level | Transformation | Example
    Character | SwapPrefix | Original: That is a prefixed string. → Transformed: That is a preunfixed string.
    Word | DoubleDenial | Original: The leading actor is good. → Transformed: The leading actor is good not bad.
    Sentence | InsertClause | Original: Shanghai is in the east of China. → Transformed: Shanghai which is a municipality of China is in the east of China established in Tiananmen.
    Note: Underlined words indicate text deleted in the transformed data; boldface words indicate text added in the transformed data.

    Table 4.  Robustness Performance of Different Models (%)

    Dataset | gpt-3.5-turbo: ori, trans, APDR | text-davinci-003: ori, trans, APDR | BERT: ori, trans, APDR
    Restaurant 91.43±1.23 66.00±11.28 27.80±2.74 90.14±1.33 52.59±11.21 41.65±4.26 84.38±1.20 53.49±15.07 36.51±18.43
    Laptop 86.67±2.15 59.36±21.97 31.25±23.31 83.30±0.71 54.71±17.75 34.42±19.29 90.48±0.06 49.06±9.03 45.78±9.97
    IMDB 91.60±0.20 90.86±0.50 0.80±0.47 91.74±0.68 91.40±0.58 0.37±0.31 95.24±0.12 94.61±0.80 0.66±0.94
    MNLI-m 73.03±7.44 41.75±17.05 42.27±21.87 67.49±2.80 54.88±20.93 19.52±24.60 86.31±4.50 52.49±2.97 39.10±4.13
    MNLI-mm 72.21±7.69 40.94±19.11 42.71±24.31 66.61±1.57 50.57±20.58 24.46±27.71 84.17±1.09 52.33±5.44 37.87±5.73
    SNLI 73.30±12.50 47.80±8.80 32.99±13.66 70.81±9.24 56.44±22.68 18.99±26.16 90.75±1.52 77.61±18.34 14.44±20.25
    QQP 79.32±5.97 64.96±20.52 17.17±1.18 70.14±12.03 69.27±13.67 −1.08±9.23 91.75±2.60 52.77±5.93 42.56±4.83
    MRPC 80.69±10.28 84.99±10.69 −8.12±22.99 74.87±5.38 74.33±23.12 −0.17±26.51 86.87±6.05 0.00±0.00 100.00±0.00
    WSC273 66.05±1.95 64.12±5.82 2.93±5.57 62.05±0.48 61.42±2.41 1.01±3.12 56.00±0.00 53.61±5.31 4.26±9.49
    SQuAD 1.1 55.33±8.22 44.55±9.73 19.45±12.39 67.18±8.23 61.07±9.04 9.11±7.13 87.22±0.26 70.78±21.84 18.88±24.95
    SQuAD 2.0 55.03±7.39 44.21±9.31 19.62±12.70 65.91±7.81 59.70±8.93 9.45±7.58 78.81±2.65 60.17±16.99 23.48±21.81
    WSJ 75.53±2.28 74.63±2.58 1.21±0.90 97.72±0.09 96.23±1.69 1.53±1.79
    CoNLL2003 44.61±3.48 37.30±9.29 16.31±20.05 51.54±2.88 42.64±9.24 17.13±17.76 90.57±0.38 72.24±16.75 20.26±18.42
    OntoNotesv5 17.74±8.51 18.68±7.00 −12.73±40.09 11.94±9.98 12.30±7.69 −17.51±51.73 79.99±6.54 61.98±20.30 23.47±20.45
    TACRED 31.44±31.24 32.64±33.27 0.58±7.88 35.67±30.89 38.67±31.59 −25.69±55.14 77.99±13.47 65.53±15.46 16.54±7.83
    Note: Numbers after “±” are the standard deviations of the corresponding means; “Laptop” and “Restaurant” denote the SemEval2014-Laptop and SemEval2014-Restaurant datasets; a missing entry (“−”) indicates that the model did not complete the specified task.
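    As a quick sanity check on how the ori, trans, and APDR columns relate, the relative drop computed from the Restaurant row means for gpt-3.5-turbo nearly reproduces the reported APDR; the match is only approximate because the table averages per-transformation drop rates rather than computing APDR from the row means.

```python
ori, trans = 91.43, 66.00  # gpt-3.5-turbo on Restaurant (Table 4)
print(f"{(ori - trans) / ori * 100:.2f}%")  # 27.81%; Table 4 reports 27.80±2.74
```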

    Table 5.  Robustness Performance of Three GPT Models (%)

    Dataset | gpt-3.5-turbo-0301: ori, trans, APDR | gpt-3.5-turbo-0613: ori, trans, APDR | gpt-4: ori, trans, APDR
    Restaurant 91.43±1.23 66.00±11.28 27.80±2.74 97.05±0.86 59.98±16.37 38.28±16.56 95.81±2.27 71.07±9.15 25.80±9.69
    Laptop 86.67±2.15 59.36±21.97 31.25±23.31 93.91±1.45 63.82±19.10 32.16±19.83 98.74±1.88 74.42±16.01 24.75±15.42
    IMDB 91.60±0.20 90.86±0.50 0.80±0.47 96.58±1.05 95.99±1.63 0.62±0.90 93.81±3.69 91.91±5.31 2.05±3.83
    MNLI-m 73.03±7.44 41.75±17.05 42.27±21.87 71.88±7.99 35.30±16.00 51.85±20.03 84.24±7.00 53.46±10.50 36.81±9.04
    MNLI-mm 72.21±7.69 40.94±19.11 42.71±24.31 71.78±7.68 35.59±15.45 50.28±22.50 80.23±8.14 53.88±14.19 33.28±14.43
    SNLI 73.30±12.50 47.80±8.80 32.99±13.66 75.67±15.70 38.58±11.11 47.61±16.40 89.10±5.64 70.65±21.60 21.25±21.31
    QQP 79.32±5.97 64.96±20.52 17.17±1.18 81.42±8.49 49.71±16.16 38.22±22.66 53.14±19.48 84.91±15.74 −105.86±159.05
    MRPC 80.69±10.28 84.99±10.69 −8.12±22.99 85.70±11.16 70.65±16.74 14.29±30.49 60.38±7.06 94.65±4.68 −58.46±18.46
    WSC273 66.05±1.95 64.12±5.82 2.93±5.57 53.98±0.75 51.92±3.13 3.80±6.10 77.88±6.12 64.42±23.57 16.91±30.39
    SQuAD1.1 55.33±8.22 44.55±9.73 19.45±12.39 90.11±1.09 80.84±8.65 10.27±9.70 95.14±1.74 84.96±13.75 10.69±14.41
    SQuAD2.0 55.03±7.39 44.21±9.31 19.62±12.70 73.68±4.61 64.25±10.76 12.85±13.16 81.94±3.17 74.15±7.17 9.50±8.02
    WSJ 50.35±5.22 49.31±5.61 2.07±4.52 68.66±3.03 67.88±5.58 1.10±7.39
    CoNLL2003 44.61±3.48 37.30±9.29 16.31±20.05 66.78±2.98 49.76±11.69 25.38±17.69 83.23±1.86 65.53±13.86 21.25±16.66
    OntoNotesv5 17.74±8.51 18.68±7.00 −12.73±40.09 9.85±6.53 13.50±4.13 −66.86±72.42 7.58±15.72 6.70±10.70 10.87±15.47
    TACRED 31.44±31.24 32.64±33.27 0.58±7.88 37.00±35.29 40.23±34.38 −20.07±36.33 14.32±7.57 13.31±9.17 −0.02±74.59
    Note: Numbers after “±” are the standard deviations of the corresponding means; “Laptop” and “Restaurant” denote the SemEval2014-Laptop and SemEval2014-Restaurant datasets; a missing entry (“−”) indicates that the model did not complete the specified task.

    Table 6.  Robustness Performance of LLaMA2 Models (%)

    Dataset | LLaMA2-7B: ori, trans, APDR | LLaMA2-13B: ori, trans, APDR
    Restaurant 87.85±1.68 52.38±7.01 40.34±8.22 87.10±3.17 35.16±9.07 59.84±9.45
    Laptop 79.40±2.93 56.23±12.68 28.96±16.86 81.15±2.82 47.21±18.58 41.87±22.81
    IMDB 92.04±1.68 91.06±2.68 1.08±1.43 88.17±2.30 87.40±2.89 0.88±1.21
    MNLI-m 46.76±16.03 27.64±13.39 34.77±34.65 54.47±15.15 44.70±18.95 12.52±43.92
    MNLI-mm 50.16±17.23 27.92±13.99 39.21±32.29 57.04±15.11 45.47±19.30 15.94±42.02
    SNLI 47.77±19.73 30.73±17.44 27.79±41.43 54.79±15.20 43.75±24.22 12.83±53.93
    QQP 59.93±16.77 33.18±11.02 40.58±24.61 54.49±12.91 40.17±14.45 21.36±32.47
    MRPC 70.66±14.76 66.49±16.68 1.92±33.62 69.59±17.74 33.75±32.70 43.09±63.48
    WSC273 52.40±3.60 53.10±1.68 −1.65±7.48 52.57±0.73 56.43±2.77 −7.33±4.58
    SQuAD1.1 79.64±0.69 67.85±9.98 14.80±12.51 71.27±1.16 63.67±5.14 10.65±7.12
    SQuAD2.0 78.25±0.95 66.30±9.66 15.26±12.36 69.40±1.27 61.77±5.05 10.99±7.20
    WSJ
    CoNLL2003 20.05±8.92 4.44±5.36 74.37±36.93 45.66±10.22 20.26±10.27 53.47±26.94
    OntoNotesv5 4.97±2.57 4.94±2.03 −19.85±76.91 5.87±5.21 5.36±3.34 −8.23±51.59
    TACRED 4.26±2.60 5.95±5.45 −16.67±104.08
    Note: Numbers after “±” are the standard deviations of the corresponding means; “Laptop” and “Restaurant” denote the SemEval2014-Laptop and SemEval2014-Restaurant datasets; a missing entry (“−”) indicates that the model did not complete the specified task.
  • [1]

    Wei J, Bosma M, Zhao V Y, et al. Finetuned language models are zero-shot learners[J]. arXiv preprint, arXiv: 2109.01652, 2021

    [2]

    Brown T B, Mann B, Ryder N, et al. Language models are few-shot learners[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

    [3]

    Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023

    [4]

    Anil R, Dai A M, Firat O, et al. PaLM 2 technical report[J]. arXiv preprint, arXiv: 2305.10403, 2023

    [5]

    Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models[J]. arXiv preprint, arXiv: 2001.08361, 2020

    [6]

    Qin Chengwei, Zhang A, Zhang Zhuosheng, et al. Is ChatGPT a general-purpose natural language processing task solver?[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 1339–1384

    [7]

    Bang Y, Cahyawijaya S, Lee N, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity[J]. arXiv preprint, arXiv: 2302.04023, 2023

    [8]

    Kosinski M. Theory of mind may have spontaneously emerged in large language models[J]. arXiv preprint, arXiv: 2302.02083, 2023

    [9]

    Frieder S, Pinchetti L, Griffiths R R, et al. Mathematical capabilities of ChatGPT[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://neurips.cc/virtual/2023/poster/73421

    [10]

    Wang Jindong, Hu Xixu, Hou Wenxin, et al. On the robustness of ChatGPT: An adversarial and out-of-distribution perspective[J]. arXiv preprint, arXiv: 2302.12095, 2023

    [11]

    Wang Boxin, Xu Chejian, Wang Shuohang, et al. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models[J]. arXiv preprint, arXiv: 2111.02840, 2021

    [12]

    Nie Y, Williams A, Dinan E, et al. Adversarial NLI: A new benchmark for natural language understanding[C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 4885−4901

    [13]

    Fansi T A, Goel R, Wen Zhi, et al. DDXPlus: A new dataset for automatic medical diagnosis[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper_files/paper/2022/hash/cae73a974390c0edd95ae7aeae09139c-Abstract-Datasets_and_Benchmarks.html

    [14]

    Zhu Kaijie, Wang Jindong, Zhou Jiaheng, et al. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts[J]. arXiv preprint, arXiv: 2306.04528, 2023

    [15]

    Wang Xiao, Liu Qin, Gui Tao, et al. TextFlint: Unified multilingual robustness evaluation toolkit for natural language processing[C]//Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing: System Demonstrations. Stroudsburg, PA: ACL, 2021: 347−355

    [16]

    Dong Qingxiu, Li Lei, Dai Damai, et al. A survey for in-context learning[J]. arXiv preprint, arXiv: 2301.00234, 2022

    [17]

    Zhuo T Y, Huang Yujin, Chen Chunyang, et al. Exploring AI ethics of ChatGPT: A diagnostic analysis[J]. arXiv preprint, arXiv: 2301.12867, 2023

    [18]

    Choi J H, Hickman K E, Monahan A B, et al. ChatGPT goes to law school[J]. Journal of Legal Education, 2021, 71(3): 387

    [19]

    Khalil M, Er E. Will ChatGPT get you caught? Rethinking of plagiarism detection[C]//Proc of Int Conf on Human-Computer Interaction. Berlin: Springer, 2023: 475−487

    [20]

    Alshater M. Exploring the role of artificial intelligence in enhancing academic performance: A case study of ChatGPT[J/OL]. [2023-09-12]. http://dx.doi.org/10.2139/ssrn.4312358

    [21]

    Tabone W, De Winter J. Using ChatGPT for human–computer interaction research: A primer[J]. Royal Society Open Science, 2023, 10(9): 21

    [22]

    Jeblick K, Schachtner B, Dexl J, et al. ChatGPT makes medicine easy to swallow: An exploratory case study on simplified radiology reports[J]. European Radiology, 2023: 1−9

    [23]

    Biswas S. ChatGPT and the future of medical writing[J]. Radiology, 2023, 307(2): e223312 doi: 10.1148/radiol.223312

    [24]

    Goodfellow I J, Shlens J, Szegedy C. Explaining and harnessing adversarial examples[J]. arXiv preprint, arXiv: 1412.6572, 2014

    [25]

    Pontiki M, Galanis D, Pavlopoulos J, et al. SemEval-2014 Task 4: Aspect based sentiment analysis[C]//Proc of the 8th Int Workshop on Semantic Evaluation. Stroudsburg, PA: ACL, 2014: 27−35

    [26]

    Maas A L, Daly R E, Pham P T, et al. Learning word vectors for sentiment analysis[C]// Proc of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2011: 142−150

    [27]

    Williams A, Nangia N, Bowman S R. A broad-coverage challenge corpus for sentence understanding through inference[J]. arXiv preprint, arXiv: 1704.05426, 2017

    [28]

    Dolan W B, Brockett C. Automatically constructing a corpus of sentential paraphrases[C]//Proc of the 3rd Int Workshop on Paraphrasing (IWP2005). Jeju Island: Asia Federation of Natural Language Processing, 2005: 9−16

    [29]

    Wang Zhiguo, Hamza W, Florian R. Bilateral multi-perspective matching for natural language sentences[C]//Proc of the 26th Int Joint Conf on Artificial Intelligence. Australia: IJCAI.org, 2017: 4144−4150

    [30]

    Levesque H, Davis E, Morgenstern L. The Winograd schema challenge[C]//Proc of the 13th Int Conf on the Principles of Knowledge Representation and Reasoning. Palo Alto, CA: AAAI, 2012: 552−561

    [31]

    Rajpurkar P, Zhang Jian, Lopyrev K, et al. SQuAD: 100,000+ questions for machine comprehension of text[C]//Proc of the 2016 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2016: 2383−2392

    [32]

    Rajpurkar P, Jia R, Liang P. Know what you don’t know: Unanswerable questions for SQuAD[C]//Proc of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg, PA: ACL, 2018: 784−789

    [33]

    Marcus M, Santorini B, Marcinkiewicz M A. Building a large annotated corpus of English: The Penn Treebank[J]. Computational Linguistics, 1993, 19(2): 313−330

    [34]

    Sang E T K, De Meulder F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition[C]//Proc of the 7th Conf on Natural Language Learning at HLT-NAACL 2003. Stroudsburg, PA: ACL, 2003: 142−147

    [35]

    Weischedel R, Palmer M, Marcus M, et al. OntoNotes release 5.0 LDC2013T19[J]. Linguistic Data Consortium, 2013, 23(1): 170

    [36]

    Zhang Yuhao, Zhong V, Chen Danqi, et al. Position-aware attention and supervised data improve slot filling[C]//Proc of Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2017: 35−45

    [37]

    Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback[C/OL]//Advances in Neural Information Processing Systems. [2023-09-10]. https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html

    [38]

    Christiano P F, Leike J, Brown T, et al. Deep reinforcement learning from human preferences[C/OL]// Advances in Neural Information Processing Systems. [2023-09-10]. https://papers.nips.cc/paper_files/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html

    [39]

    Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional Transformers for language understanding[C]//Proc of NAACL-HLT. Stroudsburg, PA: ACL, 2019: 4171−4186

    [40]

    Chen Lingjiao, Zaharia M, Zou J. How is ChatGPT’s behavior changing over time?[J]. arXiv preprint, arXiv: 2307.09009, 2023

    [41]

    Tu Shangqing, Li Chunyang, Yu Jifan, et al. ChatLog: Recording and analyzing ChatGPT across time[J]. arXiv preprint, arXiv: 2304.14106, 2023


Publication history
  • Received:  2023-10-09
  • Revised:  2024-03-11
  • Available online:  2024-03-11
  • Published:  2024-05-13
