

    Towards Trustworthy Evaluation of Intelligent Vulnerability Detection Models under Pairwise Prediction

      Abstract: Graph neural networks (GNNs) and code pre-trained models (CodePTMs) have been widely applied to vulnerability detection and have achieved state-of-the-art (SOTA) performance on vulnerability benchmark datasets. However, such datasets (denoted as regular datasets) commonly collect vulnerable and non-vulnerable samples from different sources, resulting in significantly divergent code characteristics. Consequently, existing learning models tend to capture spurious features specific to certain datasets rather than understanding the actual root causes of vulnerabilities. A robust and reliable vulnerability detection model should be able to make pairwise predictions, that is, recognize vulnerable code as vulnerable while judging the corresponding fixed version as non-vulnerable. However, existing learning models fall significantly short of this capability, and it has not been explored extensively. This study aims to bridge that gap. Specifically, we first construct a large-scale real-world dataset containing vulnerable code and the corresponding fixed versions. This ensures that positive and negative samples are derived from the same source and thus form similar distributions in terms of code characteristics. We then evaluate eight mainstream SOTA vulnerability detection models in a rigorous setting to see whether they can distinguish vulnerable code from its fixed versions (i.e., paired vulnerability instances). Our results reveal that all models perform poorly at making pairwise predictions when trained on regular datasets. We also find that retraining with these paired instances improves the models' performance by 17% on average. Furthermore, we highlight the limitations of existing CodePTMs in aggregating local code information (i.e., key statements concerning vulnerable semantics). Driven by these findings, we devise a new approach; evaluation results demonstrate that it outperforms the SOTA model UniXcoder by 58.47% in making pairwise predictions.
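To make the pairwise-prediction criterion concrete, below is a minimal illustrative sketch (not the paper's code; all names and the toy "model" are assumptions). A pair counts as correct only if the model flags the vulnerable sample and clears its fixed version, so a classifier keying on a spurious surface token can score well on samples in isolation yet fail on pairs:

```python
# Illustrative sketch of "pairwise prediction" accuracy (hypothetical names,
# not from the paper). predict: code string -> 1 (vulnerable) / 0 (clean).

def pairwise_accuracy(pairs, predict):
    """pairs: list of (vulnerable_code, fixed_code) tuples.

    A pair is correct only if the vulnerable sample is flagged AND the
    fixed version is cleared -- both predictions must be right.
    """
    if not pairs:
        return 0.0
    correct = sum(1 for vuln, fixed in pairs
                  if predict(vuln) == 1 and predict(fixed) == 0)
    return correct / len(pairs)

# Toy classifier relying on a spurious token, as regular-dataset training
# can induce: it labels any code containing "memcpy" as vulnerable.
spurious = lambda code: int("memcpy" in code)

pairs = [
    # Fix keeps the same API call, so the spurious feature cannot separate them.
    ("memcpy(dst, src, user_len);", "memcpy(dst, src, min(user_len, dst_sz));"),
    ("strcpy(buf, input);",          "strncpy(buf, input, sizeof(buf) - 1);"),
]
print(pairwise_accuracy(pairs, spurious))  # -> 0.0: the spurious feature fails on every pair
```

A model that has learned the actual root cause (here, the unchecked length) would instead separate each pair and score 1.0 on the same data.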

       
