Abstract:
Graph neural networks (GNNs) and code pre-trained models (CodePTMs) have been widely applied to vulnerability detection, achieving state-of-the-art (SOTA) performance on vulnerability benchmark datasets. However, such datasets (denoted as regular datasets) commonly collect vulnerable and non-vulnerable samples from different sources, resulting in significantly divergent code characteristics. Consequently, existing learning models tend to capture spurious features specific to certain datasets rather than understanding the actual root causes of vulnerabilities. A robust and reliable vulnerability detection model should be able to make pairwise predictions, that is, to recognize vulnerable code as vulnerable while recognizing its corresponding fixed version as non-vulnerable. However, whether existing learning models possess this capability is a significant question that has not been explored extensively. This study aims to bridge this gap. Specifically, we first construct a large-scale real-world dataset containing vulnerable code and its corresponding fixed versions. This ensures that positive and negative samples are derived from the same source and thus share similar distributions of code characteristics. We then evaluate eight mainstream SOTA vulnerability detection models in this rigorous setting to determine whether they can distinguish vulnerable code from its fixed version (i.e., paired vulnerability instances). Our results reveal that all models perform poorly at making pairwise predictions when trained on regular datasets. We also find that retraining with these paired instances improves the models' performance by 17% on average. Furthermore, we highlight the limitations of existing CodePTMs in aggregating local code information (i.e., key statements concerning vulnerable semantics).
Driven by these findings, we devise a new approach, and the evaluation results demonstrate that it outperforms the SOTA model UniXcoder by 58.47% in making pairwise predictions.