A Vulnerability Detection Approach Based on Comparative Learning

陈小全; 刘剑; 夏翔宇; 周绍翔

doi:10.7544/issn1000-1239.202220140


Abstract

Source code vulnerability detection based on deep learning is an efficient form of vulnerability analysis, but it faces two challenges: obtaining a sufficiently large dataset and finding an effective learning approach. This paper addresses both. First, a multi-vulnerability dataset of 280,793 samples covering 150 CWE vulnerability types is constructed from the SARD dataset. Second, a deep learning approach based on comparative learning is proposed. Its core idea is to construct, for every sample in the training set, one set of samples of the same type and one set of samples of different types, so that the model learns by comparison. Trained on data built this way, the model can learn the many subtle features shared by samples of the same type and also extract the features that most strongly distinguish samples of different types. In experiments, a model trained with the constructed dataset and the proposed approach identifies 150 CWE vulnerability types with an accuracy of 92.0%, an average PR value of 0.85 (0.84 in the journal's English abstract), and an average ROC-AUC of 0.96. The paper also analyzes the code symbolization technique that is widely used in deep learning-based vulnerability analysis. The experiments show that whether or not the code is symbolized during training does not affect the model's vulnerability identification accuracy.
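The core idea in the abstract, pairing every training sample with a same-type set and a different-type set, can be illustrated with a minimal sketch. The abstract does not state how large the sets are, how their members are chosen, or how the model consumes them, so everything below is an assumption made for illustration: the function name `build_comparison_sets`, the parameters `k_same` and `k_diff`, uniform random sampling, and the toy code snippets and their CWE labels.

```python
import random
from collections import defaultdict


def build_comparison_sets(samples, k_same=2, k_diff=2, seed=0):
    """For each (code, label) sample, pick a same-type and a different-type set.

    Hypothetical helper: set sizes and uniform sampling are assumptions,
    not details taken from the paper.
    """
    rng = random.Random(seed)

    # Index sample positions by their vulnerability type.
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(samples):
        by_label[label].append(idx)

    result = []
    for idx, (_, label) in enumerate(samples):
        # Same-type candidates: share the anchor's label, exclude the anchor itself.
        same_pool = [i for i in by_label[label] if i != idx]
        # Different-type candidates: every sample with another label.
        diff_pool = [i for i, (_, other) in enumerate(samples) if other != label]
        result.append({
            "anchor": idx,
            "label": label,
            "same": rng.sample(same_pool, min(k_same, len(same_pool))),
            "diff": rng.sample(diff_pool, min(k_diff, len(diff_pool))),
        })
    return result


# Toy data for illustration only; these are not samples from the SARD-based dataset.
toy = [
    ("strcpy(buf, src);", "CWE-121"),
    ("memcpy(buf, src, n);", "CWE-121"),
    ("free(p); free(p);", "CWE-415"),
    ("free(q); use(q);", "CWE-416"),
]
sets = build_comparison_sets(toy, k_same=1, k_diff=2)
for entry in sets:
    print(entry)
```

A type with only one sample, such as the single `CWE-415` entry here, ends up with an empty same-type set. A real pipeline would need a policy for such rare types; the abstract does not say how the authors handle them.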
