Abstract:
Source code vulnerability detection based on deep learning is currently a highly efficient approach to vulnerability analysis, but it faces two challenges: the need for large datasets and the need for an effective learning approach. This paper addresses both. First, we construct a multi-vulnerability dataset of 280,793 samples, covering 150 CWE vulnerabilities, based on the SARD dataset. Second, we propose a deep learning approach based on comparative learning. Its core idea is to construct, for each sample in the training set, a set of samples of the same type and a set of samples of different types, so that the model learns by comparison. Trained on data organized in this way, the model can learn the more subtle features shared by samples of the same type and can also extract highly distinguishable features that separate samples of different types. In our experiments, the model trained on this dataset with the proposed learning approach identifies the 150 CWE vulnerabilities with an accuracy of 92.0%, an average PR value of 0.84, and an average ROC-AUC value of 0.96. We also analyze and discuss code symbolization, a technique commonly used in deep learning-based vulnerability analysis. The experiments show that whether or not the code is symbolized during training does not affect the vulnerability identification accuracy of the deep learning model.
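The core idea of pairing each training sample with a same-type set and a different-type set can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function name `build_comparison_sets`, the set sizes `k_same` and `k_diff`, the random sampling strategy, and the toy samples are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def build_comparison_sets(samples, k_same=2, k_diff=2, seed=0):
    """For each (code, label) sample, pick a same-type and a different-type set.

    Hypothetical helper: the set sizes and uniform random sampling are
    illustrative assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)

    # Group sample indices by their label (e.g. a CWE identifier).
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(samples):
        by_label[label].append(idx)

    result = []
    for idx, (_, label) in enumerate(samples):
        # Candidates sharing the anchor's label, excluding the anchor itself.
        same_pool = [i for i in by_label[label] if i != idx]
        # Candidates carrying any other label.
        diff_pool = [i for i, (_, other) in enumerate(samples) if other != label]

        same = rng.sample(same_pool, min(k_same, len(same_pool)))
        diff = rng.sample(diff_pool, min(k_diff, len(diff_pool)))
        result.append({"anchor": idx, "same": same, "different": diff})
    return result


# Toy data: placeholder code strings with illustrative CWE labels.
toy_samples = [
    ("char buf[8]; strcpy(buf, src);", "CWE-121"),
    ("char dst[4]; memcpy(dst, src, n);", "CWE-121"),
    ("int total = a + b;", "CWE-190"),
    ("int size = count * width;", "CWE-190"),
    ("free(p); free(p);", "CWE-415"),
]

comparison_sets = build_comparison_sets(toy_samples)
for entry in comparison_sets:
    print(entry)
```

Each entry groups one anchor sample with indices of same-label and different-label samples. How a model consumes these groups during training (for example, through a pairwise or triplet-style objective) is a separate design choice that this sketch does not cover.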