A Vulnerability Detection Approach Based on Comparative Learning

Chen Xiaoquan; Liu Jian; Xia Xiangyu; Zhou Shaoxiang

Journal of Computer Research and Development, 2023, 60(9): 2152-2168. DOI: 10.7544/issn1000-1239.202220140. CSTR: 32373.14.issn1000-1239.202220140

Citation: Chen Xiaoquan, Liu Jian, Xia Xiangyu, Zhou Shaoxiang. A Vulnerability Detection Approach Based on Comparative Learning[J]. Journal of Computer Research and Development, 2023, 60(9): 2152-2168. DOI: 10.7544/issn1000-1239.202220140

Chen Xiaoquan^{1, 2},
Liu Jian^{2, 3},
Xia Xiangyu^{1},
Zhou Shaoxiang^{1}

1. Department of Information, Beijing City University, Beijing 100191
2. CAS Key Laboratory of Network Assessment Technology (Institute of Information Engineering, Chinese Academy of Sciences), Beijing 100093
3. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049

Funds: This work was supported by the Open Project of Key Laboratory of Network Assessment Technology (Institute of Information Engineering, Chinese Academy of Sciences) (KFKT2022-005) and the Strategic Priority Research Program of Chinese Academy of Sciences (XDC02040100).

Author Bio:
Chen Xiaoquan: born in 1976. PhD, senior engineer. His main research interests include software security, vulnerability analysis, and deep learning.

Liu Jian: born in 1976. PhD, associate professor, PhD supervisor. Senior member of CCF. His main research interests include software security and vulnerability analysis.

Xia Xiangyu: born in 2001. Undergraduate. His main research interests include vulnerability analysis and software security.

Zhou Shaoxiang: born in 1999. Undergraduate. His main research interests include vulnerability analysis and software security.

Received Date: January 28, 2022
Revised Date: October 23, 2022
Available Online: April 17, 2023

Abstract

Source code vulnerability detection based on deep learning is currently a highly efficient approach to vulnerability analysis, but it faces two challenges: the need for large datasets and the need for an effective learning approach. This paper addresses both. First, a multi-vulnerability dataset of 280,793 samples covering 150 CWE vulnerability types is constructed from the SARD dataset. Second, a deep learning approach based on comparative learning is proposed. Its core idea is to construct, for each sample in the training set, a set of samples of the same type and a set of samples of different types, so that the model is trained in a comparative learning setting. With training data built this way, the model can learn many of the subtler features shared by samples of the same type while also extracting features that clearly distinguish samples of different types. Experiments show that a model trained on this dataset with the proposed approach identifies the 150 CWE vulnerability types with an accuracy of 92.0%, an average PR value of 0.84, and an average ROC-AUC value of 0.96. The paper also analyzes and discusses code symbolization, a technique commonly used in deep learning-based vulnerability analysis. Experiments show that whether or not the code is symbolized during training does not affect the model's vulnerability identification accuracy.
Keywords: vulnerability detection, comparative learning, deep learning, unbalanced data, model checking
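
The core idea stated in the abstract, pairing every training sample with a set of same-type samples and a set of different-type samples, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `build_comparison_sets`, the parameters `k_same` and `k_diff`, the random sampling strategy, and the toy data are all assumptions made for illustration, since the abstract does not say how the sets are selected or how they are fed to the model.

```python
import random
from collections import defaultdict


def build_comparison_sets(samples, k_same=2, k_diff=2, seed=0):
    """For each (code, label) sample, draw up to k_same other samples with the
    same label and up to k_diff samples with a different label.

    Hypothetical helper: random selection is an assumption, not the paper's rule.
    """
    rng = random.Random(seed)

    # Group sample indices by their label (e.g. a CWE identifier).
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(samples):
        by_label[label].append(idx)

    result = []
    for idx, (_, label) in enumerate(samples):
        # Same-type pool: every other sample that shares the anchor's label.
        same_pool = [i for i in by_label[label] if i != idx]
        # Different-type pool: every sample whose label differs from the anchor's.
        diff_pool = [
            i
            for other, members in by_label.items()
            if other != label
            for i in members
        ]
        result.append(
            {
                "anchor": idx,
                "same": rng.sample(same_pool, min(k_same, len(same_pool))),
                "different": rng.sample(diff_pool, min(k_diff, len(diff_pool))),
            }
        )
    return result


# Toy data: placeholder code strings with illustrative labels (not from the paper).
toy_samples = [
    ("char buf[8]; strcpy(buf, src);", "CWE-121"),
    ("char dst[4]; memcpy(dst, src, n);", "CWE-121"),
    ("int total = a + b;", "CWE-190"),
    ("short s = (short)(x * y);", "CWE-190"),
    ("if (n < sizeof(buf)) memcpy(buf, src, n);", "non-vulnerable"),
    ("size_t len = strnlen(src, MAX);", "non-vulnerable"),
]

comparison_sets = build_comparison_sets(toy_samples)
for entry in comparison_sets:
    print(entry)
```

Each resulting entry could then be turned into training pairs or triplets; how the paper actually uses the two sets during training is described in the full text, not in the abstract.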
