Citation: Guo Hongjing, Tao Chuanqi, Huang Zhiqiu. Surprise Adequacy-Guided Deep Neural Network Test Inputs Generation[J]. Journal of Computer Research and Development, 2024, 61(4): 1003-1017. DOI: 10.7544/issn1000-1239.202220745
Due to the complexity and uncertainty of deep neural network (DNN) models, generating test inputs that comprehensively exercise both the general and the corner-case behaviors of DNN models is of great significance for ensuring model quality. Current research primarily focuses on designing coverage criteria and applying fuzz testing techniques to generate test inputs and thereby improve test adequacy. However, few studies consider the diversity and the individual fault-revealing ability of test inputs. Surprise adequacy quantifies the difference in neuron activations between a test input and the training set. It is an important measure of test adequacy, but it has not yet been leveraged for test input generation. Therefore, we propose a surprise adequacy-guided test input generation approach. First, the approach selects the important neurons that contribute most to the model's decision-making; the activation values of these neurons are used as features to improve the surprise adequacy metric. Then, seed test inputs with error-revealing capability are selected based on the improved surprise adequacy measurements. Finally, the approach applies the idea of coverage-guided fuzz testing to jointly optimize the surprise adequacy value of test inputs and the prediction probability differences among classes; the gradient ascent algorithm is adopted to compute perturbations and iteratively generate test inputs. Empirical studies on 5 DNN models covering 4 image datasets demonstrate that the improved surprise adequacy metric effectively captures surprising test inputs and reduces computation time. Concerning test input generation, compared with DeepGini and RobOT, the follow-up test sets generated with the proposed seed input selection strategy achieve surprise coverage improvements of up to 5.9% and 15.9%, respectively. Compared with DLFuzz and DeepXplore, the proposed approach achieves surprise coverage improvements of up to 26.5% and 33.7%, respectively.
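The abstract's core metric, likelihood-based surprise adequacy, scores a test input by how rare its neuron activations are relative to the training set's activations, here restricted to a subset of important neurons. The following is a minimal NumPy sketch of that idea under stated assumptions: the function name, the fixed Gaussian kernel bandwidth, and the plain kernel density estimate are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def likelihood_sa(train_acts, test_act, important_idx, bandwidth=1.0):
    """Sketch of likelihood-based surprise adequacy over selected neurons.

    train_acts: (n_train, n_neurons) activations collected on the training set
    test_act:   (n_neurons,) activations of one test input
    important_idx: indices of the important neurons used as features
    """
    # Restrict both the training traces and the test trace to the
    # important neurons (the paper's feature-selection step).
    train = train_acts[:, important_idx]          # (n_train, k)
    test = test_act[important_idx]                # (k,)
    k = len(important_idx)
    # Gaussian kernel density estimate of the test activation under
    # the training activation distribution, with a fixed bandwidth.
    sq_dists = np.sum((train - test) ** 2, axis=1) / (2.0 * bandwidth ** 2)
    density = np.mean(np.exp(-sq_dists)) / ((2.0 * np.pi * bandwidth ** 2) ** (k / 2.0))
    # Lower density means a more surprising input, so the score is the
    # negative log-likelihood (clamped to avoid log(0)).
    return -np.log(density + 1e-30)
```

In the generation loop the abstract describes, a score like this would be combined with the prediction probability differences among classes into one objective, and the input would be perturbed along the gradient of that objective; the sketch above covers only the measurement side.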
[1] The New York Times. After fatal Uber crash, a self-driving start-up moves forward[EB/OL]. [2022-06-10]. https://www.nytimes.com/2018/05/07/technology/uber-crash-autonomous-driveai.html
[2] Zhang Jie, Harman M, Ma Lei, et al. Machine learning testing: Survey, landscapes and horizons[J]. IEEE Transactions on Software Engineering, 2022, 48(1): 1−36
[3] Huang Xiaowei, Kroening D, Ruan Wenjie, et al. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability[J]. Computer Science Review, 2020, 37: 100270
[4] Wang Zan, Yan Ming, Liu Shuang, et al. Survey on testing of deep neural networks[J]. Journal of Software, 2020, 31(5): 1255−1275 (in Chinese)
[5] Xie Xiaofei, Ma Lei, Juefei-Xu F, et al. DeepHunter: A coverage guided fuzz testing framework for deep neural networks[C] //Proc of the 28th ACM SIGSOFT Int Symp on Software Testing and Analysis. New York: ACM, 2019: 146−157
[6] Dai Hepeng, Sun Chang'ai, Jin Hui, et al. State-of-the-art survey of fuzzing for deep learning systems[J]. Journal of Software, 2023, 34(11): 5008−5028 (in Chinese)
[7] Pei Kexin, Cao Yinzhi, Yang Junfeng, et al. DeepXplore: Automated whitebox testing of deep learning systems[C] //Proc of the 26th Symp on Operating Systems Principles. New York: ACM, 2017: 1−18
[8] Guo Jianmin, Jiang Yu, Zhao Yue, et al. DLFuzz: Differential fuzzing testing of deep learning systems[C] //Proc of the 26th ACM Joint Meeting on European Software Engineering Conf and Symp on the Foundations of Software Engineering. New York: ACM, 2018: 739−743
[9] Kim J, Feldt R, Yoo S. Guiding deep learning system testing using surprise adequacy[C] //Proc of the 41st Int Conf on Software Engineering. Piscataway, NJ: IEEE, 2019: 1039−1049
[10] Kim J, Ju J, Feldt R, et al. Reducing DNN labelling cost using surprise adequacy: An industrial case study for autonomous driving[C] //Proc of the 28th ACM Joint Meeting on European Software Engineering Conf and Symp on the Foundations of Software Engineering. New York: ACM, 2020: 1466−1476
[11] Kim S, Yoo S. Evaluating surprise adequacy for question answering[C] //Proc of the 42nd Int Conf on Software Engineering Workshops. New York: ACM, 2020: 197−202
[12] Weiss M, Chakraborty R, Tonella P. A review and refinement of surprise adequacy[C] //Proc of the 3rd IEEE/ACM Int Workshop on Deep Learning for Testing and Testing for Deep Learning. Piscataway, NJ: IEEE, 2021: 17−24
[13] Gerasimou S, Eniser H F, Sen A, et al. Importance-driven deep learning system testing[C] //Proc of the 42nd ACM/IEEE Int Conf on Software Engineering. Piscataway, NJ: IEEE, 2020: 702−713
[14] Xie Xiaofei, Li Tianlin, Wang Jian, et al. NPC: Neuron path coverage via characterizing decision logic of deep neural networks[J]. ACM Transactions on Software Engineering and Methodology, 2022, 31(3): 47:1−47:27
[15] Ma Lei, Juefei-Xu F, Zhang Fuyuan, et al. DeepGauge: Multi-granularity testing criteria for deep learning systems[C] //Proc of the 33rd ACM/IEEE Int Conf on Automated Software Engineering. New York: ACM, 2018: 120−131
[16] Feng Yang, Shi Qingkai, Gao Xinyu, et al. DeepGini: Prioritizing massive tests to enhance the robustness of deep neural networks[C] //Proc of the 29th ACM SIGSOFT Int Symp on Software Testing and Analysis. New York: ACM, 2020: 177−188
[17] Wang Jingyi, Chen Jialuo, Sun Youcheng, et al. RobOT: Robustness-oriented testing for deep learning systems[C] //Proc of the 43rd Int Conf on Software Engineering. Piscataway, NJ: IEEE, 2021: 300−311
[18] Wang Dong, Wang Ziyuan, Fang Chunrong, et al. DeepPath: Path-driven testing criteria for deep neural networks[C] //Proc of the 1st IEEE Int Conf on Artificial Intelligence Testing. Piscataway, NJ: IEEE, 2019: 119−120
[19] Ma Lei, Juefei-Xu F, Xue Minhui, et al. DeepCT: Tomographic combinatorial testing for deep learning systems[C] //Proc of the 26th Int Conf on Software Analysis, Evolution and Reengineering. Piscataway, NJ: IEEE, 2019: 614−618
[20] Du Xiaoning, Xie Xiaofei, Li Yi, et al. DeepStellar: Model-based quantitative analysis of stateful deep learning systems[C] //Proc of the 27th ACM Joint Meeting on European Software Engineering Conf and Symp on the Foundations of Software Engineering. New York: ACM, 2019: 477−487
[21] Li Duo, Dong Chaoqun, Si Pinchao, et al. Survey of research on neural network verification and testing technology[J]. Computer Engineering and Applications, 2021, 57(22): 53−67 (in Chinese)
[22] Bach S, Binder A, Montavon G, et al. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation[J]. PLoS ONE, 2015, 10(7): 1−46
[23] Ji Shouling, Li Jinfeng, Du Tianyu, et al. Survey on techniques, applications and security of machine learning interpretability[J]. Journal of Computer Research and Development, 2019, 56(10): 2071−2096 (in Chinese)
[24] Mu Yanzhou, Wang Zan, Chen Xiang, et al. A deep learning test optimization method using multi-objective optimization[J]. Journal of Software, 2022, 33(7): 2499−2524 (in Chinese)
[25] LeCun Y, Cortes C. The MNIST database of handwritten digits[EB/OL]. [2022-06-10]. http://yann.lecun.com/exdb/mnist/
[26] Krizhevsky A, Nair V, Hinton G. The CIFAR-10 dataset[EB/OL]. [2022-06-10]. http://www.cs.toronto.edu/~kriz/cifar.html
[27] Xiao Han, Rasul K, Vollgraf R. Fashion-MNIST: A dataset of Zalando's article images[EB/OL]. [2022-06-10]. https://github.com/zalandoresearch/fashion-mnist
[28] Udacity. Dataset wiki[EB/OL]. [2022-06-10]. https://github.com/udacity/self-driving-car/tree/master/datasets
[29] Alber M, Lapuschkin S, Seegerer P, et al. iNNvestigate neural networks![J]. Journal of Machine Learning Research, 2019, 20(93): 1−8
[30] Zhou Zhiyang, Dou Wensheng, Liu Jie, et al. DeepCon: Contribution coverage testing for deep learning systems[C] //Proc of the 28th IEEE Int Conf on Software Analysis, Evolution and Reengineering. Piscataway, NJ: IEEE, 2021: 189−200
[31] Goodfellow I, Shlens J, Szegedy C. Explaining and harnessing adversarial examples[J]. arXiv preprint, arXiv:1412.6572, 2015
[32] Carlini N, Wagner D. Towards evaluating the robustness of neural networks[C] //Proc of the 38th IEEE Symp on Security and Privacy. Piscataway, NJ: IEEE, 2017: 39−57
[33] Lee S, Cha S, Lee D, et al. Effective white-box testing of deep neural networks with adaptive neuron-selection strategy[C] //Proc of the 29th ACM SIGSOFT Int Symp on Software Testing and Analysis. New York: ACM, 2020: 165−176
[34] Zhang Pengcheng, Ren Bin, Dong Hai, et al. CAGFuzz: Coverage-guided adversarial generative fuzzing testing for image-based deep learning systems[J]. IEEE Transactions on Software Engineering, 2022, 48(11): 4630−4646
[35] Shen Weijun, Li Yanhui, Chen Lin, et al. Multiple-boundary clustering and prioritization to promote neural network retraining[C] //Proc of the 35th IEEE/ACM Int Conf on Automated Software Engineering. Piscataway, NJ: IEEE, 2020: 410−422