Citation: Yang Guang, Zhou Yu, Chen Xiang, Zhang Xiangyu. CodeScore-R: An Automated Robustness Metric for Assessing the Functional Correctness of Code Synthesis[J]. Journal of Computer Research and Development, 2024, 61(2): 291-306. DOI: 10.7544/issn1000-1239.202330715
Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics fall into three types: match-based, semantic-based, and execution-based. Among them, the execution-based Pass@k metric accurately assesses the functionality of predicted code by executing test cases, but computing it incurs significant overhead, which motivates an automated evaluation metric that can assess the functionality of predicted code without test cases. In addition, a good evaluation metric should be robust; that is, it should remain accurate even when the predicted code undergoes minor changes. To address these challenges, we propose CodeScore-R, an automated and robust metric based on UniXcoder and contrastive learning, for evaluating the functional correctness of code synthesis. CodeScore-R employs sketch-based processing, syntactically equivalent transformations, and mutation testing to mitigate the interference of identifiers, syntactic structures, and operators on evaluation results. Experimental results on code generation and code migration tasks in Java and Python show that CodeScore-R outperforms other evaluation metrics, aligns more closely with Pass@k, and exhibits stronger robustness.
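For reference, the execution-based Pass@k metric discussed above is commonly computed with the unbiased estimator introduced in the Codex evaluation work of Chen et al.; the minimal Python sketch below (function name and sample counts are illustrative, not taken from this paper) estimates Pass@k from n generated programs for a problem, of which c pass all test cases.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased Pass@k estimator: probability that at least one of k
    # samples drawn from n generated programs (c of which pass all
    # test cases) is functionally correct.
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Illustrative example: 200 samples per problem, 12 of them pass the tests
print(round(pass_at_k(n=200, c=12, k=10), 4))

The estimate is averaged over all problems in a benchmark; CodeScore-R aims to approximate this execution-based signal without running any test cases.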