CodeScore-R: An Automated Robustness Metric for Assessing the Functional Correctness of Code Synthesis

Yang Guang (杨光); Zhou Yu (周宇); Chen Xiang (陈翔); Zhang Xiangyu (张翔宇)

doi:10.7544/issn1000-1239.202330715

Abstract

Evaluation metrics are crucial in code synthesis. Commonly used code evaluation metrics fall into three types: match-based, semantic-based, and execution-based. Among them, the execution-based Pass@k metric accurately determines the functional correctness of predicted code by executing test cases. However, computing this metric incurs substantial overhead, so there is a pressing need for an automated evaluation metric that can assess the functional correctness of predicted code without test cases. In addition, a good evaluation metric should be robust; that is, it should remain accurate when the predicted code undergoes minor changes. To this end, we propose CodeScore-R, an automated robust metric based on UniXcoder and contrastive learning, for evaluating the functional correctness of code synthesis. CodeScore-R employs sketch-based processing, syntactically equivalent transformations, and mutation testing to effectively mitigate the interference of identifiers, syntactic structures, and operators on evaluation results. Experimental results show that, on code generation and code migration tasks in Java and Python, CodeScore-R outperforms other evaluation metrics that do not require test cases, aligns more closely with the Pass@k metric, and exhibits stronger robustness.
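For readers unfamiliar with the execution-based baseline named in the abstract, Pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n candidate programs are sampled per problem and c of them pass all test cases. The sketch below illustrates that estimator only. It is not taken from this paper and is not CodeScore-R itself. The function name `pass_at_k` and its parameters are illustrative assumptions, and the abstract does not say which Pass@k variant the authors use.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem (illustrative helper, not from the paper).

    n: number of candidate programs sampled for the problem
    c: number of those candidates that pass every test case
    k: number of candidates the metric is allowed to draw
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k candidates fail, any draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one passing candidate.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 samples, 3 of which pass all tests.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

Obtaining c requires running every candidate against a test suite, which is the overhead the abstract cites as motivation for a metric that needs no test cases.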
