Abstract:
Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics fall into three categories: match-based, semantic-based, and execution-based. Among them, the execution-based Pass@k metric accurately assesses the functionality of predicted code by executing test cases. However, computing this metric incurs substantial overhead, which motivates the design of an automated evaluation metric that can assess the functionality of predicted code without requiring test cases. In addition, a good evaluation metric should be robust, that is, it should maintain its accuracy even when the predicted code undergoes minor changes. To address these challenges, we propose CodeScore-R, an automated and robust metric based on UniXcoder and contrastive learning, for evaluating the functionality of code synthesis. CodeScore-R employs sketch-based processing, syntactic-equivalent transformations, and mutation testing to effectively mitigate the interference of identifiers, syntactic structures, and operators on evaluation results. Experimental results demonstrate that, on code generation and code migration tasks in Java and Python, CodeScore-R outperforms other evaluation metrics: it aligns more closely with Pass@k and exhibits stronger robustness.
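For context, Pass@k is typically computed with the standard unbiased estimator used in the code-generation literature; this formulation is given here only for reference and is not defined in the abstract itself. For each problem, n candidate programs are sampled, of which c pass all test cases, and the scores are averaged over problems:

\[
\text{Pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]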