ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2019, Vol. 56 ›› Issue (2): 394-409. doi: 10.7544/issn1000-1239.2019.20180103

• Software Technology •

A Semantic Similarity Integration Method for Software Feature Location Problem

He Yun1, Li Tong1,2, Wang Wei1,2, Li Xiang1, Lan Wei1

  1. College of Software, Yunnan University, Kunming 650091; 2. Key Laboratory for Software Engineering of Yunnan Province (Yunnan University), Kunming 650091 (hey@ynu.edu.cn)
  • Published: 2019-02-01
  • Supported by:
    the National Natural Science Foundation of China (61462092, 61379032, 61662085); the Key Program of the Natural Science Foundation of Yunnan Province (2015FA014); the Data-Driven Software Engineering Innovation Team Project of Yunnan Province (2017HC012); the Postgraduate Research Innovation Fund of Yunnan University (YDY17094)

Abstract: A feature is an executable functional entity of a software system that is defined by its requirements. The process of identifying the mapping between software features and source code is called feature location. Feature location based on information retrieval (IR) is widely used in software maintenance, code search, and other fields because of its high usability and low overhead. All IR-based feature location methods rest on semantic similarity calculation, which currently has two main problems: 1) the large amount of noise in source code data interferes with the similarity calculation; 2) the limitations of individual index methods make the similarity results inaccurate. To address these two problems, a semantic similarity integration method for the software feature location problem is proposed. The method introduces part-of-speech (POS) filtering into preprocessing, which filters noise out of the source code and improves the accuracy of the similarity calculation. It then integrates different index methods to calculate similarity, guided by the structural characteristics of the source code data. In experiments on public datasets, compared with existing methods, POS filtering and similarity integration improve mean reciprocal rank by 30.88% and 10.28% on average, respectively, which verifies the effectiveness of the proposed method.

Key words: feature location, information retrieval, semantic similarity, POS filtering, index method, integration
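
The abstract reports its gains in terms of mean reciprocal rank (MRR). As a reminder of what that metric measures, the following is a minimal Python sketch of the standard definition: for each query (here, each feature), take the reciprocal of the rank at which the first relevant source code element appears, then average over all queries. The function name and the convention of passing `None` for a query with no relevant hit are assumptions made for illustration and are not taken from the paper.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Average 1/rank over all queries.

    Each entry is the 1-based rank of the first relevant result for one
    query, or None if no relevant result was retrieved (contributes 0).
    """
    if not first_hit_ranks:
        raise ValueError("at least one query is required")
    total = 0.0
    for rank in first_hit_ranks:
        if rank is None:
            continue  # no relevant hit: reciprocal rank counts as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(first_hit_ranks)


if __name__ == "__main__":
    # Three hypothetical queries whose first relevant hits are at ranks 1, 2, and 4.
    print(mean_reciprocal_rank([1, 2, 4]))
```

A higher MRR means the first relevant code element tends to appear nearer the top of the ranked list, which is why it is a natural measure for feature location.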

CLC Number: