ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (9): 1987-2000. doi: 10.7544/issn1000-1239.2020.20190179

• Artificial Intelligence •



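  1. National Engineering Research Center for Software Engineering, Peking University, Beijing 100871
  2. School of Software and Microelectronics, Peking University, Beijing 100871
  • Published: 2020-09-01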

Keyword-Based Source Code Summarization

Zhang Shikun1, Xie Rui1,2, Ye Wei1, Chen Long1,2   

  1. National Engineering Research Center for Software Engineering, Peking University, Beijing 100871
  2. School of Software and Microelectronics, Peking University, Beijing 100871
  • Online: 2020-09-01

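Abstract (translated from the Chinese): A code summary is a short natural-language description of a piece of source code. Code summarization automatically generates such summaries to help developers understand program code, and it has important practical value in many software development activities. Code summarization combines two tasks, machine translation and text summarization; its main challenges are how to model code better and how to better select the key information in code. Inspired by how humans write summaries and by related research, we propose a keyword-based code summarization method, KBCoS (keyword-based source code summarization). The method treats function signatures and API (application programming interface) calls as keywords and uses the keyword sequence to optimize the weight distribution in the decoder's attention mechanism, so that the model concentrates more on the important information in the code when generating a summary. In addition, to overcome the problem of an excessively large vocabulary of code tokens, we propose a partial splitting algorithm: when a token is not in the vocabulary, it is split into a sequence of sub-tokens according to common naming conventions. The algorithm is simple and effective, and it balances the trade-off between the length of the code token sequence and the number of out-of-vocabulary tokens. We choose a sequence-to-sequence model with attention as the baseline and evaluate on a public Java code summarization dataset. Experiments show that both the keyword-based attention mechanism and the partial splitting algorithm improve the baseline on all three metrics: BLEU-4, METEOR, and ROUGE-L. Consistent results are also obtained on another Python dataset. Finally, combining KBCoS with an existing model yields the best result to date on the Java dataset, which shows that KBCoS can also improve other existing models. Both the evaluation results and the heat maps of attention weights demonstrate the effectiveness of KBCoS.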

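Key words (translated from the Chinese): code summarization, out-of-vocabulary words, attention mechanism, keywords, encoder-decoder, sequence-to-sequence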

Abstract: A code summary is a brief natural-language description of a piece of source code. Code summarization aims to assist program comprehension by generating such documentation automatically, and it has potential applications in many software engineering activities. The task is challenging because it resembles both machine translation and text summarization: the difficulty lies in how to model code, which is highly structured and has an unbounded token vocabulary, and how to select the key information in a long code token sequence. Inspired by how humans write summaries and by related work, we propose a model called KBCoS (keyword-based source code summarization), which uses method signatures and API calls as keywords so that the model focuses more on the key information in the source code at each decoding step. In addition, to address the out-of-vocabulary (OOV) problem, we propose an algorithm called partial splitting, which splits a token into sub-tokens only when it is out of vocabulary. The algorithm is simple and effective, and it mitigates the conflict between the length of the code token sequence and the number of OOV tokens. We use an attention-based sequence-to-sequence model as the baseline and evaluate our approach on a public dataset of Java methods with corresponding API call sequences and summaries. The results show that both the keyword-based attention mechanism and partial splitting improve the baseline in terms of BLEU-4, METEOR, and ROUGE-L. Similar results are obtained on another Python dataset. Furthermore, when KBCoS is combined with TL-CodeSum, one of the state-of-the-art models for code summarization, it achieves the state-of-the-art result on the Java dataset, which indicates that our approach can help improve other models as well. Both the experimental results and the heat maps of attention weights demonstrate the effectiveness of KBCoS.

Key words: code summarization, out-of-vocabulary (OOV), attention mechanism, keyword, encoder-decoder, sequence to sequence (Seq2Seq)
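
The partial splitting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `<unk>` placeholder, the lowercasing of sub-tokens, and the specific camelCase/snake_case splitting rule are all assumptions made for illustration, and the paper's exact naming rules may differ.

```python
import re

# Assumed naming rules (hypothetical): split on camelCase / PascalCase
# boundaries, underscores, and digit runs. Acronyms such as "HTTP" stay whole.
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(token):
    """Split one identifier into lowercase sub-tokens (assumed convention)."""
    parts = _SUBTOKEN.findall(token)
    # Tokens with no letters or digits (e.g. operators) are returned unchanged.
    return [p.lower() for p in parts] or [token]


def partial_split(tokens, vocab, unk="<unk>"):
    """Keep in-vocabulary tokens whole; split only out-of-vocabulary tokens."""
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)  # in vocabulary: the sequence does not grow
            continue
        for sub in split_identifier(tok):
            # A sub-token that is still unknown falls back to the placeholder.
            out.append(sub if sub in vocab else unk)
    return out


vocab = {"return", "get", "user", "name", "file", "(", ")"}
print(partial_split(["return", "getUserName", "(", ")"], vocab))
```

Because only out-of-vocabulary tokens are split, the sequence grows just where splitting actually reduces the number of unknown tokens, which is the trade-off the abstract refers to.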