Abstract:
A source code summary is a brief natural language description of the code. Code summarization aims to assist program comprehension by generating documentation automatically, and it has potential in many software engineering activities. The task is challenging because it resembles both machine translation and text summarization. The difficulty lies in how to model code, which is highly structured and has an unbounded token vocabulary, and how to filter the key information out of a long code token sequence. Inspired by how humans write summaries and by related work, we propose a novel model called KBCoS (keyword-based source code summarization), which uses method signatures and API calls as keywords so that the model focuses more on the key information in the source code at each decoding step when generating summaries. In addition, to address the out-of-vocabulary (OOV) problem, we propose an algorithm called partial splitting, which splits a token into sub-tokens only when the token is out of vocabulary. The algorithm is simple and effective, and it mitigates the conflict between the length of the code token sequence and the number of OOV tokens. We use an attention-based sequence-to-sequence model as the baseline and evaluate our approach on a public dataset of Java methods with corresponding API call sequences and summaries. The results show that both the keyword-based attention mechanism and partial splitting improve on the baseline in terms of BLEU-4, METEOR and ROUGE-L. Similar results are obtained on another Python dataset. Furthermore, when KBCoS is combined with TL-CodeSum, one of the state-of-the-art models for code summarization, the combination achieves the state-of-the-art result on this dataset, which indicates that our approach can help improve other models as well. Both the experimental results and the heat maps of attention weights demonstrate the effectiveness of the proposed model.
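
The idea of partial splitting can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `split_identifier` and `partial_split`, the camelCase/underscore splitting rule, and the lowercasing of sub-tokens are all assumptions made for illustration. The abstract states only that a token is split into sub-tokens when it is out of vocabulary.

```python
import re

# Assumed sub-token pattern (not from the paper): acronym runs,
# capitalized or lowercase words, and digit runs.
_SUBTOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(token):
    """Split a camelCase or snake_case identifier into lowercase sub-tokens."""
    parts = _SUBTOKEN.findall(token)
    # Tokens with no letters or digits (e.g. "(") are returned unchanged.
    return [p.lower() for p in parts] or [token]


def partial_split(tokens, vocab):
    """Keep in-vocabulary tokens whole; split only out-of-vocabulary tokens."""
    result = []
    for token in tokens:
        if token in vocab:
            result.append(token)  # in vocabulary: sequence length unchanged
        else:
            result.extend(split_identifier(token))  # OOV: fall back to sub-tokens
    return result


# Hypothetical vocabulary and token sequence, for illustration only.
vocab = {"public", "String", "(", ")", "get", "file", "name"}
tokens = ["public", "String", "getFileName", "(", ")"]
print(partial_split(tokens, vocab))
# ['public', 'String', 'get', 'file', 'name', '(', ')']
```

Splitting every token would lengthen the sequence, while splitting none would leave more OOV tokens; splitting only on a vocabulary miss trades between the two. In this sketch, a sub-token that is itself out of vocabulary is left as-is and would still need to be handled downstream, for example by mapping it to an unknown-token symbol.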