Chen Haoling, Yu Huiqun, Fan Guisheng, Li Mingchen, Huang Zijie. Class Summarization Generation Technology Based on Hierarchical Representation and Context Enhancement[J]. Journal of Computer Research and Development, 2024, 61(2): 307-323. DOI: 10.7544/issn1000-1239.202330730

Class Summarization Generation Technology Based on Hierarchical Representation and Context Enhancement

Funds: This work was supported by the National Natural Science Foundation of China (62372174, 62276097), the Research Program of the National Engineering Laboratory for Big Data Distribution and Exchange Technologies, and the Shanghai Municipal Special Fund Program for Promoting High-Quality Development (2021-GYHLW-01007).
  • Author Bio:

    Chen Haoling: born in 1999. Master candidate. Student member of CCF. Her main research interests include automatic code summarization and program comprehension

    Yu Huiqun: born in 1967. PhD, professor, PhD supervisor. Senior member of CCF. His main research interests include software engineering, trusted computing, cloud computing, and formal methods

    Fan Guisheng: born in 1980. PhD, associate research fellow, PhD supervisor. Member of CCF. His main research interests include software engineering, service computing, and software architecture analysis techniques

    Li Mingchen: born in 1998. PhD candidate. His main research interests include automatic code summarization and program comprehension

    Huang Zijie: born in 1994. PhD candidate. Student member of CCF. His main research interests include code smells, software quality assurance, program comprehension, and empirical software engineering

  • Received Date: September 10, 2023
  • Revised Date: December 04, 2023
  • Available Online: December 20, 2023
  • Abstract: Code summarization produces a natural-language description of source code, and high-quality code summaries help developers understand programs more efficiently. In recent years, research on code summarization has focused on generating summaries for method-level code snippets. However, in an object-oriented language such as Java, the class is the basic programming unit, and a class is typically much longer and structurally richer than a single method, so method-level techniques do not apply to classes directly. To address these problems, we propose a class summarization generation method based on hierarchical representation and context enhancement, called HRCE, and construct a class summarization dataset containing 358,992 <Java class, content, summary> triples. HRCE first applies a code simplification strategy that removes the non-critical code of a class to shorten its length. It then models the class hierarchy, encoding the class signature, attributes, and methods separately, to capture both the semantic information and the hierarchical structure of the class. In addition, HRCE selects the parent class’s signature and summary to describe the context that the class depends on within the project. Experiments show that a class summarization model based on hierarchical representation and context enhancement can characterize the semantics and hierarchical structure of code and draw on information from both inside and outside the target class. As a result, HRCE outperforms all baseline models on evaluation metrics such as BLEU, METEOR, and ROUGE-L.
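
    To make the hierarchical representation concrete, the sketch below parses a Java class and separates the three levels the abstract names: the class signature (including the parent class that supplies context), the attributes, and the method signatures with bodies dropped as a stand-in for the code simplification step. It is a minimal illustration built on the open-source javalang parser; the helper extract_hierarchy and the toy Counter class are assumptions for this example, not the authors' implementation.

    # Minimal sketch of hierarchical class decomposition (assumed example,
    # not the authors' implementation). Requires: pip install javalang
    import javalang

    JAVA_CLASS = """
    public class Counter extends BaseCounter {
        private int count;
        public void increment() { count = count + 1; }
        public int value() { return count; }
    }
    """

    def extract_hierarchy(source):
        """Split a Java class into signature, attributes, and method signatures."""
        tree = javalang.parse.parse(source)
        cls = tree.types[0]  # first top-level type declaration

        # Level 1: class signature, including the parent class that HRCE
        # uses as a context anchor within the project.
        signature = cls.name
        if cls.extends is not None:
            signature += " extends " + cls.extends.name

        # Level 2: attributes, kept as (name, declared type) pairs.
        attributes = [(decl.name, field.type.name)
                      for field in cls.fields
                      for decl in field.declarators]

        # Level 3: method signatures only; bodies are discarded, mimicking
        # the removal of non-critical code to shorten the input sequence.
        methods = [(m.name, [p.type.name for p in m.parameters])
                   for m in cls.methods]
        return signature, attributes, methods

    print(extract_hierarchy(JAVA_CLASS))
    # -> ('Counter extends BaseCounter', [('count', 'int')],
    #     [('increment', []), ('value', [])])

    Encoding each of these levels separately, rather than flattening the whole class into a single token sequence, is what allows a model to retain structural information after the simplification step.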

  • [1]
    Xia Xin, Bao Lingfeng, Lo D, et al. Measuring program comprehension: A large-scale field study with professionals[J]. IEEE Transactions on Software Engineering, 2017, 44(10): 951−976
    [2]
    Haiduc S, Aponte J, Moreno L, et al. On the use of automated text summarization techniques for summarizing source code[C] //Proc of the 17th Working Conf on Reverse Engineering. Piscataway, NJ: IEEE, 2010: 35−44
    [3]
    Haiduc S, Aponte J, Marcus A, et al. Supporting program comprehension with source code summarization[C] //Proc of the 32nd ACM/IEEE Int Conf on Software Engineering. New York: ACM, 2010: 223−226
    [4]
    Eddy B P, Robinson J A, Kraft N A, et al. Evaluating source code summarization techniques: Replication and expansion[C] //Proc of the 21st Int Conf on Program Comprehension. Piscataway, NJ: IEEE, 2013: 13−22
    [5]
    McBurney P W, Liu C, McMillan C, et al. Improving topic model source code summarization[C] //Proc of the 22nd Int Conf on Program Comprehension. New York: ACM, 2014: 291−294
    [6]
    Movshovitz-Attias D, Cohen W. Natural language models for predicting programming comments[C] //Proc of the 51st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2013: 35−40
    [7]
    Wong E, Yang Jinqiu, Tan Lin. Autocomment: Mining question and answer sites for automatic comment generation[C] //Proc of the 28th IEEE/ACM Int Conf on Automated Software Engineering. Piscataway, NJ: IEEE, 2013: 562−567
    [8]
    Sridhara G, Hill E, Muppaneni D, et al. Towards automatically generating summary comments for Java methods[C] //Proc of the 25th IEEE/ACM Int Conf on Automated Software Engineering. New York: ACM, 2010: 43−52
    [9]
    Sridhara G, Pollock L, Vijay-Shanker K. Generating parameter comments and integrating with method summaries[C] //Proc of the 19th Int Conf on Program Comprehension. Piscataway, NJ: IEEE, 2011: 71−80
    [10]
    Sridhara G, Pollock L, Vijay-Shanker K. Automatically detecting and describing high level actions within methods[C] //Proc of the 33rd Int Conf on Software Engineering. New York: ACM, 2011: 101−110
    [11]
    Abid N J, Dragan N, Collard M L, et al. Using stereotypes in the automatic generation of natural language summaries for C++ methods[C] //Proc of Int Conf on Software Maintenance and Evolution. Piscataway, NJ: IEEE, 2015: 561−565
    [12]
    Song Xiaotao, Sun Hailong, Wang Xu, et al. A survey of automatic generation of source code comments: Algorithms and techniques[J]. IEEE Access, 2019, 7: 111411−111428 doi: 10.1109/ACCESS.2019.2931579
    [13]
    Moreno L, Aponte J, Sridhara G, et al. Automatic generation of natural language summaries for Java classes[C] //Proc of the 21st Int Conf on Program Comprehension. Piscataway, NJ: IEEE, 2013: 23−32
    [14]
    Li Mingchen, Yu Huiqun, Fan Guisheng, et al. ClassSum: A deep learning model for class-level code summarization[J]. Neural Computing and Applications, 2023, 35(4): 3373−3393 doi: 10.1007/s00521-022-07877-z
    [15]
    Li Zheng, Wu Yonghao, Peng Bin, et al. SeTransformer: A Transformer-based code semantic parser for code comment generation[J]. IEEE Transactions on Reliability, 2022, 72(1): 258−273
    [16]
    Iyer S, Konstas I, Cheung A, et al. Summarizing source code using a neural attention model[C] //Proc of the 54th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2016: 2073−2083
    [17]
    Allamanis M, Peng H, Sutton C. A convolutional attention network for extreme summarization of source code[C] //Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 2091−2100
    [18]
    Hu Xing, Li Ge, Xia Xin, et al. Deep code comment generation[C] //Proc of the 26th Conf on Program Comprehension. New York: ACM, 2018: 200−210
    [19]
    Leclair A, Jiang S, McMillan C. A neural model for generating natural language summaries of program subroutines[C] //Proc of the 41st Int Conf on Software Engineering. New York: ACM, 2019: 795−806
    [20]
    Hu Xing, Li Ge, Xia Xin, et al. Deep code comment generation with hybrid lexical and syntactical information[J]. Empirical Software Engineering, 2019, 25(3): 2179−2217
    [21]
    Zhou Ziyi, Yu Huiqun, Fan Guisheng. Effective approaches to combining lexical and syntactical information for code summarization[J]. Software: Practice and Experience, 2020, 50(12): 2313−2336 doi: 10.1002/spe.2893
    [22]
    Ahmad W U, Chakraborty S, Ray B, et al. A transformer-based approach for source code summarization[J]. arXiv preprint, arXiv:2005.00653, 2020
    [23]
    Alon U, Brody S, Levy O, et al. Code2Seq: Generating sequences from structured representations of code[C/OL] //Proc of the 7th Int Conf on Learning Representations. Amherst, MA: OpenReview.net, 2019 [2023-11-30]. https://openreview.net/forum?id=H1gKYo09tX
    [24]
    Zhou Yu, Shen Juanjuan, Zhang Xiaoqing, et al. Automatic source code summarization with graph attention networks[J]. Journal of Systems and Software, 2022, 188: 111257 doi: 10.1016/j.jss.2022.111257
    [25]
    Zhang Shikun, Xie Rui, Ye Wei, et al. Keyword-based source code summarization[J]. Journal of Computer Research and Development, 2020, 57(9): 1987−2000 (in Chinese) doi: 10.7544/issn1000-1239.2020.20190179
    [26]
    Zhou Ziyi, Yu Huiqun, Fan Guisheng, et al. Summarizing source code with hierarchical code representation[J]. Information and Software Technology, 2022, 143: 106761 doi: 10.1016/j.infsof.2021.106761
    [27]
    Wang Wenhua, Zhang Yuqun, Sui Yulei, et al. Reinforcement-learning-guided source code summarization using hierarchical attention[J]. IEEE Transactions on Software Engineering, 2020, 48(1): 102−119
    [28]
    Lin Chen, Ouyang Zhichao, Zhuang Junqing, et al. Improving code summarization with block-wise abstract syntax tree splitting[C] //Proc of the 29th Int Conf on Program Comprehension. Los Alamitos, CA: IEEE Computer Society, 2021: 184−195
    [29]
    Shi Ensheng, Wang Youlin, Du Lun, et al. CAST: Enhancing code summarization with hierarchical splitting and reconstruction of abstract syntax trees[C] //Proc of the 2021 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2021: 4053−4062
    [30]
    Mcburney P W, McMillan C. Automatic source code summarization of context for Java methods[J]. IEEE Transactions on Software Engineering, 2016, 42(2): 103−119 doi: 10.1109/TSE.2015.2465386
    [31]
    Hill E, Pollock L, Vijay-Shanker K. Automatically capturing source code context of NL-queries for software maintenance and reuse[C] //Proc of the 31st Int Conf on Software Engineering. Piscataway, NJ: IEEE, 2009: 232−242
    [32]
    Yu Xiaohan, Huang Quzhe, Wang Zheng, et al. Towards context-aware code comment generation[C] //Proc of Findings of the Association for Computational Linguistics: EMNLP. Stroudsburg, PA: ACL, 2020: 3938−3947
    [33]
    Wang Yanlin, Shi Ensheng, Du Lun, et al. Cocosum: Contextual code summarization with multi-relational graph neural network[J]. arXiv preprint, arXiv:2107.01933, 2021
    [34]
    Haque S, Leclair A, Wu Lingfei, et al. Improved automatic summarization of subroutines via attention to file context[C] //Proc of the 17th Int Conf on Mining Software Repositories. New York: ACM, 2020: 300−310
    [35]
    Bansal A, Haque S, McMillan C. Project-level encoding for neural source code summarization of subroutines[C] //Proc of the 29th Int Conf on Program Comprehension. Piscataway, NJ: IEEE, 2021: 253−264
    [36]
    Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint, arXiv:1810.04805, 2018
    [37]
    Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9
    [38]
    Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Advances in Neural Information Processing Systems, 2020, 33: 1877−1901
    [39]
    Ouyang Long, Wu J, Jiang Xu, et al. Training language models to follow instructions with human feedback[J]. Advances in Neural Information Processing Systems, 2022, 35: 27730−27744
    [40]
    Feng Zhangyin, Guo Daya, Tang Duyu, et al. CodeBERT: A pre-trained model for programming and natural languages[J]. arXiv preprint, arXiv:2002.08155, 2020
    [41]
    Wang Yue, Wang Weishi, Joty S, et al. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation[J]. arXiv preprint, arXiv:2109.00859, 2021
    [42]
    Guo Daya, Ren Shuo, Lu Shuai, et al. GraphCodeBERT: Pre-training code representations with data flow[J]. arXiv preprint, arXiv:2009.08366, 2020
    [43]
    OpenAI. GPT-4 technical report[J]. arXiv preprint, arXiv:2303.08774, 2023
    [44]
    Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code[J]. arXiv preprint, arXiv:2107.03374, 2021
    [45]
    Roziere B, Gehring J, Gloeckle F, et al. Code Llama: Open foundation models for code[J]. arXiv preprint, arXiv:2308.12950, 2023
    [46]
    Luo Ziyang, Xu Can, Zhao Pu, et al. WizardCoder: Empowering code large language models with Evol-Instruct[J]. arXiv preprint, arXiv:2306.08568, 2023
    [47]
    Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[J]. Advances in Neural Information Processing Systems, 2014, 27: 3104−3112
    [48]
    Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735−1780 doi: 10.1162/neco.1997.9.8.1735
    [49]
    Ge Fan, Kuang Li. Keywords guided method name generation[C] //Proc of the 29th Int Conf on Program Comprehension. Piscataway, NJ: IEEE, 2021: 196−206
    [50]
    Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C] //Proc of the Annual Conf on Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
    [51]
    Papineni K, Roukos S, Ward T, et al. BLEU: A method for automatic evaluation of machine translation[C] //Proc of the 40th Annual Meeting of the Association for Computational Linguistics. New York: ACM, 2002: 311−318
    [52]
    Denkowski M, Lavie A. METEOR universal: Language specific translation evaluation for any target language[C] //Proc of the 9th Workshop on Statistical Machine Translation. Stroudsburg, PA: ACL, 2014: 376−380
    [53]
    Lin C Y. ROUGE: A package for automatic evaluation of summaries[C] //Proc of the Workshop on Text Summarization Branches Out. Stroudsburg, PA: ACL, 2004: 74−81
    [54]
    Kingma D P, Ba J. Adam: A method for stochastic optimization[J]. arXiv preprint, arXiv:1412.6980, 2014
    [55]
    Wu Yonghui, Schuster M, Chen Zhifeng, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation[J]. arXiv preprint, arXiv:1609.08144, 2016
    [56]
    Ji Ziwei, Lee N, Frieske R, et al. Survey of hallucination in natural language generation[J]. ACM Computing Surveys, 2023, 55(12): 1−38
