ISSN 1000-1239 CN 11-1777/TP

Table of Contents

01 March 2020, Volume 57 Issue 3
An Automatic Method Using Hybrid Neural Networks and Attention Mechanism for Software Bug Triaging
Liu Ye, Huang Jinxiao, Ma Yutao
2020, 57(3):  461-473.  doi:10.7544/issn1000-1239.2020.20190606
Software defect repair (also known as software bug fixing) is a necessary part of software quality assurance. In the collective-intelligence-based software development environment on the Internet, improving the efficiency and effectiveness of software bug triaging can help raise bug fixing rates and reduce maintenance costs. Nowadays, automatic bug triaging approaches based on machine learning have become mainstream, but they have specific problems such as hand-crafted features and an insufficient ability to represent texts. Considering the successful application of deep learning in natural language processing, researchers have recently tried to introduce deep learning into automatic bug triaging to significantly improve the performance of predicting the right bug fixer. However, different types of neural networks have their own limitations. To address the problems mentioned above, this study regards bug triaging as a text classification problem and proposes an automatic bug triaging approach based on hybrid neural networks and an attention mechanism, called Atten-CRNN. Because Atten-CRNN combines the advantages of a convolutional neural network, a recurrent neural network, and an attention mechanism, it captures the essential text features and sequence features of bug reports more effectively and thus provides more accurate fixer recommendations for software development and maintenance. An empirical study was conducted on two popular large-scale open-source software projects, Eclipse and Mozilla. The experimental results obtained from over 200 000 bug reports indicate that Atten-CRNN achieves higher prediction accuracy than convolutional neural networks and recurrent neural networks, with or without an attention mechanism.
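A minimal sketch of the attention-weighted pooling that such hybrid text classifiers place on top of a CNN/RNN encoder; the projection `w`, scoring vector `v`, and the tiny dimensions are illustrative assumptions, not Atten-CRNN's actual architecture.

```python
import numpy as np

def attention_pool(h, w, v):
    """Attention-weighted pooling over a sequence of hidden states.

    h: (T, d) hidden states (e.g., from a CNN/RNN encoder)
    w: (d, d) attention projection; v: (d,) scoring vector (both learned)
    Returns a (d,) summary vector emphasizing informative time steps.
    """
    scores = np.tanh(h @ w) @ v           # (T,) unnormalized relevance
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                  # softmax over time steps
    return alpha @ h                      # weighted sum of states

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 4))           # 5 tokens, 4-dim features
w = rng.standard_normal((4, 4))
v = rng.standard_normal(4)
summary = attention_pool(h, w, v)
print(summary.shape)  # (4,)
```

The softmax weights let informative tokens dominate the pooled representation instead of averaging all time steps equally, which is the intuition behind attention improving over plain CNN/RNN pooling.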
Status Prediction for Questions Posted on Technical Forums
Shen Mingzhu, Liu Hui
2020, 57(3):  474-486.  doi:10.7544/issn1000-1239.2020.20190625
When encountering technical problems, developers often post questions on technical forums such as Stack Overflow and wait for satisfactory answers. Q&A forums are also an important manifestation of Internet-based collective-intelligence software development. However, questions posted on forums may never receive satisfactory answers, so posting a question and passively waiting for a solution is not always the best strategy. To this end, we propose a deep-neural-network-based approach to automatically predict whether a question will obtain a satisfactory answer. Knowing in advance whether their questions are likely to be answered effectively, developers can choose the best strategy for solving their technical problems. The approach not only makes full use of the text of the question itself but also exploits information about the asker of the question. With the latest deep learning techniques, it fully exploits the intrinsic relationship between the input features and a question's solving status. Experimental results on a dataset provided by Stack Overflow suggest that the proposed approach can accurately predict the solving status of questions: the precision of predicting well-answered questions is 58.87% and the recall is 46.68% (in contrast, random guessing yields a precision of 38.77% and a recall of 35.26%), better than KNN and FastText.
Collective Intelligence Based Software Engineering
Xu Lixin, Wu Huayao
2020, 57(3):  487-512.  doi:10.7544/issn1000-1239.2020.20190626
Collective intelligence based software engineering (CISE) aims to solve software engineering problems with techniques that exploit collective intelligence, including machine collective intelligence, human collective intelligence, and their combinations. CISE provides a new perspective for solving complex software engineering problems and has become an important part of modern software development. This paper presents a survey of CISE that systematically reviews the application of different collective-intelligence-inspired techniques to problems of software requirements analysis, design, coding, testing, and maintenance. Future research directions and challenges in the CISE area are also discussed. The goal of this study is to establish a uniform framework for CISE and provide references for the interactions and transformations between collective intelligence techniques at different levels.
The Evolution of Software Ecosystem in GitHub
Qi Qing, Cao Jian, Liu Yancen
2020, 57(3):  513-524.  doi:10.7544/issn1000-1239.2020.20190615
Most software projects evolve interdependently; hence the analysis of software ecosystems has attracted the interest of many researchers. In addition to analyzing some well-known software ecosystems, researchers have in recent years also investigated software ecosystems in GitHub and their features. Unfortunately, the fundamental process by which software ecosystems in GitHub evolve has not received wide attention, nor have the reasons why such evolution occurs. In this paper, we conduct an in-depth study of software ecosystem evolution in GitHub. Firstly, we detect evolving ecosystems in GitHub based on a dynamic community detection method. Then, different evolution events in GitHub are identified and compared. Specifically, we draw a graph to visually show the evolutionary processes of software ecosystems that survived from 2015 to 2018. To understand why an ecosystem survives or dissolves, we perform multiple linear regression analysis and identify the factors that correlate significantly with ecosystem survival. Furthermore, we present three case studies to show typical evolution behaviors of software ecosystems in GitHub.
Research Progress on the Development of Microservices
Wu Huayao, Deng Wenjun
2020, 57(3):  525-541.  doi:10.7544/issn1000-1239.2020.20190624
Microservices are the latest, and probably the most popular, technology for realizing the well-known service-oriented architecture (SOA). They have been widely applied in many important industrial applications and have also attracted increasing attention in academia. To aid the effective development of high-quality microservices, in this study we present a systematic review of the microservices literature, focusing on the various software engineering activities in microservice development. Specifically, we collect and analyze the currently available methods, tools, and practices for the requirements analysis, design and implementation, testing, and refactoring of microservices. We also discuss the issues and opportunities for future research in this field.
Coding-Based Performance Improvement of Distributed Machine Learning in Large-Scale Clusters
Wang Yan, Li Nianshuang, Wang Xiling, Zhong Fengyan
2020, 57(3):  542-561.  doi:10.7544/issn1000-1239.2020.20190286
With the growth of models and data sets, running large-scale machine learning algorithms in distributed clusters has become common. This approach divides the whole machine learning algorithm and its training data into several tasks, each running on a different worker node; the results of all tasks are then combined by the master node to obtain the result of the whole algorithm. When a distributed cluster contains a large number of nodes, some worker nodes, called stragglers, will inevitably run slower than others due to resource contention and other reasons, making their task completion times significantly higher than those of other nodes. Compared with running replicated tasks on multiple nodes, coded computing makes efficient use of computation and storage redundancy to alleviate the effect of stragglers and communication bottlenecks in large-scale machine learning clusters. This paper surveys the research progress on using coding techniques to mitigate stragglers and improve the performance of large-scale machine learning clusters. Firstly, we introduce the background of coding techniques and large-scale machine learning clusters. Secondly, we divide the related research into several categories according to application scenario: matrix multiplication, gradient computing, data shuffling, and other applications. Finally, we summarize the difficulties of applying coding techniques in large-scale machine learning clusters and discuss future research trends.
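The straggler-mitigation idea behind coded computing can be sketched with a toy (3,2) linear code for matrix-vector multiplication: the master encodes two row blocks into three tasks, and any two finished tasks suffice to decode, so one straggler can simply be ignored. The encoding and task names below are illustrative, not a specific scheme from the survey.

```python
import numpy as np

def encode_tasks(A):
    """Split A into 2 row blocks and emit 3 linearly coded tasks."""
    A1, A2 = np.split(A, 2)                       # assumes an even row count
    return {"t1": A1, "t2": A2, "t3": A1 + A2}    # simple MDS-like code

def decode(results):
    """Recover A @ x from any 2 of the 3 finished task outputs."""
    if "t1" in results and "t2" in results:
        return np.concatenate([results["t1"], results["t2"]])
    if "t1" in results:   # t2 straggled: its part equals t3 - t1
        return np.concatenate([results["t1"], results["t3"] - results["t1"]])
    return np.concatenate([results["t3"] - results["t2"], results["t2"]])

A = np.arange(8.0).reshape(4, 2)
x = np.array([1.0, -1.0])
tasks = encode_tasks(A)
# pretend worker t2 is a straggler: only t1 and t3 return in time
done = {k: M @ x for k, M in tasks.items() if k != "t2"}
assert np.allclose(decode(done), A @ x)
```

The replication alternative would need a full copy of every block to tolerate one straggler; the code above adds only one extra block's worth of work.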
A Feature Extraction Based Recommender Algorithm Fusing Semantic Analysis
Chen Jiaying, Yu Jiong, Yang Xingyao
2020, 57(3):  562-575.  doi:10.7544/issn1000-1239.2020.20190189
Recommender systems are an effective way to provide personalized recommendations. Most existing recommendation methods lack the power to analyze the inherent characteristics of users and items. To alleviate this problem, a feature-extraction-based recommender algorithm fusing semantic analysis is proposed in this paper, which incorporates a knowledge graph as heterogeneous information to enhance the semantic analysis of collaborative filtering. First, named entity recognition (NER) and entity linking (EL) are used to extract the entities and relations of an item from its unstructured text, and a subgraph is constructed from these identified entities and relations. The subgraph is then embedded into a low-dimensional latent vector space using knowledge graph embedding for easier manipulation. After that, the embedding results are used to represent users and items, and a knowledge-aware collaborative learning framework is designed to learn fine-grained features of users and items. Finally, the embeddings are used to make Top-N recommendations for a target user. Experimental results on two datasets show that the new framework improves recommendation accuracy compared with several state-of-the-art models, meaning it recommends items that better match users' preferences.
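Once user and item embeddings have been learned, the Top-N step reduces to scoring items against the user vector. The sketch below assumes inner-product scoring and made-up embeddings, not the paper's learned representations.

```python
import numpy as np

def top_n(user_vec, item_vecs, n=2):
    """Return the indices of the n best-scoring items for this user."""
    scores = item_vecs @ user_vec                 # inner-product relevance
    return [int(i) for i in np.argsort(-scores)[:n]]

user = np.array([1.0, 0.0, 1.0])
items = np.array([
    [1.0, 0.0, 1.0],   # item 0: aligned with the user  -> score 2.0
    [0.0, 1.0, 0.0],   # item 1: orthogonal             -> score 0.0
    [0.5, 0.0, 0.5],   # item 2: partially aligned      -> score 1.0
])
print(top_n(user, items))  # [0, 2]
```

Any ranking model that maps users and items into the same vector space can reuse this final step; the knowledge-graph machinery in the paper changes how the vectors are learned, not how they are ranked.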
Averaged Weighted Double Deep Q-Network
Wu Jinjin, Liu Quan, Chen Song, Yan Yan
2020, 57(3):  576-589.  doi:10.7544/issn1000-1239.2020.20190159
The instability and variability of deep reinforcement learning algorithms have an important effect on their performance. Deep Q-Network was the first algorithm to successfully combine deep neural networks with Q-learning, and it has been shown to achieve human-level control on problems that require both rich perception of high-dimensional raw inputs and policy control. However, deep Q-Network overestimates action values, and such overestimation can degrade the agent's performance. Although double deep Q-Network was proposed to mitigate the impact of overestimation, it suffers from the opposite problem of underestimating action values. In some complex reinforcement learning environments, even a small estimation error may have a large impact on the learned policy. In this paper, to solve the overestimation of action values in deep Q-Network and their underestimation in double deep Q-Network, a new deep reinforcement learning framework, AWDDQN, is proposed, which integrates the newly proposed weighted double estimator into double deep Q-Network. To reduce the estimation error of the target value, the average of previously learned action-value estimates is used to generate the target value, and the number of averaged action values is determined dynamically based on the temporal difference error. The experimental results show that AWDDQN effectively reduces estimation bias and enhances the agent's performance in some Atari 2600 games.
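A hedged sketch of how an averaged weighted double-Q target could be computed: the online network picks the action, past target-network estimates are averaged, and the two estimators are mixed by a data-dependent weight. The constant `c`, the fixed two-step history, and the exact weighting formula are illustrative readings of the weighted double estimator, not the paper's tuned update.

```python
import numpy as np

def awd_target(r, gamma, q_online, q_history, c=1.0):
    """q_online: current-network values Q(s', .)
    q_history: list of earlier target-network value vectors for s'."""
    q_target = np.mean(q_history, axis=0)   # average of past estimates
    a_star = int(np.argmax(q_online))       # online net selects the action
    a_low = int(np.argmin(q_target))
    delta = abs(q_target[a_star] - q_target[a_low])
    beta = delta / (c + delta)              # weight between the estimators
    mixed = beta * q_online[a_star] + (1 - beta) * q_target[a_star]
    return r + gamma * mixed

q_online = np.array([1.0, 2.0, 0.5])
q_history = [np.array([0.8, 1.6, 0.4]), np.array([1.2, 2.0, 0.6])]
y = awd_target(0.0, 0.9, q_online, q_history)
```

With `beta` between 0 and 1 the mixed estimate always lies between the online value (overestimation-prone) and the averaged target value (underestimation-prone), which is the bias trade-off the abstract describes.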
Graph Embedding Based Session Perception Model for Next-Click Recommendation
Zeng Yifu, Mu Qilin, Zhou Le, Lan Tian, Liu Qiao
2020, 57(3):  590-603.  doi:10.7544/issn1000-1239.2020.20190188
Predicting users’ next click from their historical session records, also known as session-based recommendation, is an important and challenging task that has led to a considerable amount of work. Several significant advances have been made in this area, but some fundamental problems remain open, such as the trade-off between user satisfaction and the predictive accuracy of the models. In this study, we consider how to alleviate user interest drift without sacrificing predictive accuracy. For this purpose, we first set up an item dependency graph to represent the click sequences of items from a global, statistical perspective. Then an efficient graph embedding learning algorithm is proposed to produce item embeddings that preserve the information-flow properties of the system and the structural dependency between each pair of items. Finally, the proposed model captures users’ general interests and their temporal browsing interests simultaneously by using a BiLSTM-based long/short-term memory mechanism. Experimental results on two real-world data sets show that the proposed model not only performs better in terms of predictive accuracy but also demonstrates better diversity and novelty in its recommendations compared with other state-of-the-art methods.
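The global item dependency graph can be sketched as transition counts over consecutive clicks, normalized per source item. This is a simplified reading of the paper's statistical dependency graph, with made-up session data; the real model feeds such a graph into an embedding learner rather than using the weights directly.

```python
from collections import Counter

def build_dependency_graph(sessions):
    """Edge (a, b) weighted by P(next click is b | current click is a)."""
    edges = Counter()
    for s in sessions:
        edges.update(zip(s, s[1:]))       # consecutive-click pairs
    total_out = Counter()
    for (a, _), w in edges.items():
        total_out[a] += w
    # normalize raw counts into per-source transition weights
    return {(a, b): w / total_out[a] for (a, b), w in edges.items()}

sessions = [["x", "y", "z"], ["x", "y"], ["y", "z"]]
g = build_dependency_graph(sessions)
print(g[("x", "y")])  # 1.0: every observed click on x was followed by y
```

Because the graph aggregates all sessions, it captures global co-click statistics that a single user's short session cannot, which is what lets the model separate general interests from the current browsing context.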
Visual Feature Attribution Based on Adversarial Feature Pairs
Zhang Xian, Shi Canghong, Li Xiaojie
2020, 57(3):  604-615.  doi:10.7544/issn1000-1239.2020.20190256
Visualizing the key features of images is an important issue requiring in-depth study in computer vision. Its applications range from weak supervision in object localization to understanding the hidden features of data. On medical and natural image datasets, convolutional-neural-network-based models have become the state of the art for visualizing the regions of the input that are important for model predictions, i.e., for visual explanations. However, their feature localization is not accurate. In view of the limitations of traditional neural network classifiers in localizing the key feature regions of an image, we propose an effective visual feature attribution method based on adversarial feature pairs. In the proposed method, we first construct adversarial pairs of key feature regions as the input to a generative adversarial network (GAN). This makes the generator produce highly corresponding key features, effectively filters out redundant information, and achieves accurate localization. However, a traditional GAN has difficulty producing images similar to real images; therefore, the Wasserstein distance and a gradient penalty are employed to solve this problem and accelerate convergence. Experimental results on synthetic, lung, and heart datasets show that our method produces convincing results in both qualitative and quantitative visual evaluations.
A Location Privacy Protection Scheme Based on Gradual-Sensitivity Indistinguishability
Wang Bin, Zhang Lei, Zhang Guoyin
2020, 57(3):  616-630.  doi:10.7544/issn1000-1239.2020.20190086
When utilizing location-based services while moving, the locations reported by a user exhibit gradually ascending sensitivity as the user approaches the target. From this trend of ascending sensitivity, an adversary can identify the destination of a particular user and even other private information that jeopardizes the user's security. To defend against this type of attack, this paper proposes an ε-sensitive indistinguishability algorithm based on the concepts of generalized differential privacy and the Voronoi diagram. In this algorithm, the current region is divided by a Voronoi diagram to calculate location sensitivity values, and a grid of sensitivity contours is generated; dummies are then added to the current grid cells to achieve ε-sensitive indistinguishability for the users in each cell. As a result, the gradually ascending sensitivity values of any particular user become difficult to identify, and the user's privacy is protected. However, simulation experiments deployed in both Euclidean space and road networks showed that a large number of dummy locations degrades the quality of location service in both execution and computation time, so an improved version of the algorithm based on location shifting is proposed. Finally, through a security analysis of the ε-sensitive indistinguishability model and experimental verification of the two versions of the algorithm, we show that the algorithm is practical for deployment in real environments and provides a better level of location privacy than similar algorithms. Accordingly, it can prevent attacks that exploit the trend of ascending sensitivity and protect the user's privacy during continuous movement.
A Dynamic Taint Analysis Method Based on Maximal Frequent Subgraph Mining
Guo Fangfang, Wang Xinyue, Wang Huiqiang, Lü Hongwu, Hu Yibing, Wu Fang, Feng Guangsheng, Zhao Qian
2020, 57(3):  631-638.  doi:10.7544/issn1000-1239.2020.20180846
Malicious code recognition methods based on traditional dynamic taint analysis suffer from problems such as the huge number of malicious-code behavior dependency graphs (MBDGs) and the long matching time. Based on the common characteristics of each malicious code family, a behavior dependency graph can be represented by a set of common subgraph parts. Therefore, this paper proposes a malicious-code behavior dependency graph mining method based on maximal frequent subgraphs. The method mines, from the behavior dependency graphs of a malicious code family, the maximal frequent subgraphs that represent the most significant common features among the variants of that family. A target behavior dependency graph then only needs to be matched against the mined maximal frequent subgraphs. The method reduces the number of behavior dependency graphs and improves recognition efficiency without losing the characteristics of malicious code behavior. Compared with the traditional dynamic taint analysis method for malicious code recognition, when the minimum support is 0.045, the number of behavior dependency graphs decreases by 82%, recognition efficiency increases by 81.7%, and the accuracy rate is 92.15%.
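As a simplified stand-in for maximal frequent subgraph mining, the sketch below mines the individual edges (behavior dependencies) that reach a minimum support across a family's behavior graphs, then matches a target graph against that signature. Real MBDG mining operates on whole subgraphs, and the edge names and thresholds here are illustrative.

```python
from collections import Counter

def frequent_edges(graphs, min_support):
    """Edges present in at least min_support fraction of the family graphs."""
    n = len(graphs)
    counts = Counter(e for g in graphs for e in set(g))
    return {e for e, c in counts.items() if c / n >= min_support}

def matches_family(target_edges, signature, threshold=1.0):
    """Does the target cover enough of the family signature?"""
    hit = len(signature & set(target_edges))
    return hit / len(signature) >= threshold

# each graph: list of (source_behavior, dependent_behavior) edges
family = [
    [("open", "write"), ("write", "send")],
    [("open", "write"), ("write", "send"), ("send", "close")],
    [("open", "write"), ("write", "send")],
]
sig = frequent_edges(family, min_support=0.9)
print(matches_family([("open", "write"), ("write", "send")], sig))  # True
```

Matching against one compact signature instead of every variant's full graph is what yields the reduction in graph count and matching time that the abstract reports.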
A Low-Coupling Method in Sensor-Cloud Systems Based on Edge Computing
Liang Yuzhu, Mei Yaxin, Yang Yi, Ma Ying, Jia Weijia, Wang Tian
2020, 57(3):  639-648.  doi:10.7544/issn1000-1239.2020.20190588
The rapid development of the IoT and cloud computing has spawned a new network structure: the sensor cloud, a combination of the IoT and cloud computing. Physical sensor nodes in the IoT can be virtualized into multiple nodes through the sensor cloud platform to provide services to users. However, when one sensor node receives multiple service commands at the same time, service conflicts, called coupling problems, occur. Coupling problems can lead to service failures and compromise system security. To solve this problem, this paper proposes an extended KM (Kuhn-Munkres) algorithm based on edge computing. Edge computing is an emerging computational paradigm increasingly used in IoT applications, particularly those that cannot be served efficiently by cloud computing due to limitations such as latency. The edge computing platform acts as middleware and provides the scheduling method. Firstly, the edge computing layer merges similar commands to reduce the number of commands transmitted downward. Secondly, the buffered data in the edge computing layer is scheduled. Finally, the extended KM algorithm is used to achieve maximum matching in each round. Theoretical analysis and experimental results show that the proposed method improves resource utilization, reduces computation cost, and solves the coupling problem in minimal time.
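The matching step can be illustrated with a brute-force maximum-weight assignment of service commands to virtual-node slots; the KM (Kuhn-Munkres) algorithm computes the same optimum in O(n^3) time, which is why it is used at scale. The weight matrix below is made up, and the exhaustive search is only a readable stand-in for small examples.

```python
from itertools import permutations

def best_assignment(weight):
    """Exhaustively find the max-weight perfect matching on an n x n matrix."""
    n = len(weight)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):   # perm[i] = slot given to command i
        total = sum(weight[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best, best_perm

# weight[i][j]: benefit of serving command i with virtual-node slot j
w = [[3, 1, 2],
     [2, 4, 6],
     [5, 2, 1]]
print(best_assignment(w))  # (12, (1, 2, 0))
```

Each scheduling round in the paper solves one such assignment; maximizing per-round matching weight is what drives up resource utilization across the conflicting commands.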
Optimization of the Key-Value Storage System Based on Fused User-Level I/O
An Zhongqi, Zhang Yunyao, Xing Jing, Huo Zhigang
2020, 57(3):  649-659.  doi:10.7544/issn1000-1239.2020.20180799
Traditional distributed key-value storage systems are commonly designed around the conventional Socket and POSIX I/O interfaces. Limited by the interface semantics and OS kernel overhead, such key-value systems struggle to achieve high efficiency on modern high-performance network and storage hardware. In this paper, we propose a fused user-level I/O approach to improve throughput and latency consistency for key-value systems built on high-speed Ethernet and NVMe SSDs. The control plane of the proposed I/O stack uses a single processor core and a single context to cooperatively manage the hardware queues of both the NIC and the SSD devices; the overheads of kernel-mode entry, interrupts, context switches, and inter-core communication are eliminated. The data plane is driven by a unified memory pool for fused I/O access, and data is transferred directly between the key-value system and the device hardware without extra copies. For requests with large payloads, data is sliced and fed into different DMA stages, and latency is further hidden through pipelining and overlapping. We present UKV, an all-in-userland key-value system with support for a two-level DRAM-SSD storage hierarchy and the widely used Memcache interface. The experimental results indicate that, compared with Fatcache, the QPS of SSD-involved SET requests increases by 14.97%~97.78% and the QPS of GET operations increases by 14.60%~51.81%; the p95 latency of SSD-involved SET requests is reduced by 26.12%~40.90% and that of GET operations by 15.10%~24.36%.
A Consistency Mechanism for Distributed Persistent Memory File System
Chen Bo, Lu Youyou, Cai Tao, Chen Youmin, Tu Yaofeng, Shu Jiwu
2020, 57(3):  660-667.  doi:10.7544/issn1000-1239.2020.20190074
Persistent memory and RDMA (remote direct memory access) provide high bandwidth and low latency to storage systems, bringing new opportunities for designing high-performance distributed storage systems. However, their new features raise many challenges for data consistency management. On the one hand, to consistently update data in persistent memory, one must actively execute hardware instructions to flush data out of the CPU cache, and such instructions incur extremely high overhead and seriously affect CPU performance. On the other hand, RDMA can directly read and write remote memory without involving the remote CPU. The server CPU is therefore unaware of remote write events and fails to flush the data; in case of a system failure, the data would be left in an inconsistent state. Regarding these two problems, this paper proposes CCM, a consistency mechanism for a distributed persistent memory file system. Firstly, we design and implement a consistency strategy based on a persistent operation log, which maintains system consistency by writing operation information to the log and persisting it. Secondly, we design a client-to-server consistency strategy that enables the remote CPU to actively flush data once a data transfer completes. Lastly, we implement asynchronous data flushing on the server side to improve performance. Our experimental results show that the write bandwidth can reach 88% of the network's raw bandwidth. Compared with Octopus, a state-of-the-art distributed file system, CCM shows a performance reduction of less than 1%.
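A user-space sketch of the persistent-operation-log idea: each appended record carries a checksum, and replay stops at the first torn record, so a crash mid-append cannot leave undetected inconsistency. The record format and the comment marking where a flush-and-fence would go are illustrative assumptions, not CCM's actual on-media layout.

```python
import zlib

def append_record(log, op):
    """Append one operation record; checksum guards against torn writes."""
    payload = op.encode()
    log.append((zlib.crc32(payload), payload))
    # a real persistent-memory log would issue cache-line flush + fence
    # here, before acknowledging the client

def replay(log):
    """Recover committed operations; stop at the first corrupt record."""
    ops = []
    for crc, payload in log:
        if zlib.crc32(payload) != crc:     # torn/partial record detected
            break
        ops.append(payload.decode())
    return ops

log = []
append_record(log, "SET k1 v1")
append_record(log, "SET k2 v2")
log.append((0, b"SET k3 v3"))              # simulate a crash mid-append
print(replay(log))  # ['SET k1 v1', 'SET k2 v2']
```

The same log-then-persist ordering is what lets a server remain correct even when RDMA writes land without its CPU noticing: only records that survived a full persist are replayed after a failure.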