2022 Vol. 59 No. 1
2022, 59(1): 1-21.
DOI: 10.7544/issn1000-1239.20200629
Abstract:
In the era of rapidly increasing network speeds, the memory access bottleneck has become a prominent problem, and network processing overhead has grown significantly. Ordinary network cards have gradually exposed their shortcomings in network protocol processing, data migration, and programmability in modern applications. As a programmable intelligent network device, the SmartNIC (smart network interface card) has received extensive attention in data centers and scientific computing clusters, and has become a key technology for resolving network bottlenecks. It brings significant benefits in protocol-processing offload, network function virtualization, application-specific acceleration, and other usage scenarios. We survey the basic architectures, programming frameworks, application directions, and other hot research issues of SmartNICs, summarize typical products in industry and important achievements in academia, and clarify the advantages and disadvantages of different design architectures. We introduce the application scenarios suited to different programming frameworks and the value of SmartNICs in typical data center and scientific computing applications. We then give suggestions for the hardware-software co-design of SmartNICs in different application scenarios. Finally, we present the open issues that remain in the design and use of SmartNICs, and put forward important future research directions and a general design solution for SmartNICs.
2022, 59(1): 22-30.
DOI: 10.7544/issn1000-1239.20200314
Abstract:
In recent years, as Moore's law approaches its limit, the development of the system on chip (SoC) has encountered bottlenecks. Integrating more functional units and larger on-chip storage sharply increases chip area, which lowers yield and in turn raises cost. To break through these limitations, research institutions and chip manufacturers have begun to use advanced interconnect and packaging technologies to decompose a monolithic chip into multiple smaller, higher-yield, and more cost-effective chiplets and then reassemble them. This packaging approach is similar to system in package (SiP). At present there is no unified standard for chiplet packaging: feasible solutions include splicing dies through silicon bridges or connecting them through interposers, and they can be classified into 2D, 2.5D, and 3D according to the packaging structure. By surveying currently released chiplet products, we discuss the advantages and disadvantages of each structure. The communication structure among multiple chiplets is also a research focus: how should a traditional bus or network on chip (NoC) be implemented across chiplets? Through a discussion of existing technologies, this paper explores the future development trends and directions of chiplets.
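The yield argument above can be made concrete with a standard Poisson defect-yield model (Y = e^{-A·D0}). The sketch below is purely illustrative and not taken from the paper; the defect density and die areas are hypothetical, and packaging and assembly losses are ignored.

```python
import math

def die_yield(area_mm2: float, defect_density: float) -> float:
    """Poisson defect-yield model: Y = exp(-A * D0).
    Real fabs use more refined (e.g. negative-binomial) models."""
    return math.exp(-area_mm2 * defect_density)

D0 = 0.001                                  # hypothetical defects per mm^2
mono_yield = die_yield(800, D0)             # one large 800 mm^2 SoC die
chiplet_yield = die_yield(200, D0)          # one 200 mm^2 chiplet

# silicon area spent per *working* product (ignoring packaging losses):
mono_cost = 800 / mono_yield                # ~1782 mm^2 per good SoC
chiplet_cost = 4 * 200 / chiplet_yield      # ~977 mm^2 per good 4-chiplet set

print(f"monolithic yield {mono_yield:.1%}, chiplet yield {chiplet_yield:.1%}")
print(f"silicon per working product: {mono_cost:.0f} vs {chiplet_cost:.0f} mm^2")
```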
2022, 59(1): 31-46.
DOI: 10.7544/issn1000-1239.20200503
Abstract:
Traditional cache replacement policies are mainly based on heuristics. In recent years, researchers have used prediction techniques to improve cache replacement performance, and the application of prediction is becoming a research focus in cache replacement. Because the behaviors of loads and stores are complex, predicting these behaviors in caching systems is difficult and inherently uncertain. Existing approaches try to resolve the problem with increasingly complicated prediction algorithms; however, they neither reduce this uncertainty nor avoid the interference of out-of-order execution and cache prefetching at the same time. To solve these problems, we propose an approach for predicting future memory references named IFAPP (instruction flow access pattern prediction). IFAPP recognizes loads and stores in the instruction flow predicted by branch prediction and then predicts the behavior of each of them. It calculates reuse distances from the predicted memory references and evicts the candidate with the largest reuse distance. IFAPP avoids the interference of out-of-order execution and cache prefetching, and the objects of prediction are single load/store behaviors, which are easy to predict; both alleviate the uncertainty of cache prediction. Evaluations show that IFAPP reduces L1D cache misses by 3.2% compared with LRU, and by 12.3% and 14.4% compared with BRRIP and BIP, respectively.
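To make the eviction rule concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of choosing a victim by predicted reuse distance: among the blocks in a cache set, evict the one whose next reference in the predicted access stream is farthest away, or never occurs.

```python
def evict_candidate(cached_blocks, predicted_accesses):
    """Pick the victim with the largest predicted reuse distance.

    cached_blocks      : block addresses currently in the cache set
    predicted_accesses : block addresses the predicted instruction flow is
                         expected to reference next, in order
    A block never referenced in the predicted window gets an infinite
    reuse distance and is evicted first.
    """
    def reuse_distance(block):
        try:
            return predicted_accesses.index(block)
        except ValueError:
            return float("inf")
    return max(cached_blocks, key=reuse_distance)

# toy example: block 0xC is not referenced in the predicted window
print(evict_candidate({0xA, 0xB, 0xC}, [0xB, 0xA, 0xB]))  # -> 12 (0xC)
```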
2022, 59(1): 47-80.
DOI: 10.7544/issn1000-1239.20201055
Abstract:
In recent years, the application of deep learning to graph-structured data has attracted increasing attention. The emergence of graph neural networks has led to major breakthroughs in tasks involving such data, in fields such as social networks, natural language processing, computer vision, and even the life sciences. A graph neural network treats a problem as connections between nodes in a graph together with message propagation among them, so that dependencies between nodes can be modeled and graph-structured data handled well. In view of this, graph neural network models and their applications are systematically reviewed. First, graph convolutional neural networks are explained from three aspects: the spectral domain, the spatial domain, and pooling. Then, graph neural network models based on attention mechanisms and autoencoders are described, supplemented by graph neural networks implemented with other methods. Second, the discussion and analysis of whether graph neural networks can be made larger and deeper are summarized. Furthermore, four graph neural network frameworks are summarized, and the applications of graph neural networks in natural language processing, computer vision, and other areas are explained in detail. Finally, future research on graph neural networks is summarized and prospected. Compared with existing reviews of graph neural networks, this survey elaborates the underlying spectral theory in detail and comprehensively summarizes the development of spectral-domain graph convolutional neural networks. At the same time, a new classification standard is given, together with improved models addressing the low efficiency of spatial-domain graph convolutional neural networks. For the first time, the discussion and analysis of the expressive power and theoretical guarantees of graph neural networks are summarized, a new framework model is added, and the latest applications of graph neural networks are explained in the application section.
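As background for the spectral-domain discussion, the widely used first-order approximation of a spectral graph convolution (the GCN propagation rule of Kipf and Welling) can be written in a few lines of NumPy. This is a generic textbook sketch, not code from any model reviewed in the paper; the graph and weights are toy values.

```python
import numpy as np

def gcn_layer(adj: np.ndarray, feats: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    propagated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weight
    return np.maximum(propagated, 0.0)            # ReLU

# toy graph: 3 nodes on a path, 2-d features, 2 output channels
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
h = gcn_layer(adj, rng.normal(size=(3, 2)), rng.normal(size=(2, 2)))
print(h.shape)  # (3, 2)
```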
2022, 59(1): 81-104.
DOI: 10.7544/issn1000-1239.20200848
Abstract:
Knowledge tracing is an important research direction in educational data mining. Its goal is to determine students' mastery of knowledge by modeling how their knowledge state changes over time, and to mine latent learning patterns from their learning trajectories. Fulfilling this goal enables personalized guidance for students and advances AI-assisted education. Owing to its powerful feature-extraction capability, deep learning has been shown to significantly improve the performance of knowledge tracing models and has attracted increasing attention. Starting from the most basic deep knowledge tracing model, this paper comprehensively reviews the research progress in this field and provides both an analysis of technical improvements and an evolutionary map. Three main directions of technical improvement are elaborated and compared: 1) improving interpretability, 2) handling long-term dependencies, and 3) compensating for the lack of learning features. At the same time, existing models in the field are classified, public data sets are sorted out, and the main application areas are surveyed for researchers. Finally, future research directions of deep-learning-based knowledge tracing are explored.
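For orientation, the "most basic deep knowledge tracing model" mentioned above is commonly an RNN over one-hot (skill, correctness) interactions. The PyTorch sketch below is a minimal illustration of that common formulation; the layer sizes and input encoding are assumptions, not the configuration of any surveyed model.

```python
import torch
from torch import nn

class DKT(nn.Module):
    """Minimal deep-knowledge-tracing sketch: an LSTM over one-hot
    (skill, correctness) interactions predicts, for every skill, the
    probability that the next answer is correct."""
    def __init__(self, num_skills: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * num_skills, hidden_size=hidden,
                            batch_first=True)
        self.out = nn.Linear(hidden, num_skills)

    def forward(self, x):                  # x: (batch, time, 2*num_skills)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h))  # per-skill correctness probability

model = DKT(num_skills=10)
x = torch.zeros(1, 5, 20)                  # one student, 5 interactions
x[0, 0, 3] = 1.0                           # e.g. skill 3 answered correctly
print(model(x).shape)                      # torch.Size([1, 5, 10])
```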
2022, 59(1): 105-117.
DOI: 10.7544/issn1000-1239.20200765
Abstract:
Cross-domain training tasks are currently an open challenge in machine learning. Recent studies exploit the cross-domain invariance of true (invariant) features to predict data from unknown domains, thereby achieving cross-domain generalization. In fact, however, when the data are known to come from a particular domain, jointly using true features and spurious (false) features achieves better prediction. This paper focuses on this issue and designs a learning model suitable for both cross-domain generalization and adaptation tasks (CDGA). The core of the model is still to separate the true features, so this paper proposes a new, more stable training risk function, which not only attains higher test accuracy on cross-domain generalization problems but also overcomes the tendency of traditional methods to overfit, and can therefore be embedded well in the CDGA model. In addition, with the designed training procedure, the data representation part of the CDGA model effectively separates true features from spurious features, and the classifier part adaptively learns to select either the generalized classifier or the classifier for a specific environment, thereby also exploiting spurious features to achieve efficient prediction in cross-domain tasks. Finally, the model is tested on a constructed Colored MNIST data set, and the results are significantly better than those of existing methods.
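The abstract does not spell out the proposed risk function. For context only, the sketch below shows the well-known IRMv1 invariance penalty (Arjovsky et al.), a widely used training risk in this family that also aims to separate invariant from spurious features; it is not the CDGA formulation, and the model and batches are placeholders.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """IRMv1-style invariance penalty: squared gradient of the
    per-environment risk w.r.t. a dummy classifier scale of 1.0."""
    scale = torch.tensor(1.0, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return grad.pow(2)

def training_risk(per_env_batches, model, penalty_weight=1e2):
    """Average empirical risk plus invariance penalty over environments.
    per_env_batches: one (x, y) batch per environment; y is a float 0/1 tensor."""
    total = torch.tensor(0.0)
    for x, y in per_env_batches:
        logits = model(x).squeeze(-1)
        total = total + F.binary_cross_entropy_with_logits(logits, y) \
                      + penalty_weight * irm_penalty(logits, y)
    return total / len(per_env_batches)
```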
2022, 59(1): 118-126.
DOI: 10.7544/issn1000-1239.20200626
Abstract:
Many tasks in natural language understanding, such as natural language inference, question answering, and paraphrase identification, can be viewed as short-text matching problems. Recently, the emergence of a large number of datasets and deep learning models has brought great success in short-text matching. However, little work has analyzed how models generalize across these datasets and different text-matching tasks, or how supervised datasets from multiple domains can be leveraged in new domains to reduce annotation cost and improve performance. In this paper, we conduct an extensive investigation of generalization and transfer across different datasets and show, through visualization, the factors that affect generalization. Specifically, we experiment with a conventional neural semantic matching model, ESIM (enhanced sequential inference model), and a pre-trained language model, BERT (bidirectional encoder representations from transformers), over 10 common datasets. We show that even BERT, which is pre-trained on a large-scale corpus, can still improve performance on a target dataset through transfer learning. Following our analysis, we also demonstrate that pre-training on multiple datasets yields good generalization and transfer. In a new-domain, few-shot setting, BERT that is first pre-trained on multiple datasets and then transferred to new datasets achieves exciting performance.
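A minimal sketch of the BERT transfer setup described above, using the Hugging Face transformers API for sentence-pair classification. The checkpoint name, label set, and example pair are placeholders rather than the paper's exact configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # 2 = match / no match (assumed)

# encode one hypothetical sentence pair as a single [CLS] a [SEP] b input
batch = tokenizer(
    ["A man is playing a guitar."], ["Someone plays an instrument."],
    truncation=True, padding=True, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # cross-entropy loss + logits
outputs.loss.backward()                  # one gradient step of fine-tuning
print(outputs.logits.shape)              # torch.Size([1, 2])
```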
2022, 59(1): 127-143.
DOI: 10.7544/issn1000-1239.20200562
Abstract:
As a new type of data, streaming data has been applied in various fields. Its fast, massive, and continuous nature makes single-pass, accurate scanning an essential feature of online learning. Concept drift often occurs as streaming data is continuously generated. Research on concept drift detection is relatively mature; in reality, however, changes of learning-environment factors in different directions often lead to diverse classes of concept drift in streaming data, which poses new challenges to streaming data mining and online learning. To address this problem, this paper proposes a concept drift class detection method based on time windows (CD-TW). The method accesses data through a stack and a queue and learns streaming data in chunks with a window mechanism. It detects the concept-drift site by creating two basic site time windows, which load historical data and current data respectively, and comparing the distribution changes of the data they contain. Then a span time window is created that loads part of the data after the drift site. The drift span is obtained by analyzing the distribution stability of the data in the span time window, and it is further used to judge the class of the concept drift. Experimental results demonstrate that CD-TW not only detects the concept-drift site accurately but also performs well in judging the class of the concept drift.
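A toy illustration of the two-window comparison that CD-TW performs at the drift site. The paper's own distribution-change measure is not specified in the abstract, so a two-sample Kolmogorov-Smirnov test is used here purely as a stand-in, and the window contents are made up.

```python
from collections import deque
from scipy.stats import ks_2samp

def detect_drift(history, current, alpha=0.01):
    """Compare the data distributions of a historical window and a current
    window; a small p-value suggests a concept-drift site."""
    stat, p_value = ks_2samp(list(history), list(current))
    return p_value < alpha, stat

historical_window = deque([0.1, 0.2, 0.15, 0.18, 0.22] * 20, maxlen=100)
current_window = deque([0.8, 0.9, 0.85, 0.95, 0.88] * 20, maxlen=100)
drifted, stat = detect_drift(historical_window, current_window)
print(drifted, round(stat, 3))   # True, KS statistic close to 1.0
```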
2022, 59(1): 144-171.
DOI: 10.7544/issn1000-1239.20201042
Abstract:
With the rapid development of data-driven intelligent technologies, large-scale data collection has become a main application scenario of data governance and privacy preservation. Local differential privacy, as a mainstream technique, has been widely deployed by companies such as Google, Apple, and Microsoft. However, it has a fatal drawback: poor data utility caused by the accumulated noise added to users' data. To balance data privacy and utility, the ESA (encode-shuffle-analyze) framework has been proposed. This framework adds as little noise as possible while maintaining the same degree of privacy, ensuring that any user's sensitive information can be used effectively yet cannot be recognized from the collected data. Given the elegant and rigorous mathematical definition of differential privacy, the main implementation of the ESA framework is based on differential privacy and is called shuffle differential privacy. Under the same privacy loss, the data utility of the shuffle differential privacy method is O(n^{1/2}) higher than that of local differential privacy, approaching that of central differential privacy without relying on a trusted third party. This paper is a survey of this novel privacy-preserving framework. Focusing on the popular shuffle differential privacy technology, it analyzes the framework, summarizes its theoretical and technical foundations, and compares different privacy-preserving mechanisms on different statistical problems both theoretically and experimentally. Finally, this work discusses the challenges of the ESA framework and prospects the implementation of non-differential-privacy methods on top of it.
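A toy encode-shuffle-analyze pipeline for estimating the frequency of a sensitive bit, using classic randomized response as the local randomizer. This is only a schematic of the workflow; the privacy-amplification analysis that yields the O(n^{1/2}) utility gap is omitted, and the population is synthetic.

```python
import math
import random

def randomized_response(bit: int, epsilon: float) -> int:
    """Encode step: report the true bit with probability e^eps / (e^eps + 1)."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return bit if random.random() < p else 1 - bit

def shuffle_and_analyze(true_bits, epsilon):
    """Shuffle step breaks the link between users and reports; the analyzer
    then debiases the shuffled reports to estimate the frequency of 1s."""
    reports = [randomized_response(b, epsilon) for b in true_bits]
    random.shuffle(reports)                       # anonymizing shuffler
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)     # unbiased frequency estimate

random.seed(0)
users = [1] * 300 + [0] * 700                     # true frequency = 0.3
print(round(shuffle_and_analyze(users, epsilon=1.0), 3))
```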
2022, 59(1): 172-181.
DOI: 10.7544/issn1000-1239.20200576
Abstract:
The blockchain ledger is open and transparent so that transactions are traceable and verifiable, but this makes the privacy of blockchain users an urgent problem. To prevent exposure of the transaction amount and the identities of both parties in consortium (alliance) chain transactions, a privacy protection method for consortium chains based on group signatures and homomorphic encryption is proposed. The method protects the identity of the payee and the privacy of the transaction amount while preserving the traceability and verifiability of transactions. In this scheme, the notion of a group in group signatures is combined with the consortium chain, and the concept of partial identity anonymity is proposed so that a user's identity is anonymous to other secondary nodes but verifiable by the primary nodes. The additive homomorphism of Paillier encryption is then used to verify the legitimacy of a transaction while protecting the privacy of its amount. A four-step verification method for the primary nodes is proposed: by verifying the group signature, account ownership, and the validity of the transaction amount, the primary nodes supervise the legality of transactions. Analysis shows that the scheme resists tampering attacks and public-key replacement attacks and that its transaction-legitimacy verification is sound. Finally, comparison with other schemes shows that the computational cost of this scheme is reasonable.
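The additive homomorphism used to check balance updates without revealing amounts can be illustrated with the open-source python-paillier (phe) library. This is an illustrative stand-in, not the scheme's actual implementation, and the amounts are made up.

```python
# pip install phe  -- python-paillier, used here only for illustration
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

balance = public_key.encrypt(100)      # encrypted account balance
payment = public_key.encrypt(-30)      # encrypted (negative) transfer amount

# additive homomorphism: the sum is computed directly on ciphertexts, so a
# verifier can check the balance update without learning the amounts
new_balance = balance + payment
assert private_key.decrypt(new_balance) == 70
```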
2022, 59(1): 182-196.
DOI: 10.7544/issn1000-1239.20200701
Abstract:
Generally, as the attribute dimensionality of a data set increases, the time cost and noise generated by differentially private publishing methods for high-dimensional data also increase; high-dimensional binary data in particular is easily overwhelmed by excessive noise. Therefore, an efficient, low-noise publishing method, PrivSCBN (differentially private spectral clustering Bayesian network), is proposed for the private publication of high-dimensional binary data. First, based on the Jaccard distance, the method uses a spectral clustering algorithm that satisfies differential privacy to partition the attribute set and further segment the original data set, thus achieving dimension reduction. Second, based on dynamic programming combined with the exponential mechanism, it uses a fast, differentially private Bayesian-network construction algorithm to build a Bayesian network for each subset after segmentation. Finally, it exploits the value characteristics of conditional probabilities on binary data to add noise to the conditional distributions extracted from the Bayesian networks, and reduces the noise by controlling the maximum in-degree of the Bayesian networks. The efficiency and utility of PrivSCBN are verified by experiments on three real high-dimensional binary data sets.
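A generic sketch of the final step: perturbing the conditional probability table of a binary node with the Laplace mechanism and renormalizing. The sensitivity, scale calibration, and clipping here are simplifying assumptions, not PrivSCBN's exact noise calibration.

```python
import numpy as np

def noisy_conditional(counts: np.ndarray, epsilon: float, rng=None) -> np.ndarray:
    """Add Laplace noise to the counts of a conditional distribution
    P(X | parents) for binary X, then renormalize each row.

    Sensitivity is taken as 1 (one record changes one count), a common
    simplification for count queries."""
    rng = rng or np.random.default_rng()
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 1e-12, None)                # keep counts valid
    return noisy / noisy.sum(axis=-1, keepdims=True)   # renormalize rows

# counts[parent_value, x_value] for a binary node with one binary parent
counts = np.array([[80.0, 20.0], [10.0, 90.0]])
print(noisy_conditional(counts, epsilon=1.0))
```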
2022, 59(1): 197-208.
DOI: 10.7544/issn1000-1239.20200492
Abstract:
Catching and applying vulnerability-fix patches in time is critical to ensuring the security of information systems. However, open-source software maintainers often fix security vulnerabilities silently: for example, 88% of maintainers delay notifying users of vulnerability fixes in the release notes of new software versions, only 9% of bug fixes clearly give the corresponding CVE ID, and only 3% of fixes actively notify security service providers in time. In many cases, security engineers cannot directly distinguish vulnerability fixes, bug fixes, and feature patches from a patch's code and log message. As a result, vulnerability fixes cannot be identified and applied by users in a timely manner, and identifying them among a large number of patch submissions is costly. Taking Linux as an example, this paper presents a method for identifying vulnerability patches automatically. The method defines features over the code and log message of patches, builds a machine learning model, and trains classifiers that distinguish vulnerability patches. Experiments indicate that our approach is effective, achieving 91.3% precision, 92% accuracy, and 87.53% recall while reducing the false-positive rate to 5.2%.
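A minimal baseline in the spirit of the approach: extract text features from a patch's log message and code and train a classifier. The features, samples, and labels below are illustrative placeholders; the paper defines its own, richer feature set and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy corpus: each sample is a commit log message concatenated with its diff;
# the samples and labels are purely illustrative placeholders
patches = [
    "fix buffer overflow in parser  +if (len > MAX) return -EINVAL;",
    "add new sysfs attribute for fan speed  +DEVICE_ATTR(fan, ...);",
    "fix use-after-free in socket close  +sock_hold(sk);",
    "update documentation for build flags",
]
labels = [1, 0, 1, 0]    # 1 = vulnerability fix, 0 = other patch

# a minimal text-feature baseline rather than the paper's feature engineering
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(patches, labels)
print(clf.predict(["fix out-of-bounds read in usb driver  +if (idx >= n) return;"]))
```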
2022, 59(1): 209-235.
DOI: 10.7544/issn1000-1239.20200778
Abstract:
Firmware comparison is an important branch of binary comparison. However, existing binary comparison techniques perform poorly when applied to firmware. Previous studies focused on optimizing function representations but neglected the design and improvement of filters, which leads to mismatches when firmware contains isomorphic functions. For this reason, this paper proposes a firmware comparison technique based on community detection and, for the first time, applies complex-network theory to binary comparison. The functions in a firmware image are divided into communities by a community detection algorithm; community matching serves as the filter, and matching functions are then found within the matched communities. In addition, this paper optimizes the function-similarity calculation and designs an operand-similarity calculation method. After implementing a prototype system, this paper uses 1,382 firmware images to construct two data sets for experiments that verify feasibility, analyze the performance of the method, and determine reasonable parameter values. The credible matching rate is designed as the evaluation metric to compare this method with BinDiff. Experiments show that the method improves the accuracy of BinDiff's comparison results by 5% to 11%.
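A small sketch of the filtering idea: build a call graph over a firmware's functions and partition it into communities, so candidate matches are only sought within matched communities. Greedy modularity maximization from networkx stands in for whichever community-detection algorithm the paper uses, and the function names are hypothetical.

```python
import networkx as nx
from networkx.algorithms import community

def function_communities(call_edges):
    """Group a firmware's functions into communities of its call graph.

    call_edges: iterable of (caller, callee) pairs recovered by disassembly.
    Returns a list of sets of function names, one per community."""
    graph = nx.Graph(call_edges)
    return [set(c) for c in community.greedy_modularity_communities(graph)]

edges = [("main", "parse_cfg"), ("parse_cfg", "read_file"),
         ("main", "start_httpd"), ("start_httpd", "handle_req"),
         ("handle_req", "send_resp")]
for c in function_communities(edges):
    print(sorted(c))
```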
2022, 59(1): 236-250.
DOI: 10.7544/issn1000-1239.20200586
Abstract:
Reusing existing high-quality source code can improve the efficiency of software development and the quality of software. At present, code search based on input/output queries provided by users is one of the main approaches to semantic code search, but existing approaches have difficulty describing the complete behavior of code and can handle only a single input type. This paper proposes a semantic code search approach based on the reachability analysis of Petri nets that matches multiple forms of types. First, the semantic processes of code snippets in the code corpus, characterized by the number and types of their data objects, are converted into improved Petri net models. Second, the initial marking and target marking of each Petri net model are constructed according to the number and types of data objects contained in the user's query. Matching code snippets are obtained by analyzing reachable paths in the reachability graphs and induced networks of the Petri nets. Analysis and experimental results show that this approach helps find the desired code snippets for queries with multiple forms of input/output types, and compared with traditional approaches it significantly improves the accuracy and efficiency of code search.
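A toy marking-reachability check over a place/transition net, illustrating how a query's input types (initial marking) can be tested against its output types (target marking). The three-place type model and the transitions are hypothetical and far simpler than the improved Petri nets in the paper; the breadth-first search assumes a bounded net.

```python
from collections import deque

def reachable(initial, target, transitions):
    """Breadth-first search over Petri-net markings.

    initial, target : tuples giving the token count in each place
    transitions     : list of (consume, produce) tuples per transition
    Returns True if some firing sequence reaches a marking that covers
    the target (a toy model of matching a query's output types)."""
    seen, queue = {initial}, deque([initial])
    while queue:
        marking = queue.popleft()
        if all(m >= t for m, t in zip(marking, target)):
            return True
        for consume, produce in transitions:
            if all(m >= c for m, c in zip(marking, consume)):
                nxt = tuple(m - c + p for m, c, p in zip(marking, consume, produce))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

# places: (str, int, list); one transition models a snippet str -> int,
# another int -> list (a hypothetical, greatly simplified type model)
ts = [((1, 0, 0), (0, 1, 0)), ((0, 1, 0), (0, 0, 1))]
print(reachable((1, 0, 0), (0, 0, 1), ts))   # True
```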