ISSN 1000-1239 CN 11-1777/TP

Table of Content

01 February 2020, Volume 57 Issue 2
A Write-Optimized Re-computation Scheme for Non-Volatile Memory
Zhang Ming, Hua Yu, Liu Lurong, Hu Rong, Li Ziyi
2020, 57(2):  243-256.  doi:10.7544/issn1000-1239.2020.20190524
Asbtract ( 583 )   HTML ( 14)   PDF (2602KB) ( 410 )  
Related Articles | Metrics
The rise of non-volatile memory (NVM) technology brings many opportunities and challenges to computer storage systems. Compared with DRAM, NVM has the advantages of non-volatility, low energy consumption and high storage density as persistent memory. However, it has the disadvantages of limited erase/write times and high write latency. Therefore, it is necessary to reduce the write operations to the non-volatile main memory to improve system performance and extend the lifetime of NVM. To address this problem, this paper proposes ROD, which is a re-computation scheme based on the out degree of computing nodes. Due to the performance gap between CPU and memory leading to computing resources waste, ROD selectively discards the computed results which should be stored into the NVM and re-computes them when needed. By swapping the storage with computation, ROD reduces the number of writes to the NVM. We have implemented ROD and evaluated it on Gem5 simulator with NVMain. We also implemented the greedy re-computation scheme and the store-only scheme as state-of-the-art work to be compared with ROD. Experimental results from powerstone benchmark show that ROD reduces write operations by 44.3% on average (up to 68.5%) compared with the store-only scheme. The execution time is reduced by 28.1% on average (up to 68.6%) compared with the store-only scheme, and 9.3% on average (up to 19.4%) compared with the greedy scheme.
A High Throughput NVM Storage System Based on Access Request Conflict Detection
Cai Tao, Wang Jie, Niu Dejiao, Liu Peiyao, Chen Fuli
2020, 57(2):  257-268.  doi:10.7544/issn1000-1239.2020.20190526
Asbtract ( 448 )   HTML ( 12)   PDF (2727KB) ( 217 )  
Related Articles | Metrics
The NVM storage is a useful way to improve the efficiency of storage system in computer. However, there is a lack of adaptation and optimization mechanisms for NVM storage devices in I/O stack of operating system. Especially, the file system-based lock mechanism becomes an important factor affecting the efficiency of NVM storage systems. In this paper, we embed the management function of the access request for storage system in the NVM storage device. In order to improve the access request concurrency of operating system and alleviate the performance bottleneck caused by the device interface, we remove the existing lock mechanism in file system and use the conflict detection algorithm in NVM storage devices. Firstly, the structure of the high-throughput NVM storage system is given. Then, an access request management method based on two-dimensional linked list is design to change the structure of access request management and reduce the conflict of management. And a conflict detection algorithm is design to manage the access requests for the sharing data stored in NVM storage device. The new submission and release process for access requests are designed. Finally, the prototype of high throughput NVM storage system named HTPM is implemented based on PMEM, which is the open source NVM storage device simulator from Intel. The I/O performance and throughput of HTPM are tested by Fio and Filebench. The results show that HTPM can improve the IOPS by 31.9% and the I/O performance by 21.4% compared with PMEM.
Fingerprint Search Optimization for Deduplication on Emerging Storage Devices
He Kewen, Zhang Jiachen, Liu Xiaoguang, Wang Gang
2020, 57(2):  269-280.  doi:10.7544/issn1000-1239.2020.20190543
Asbtract ( 418 )   HTML ( 9)   PDF (1942KB) ( 256 )  
Related Articles | Metrics
Fingerprint search part is I/O intensive, and the performance of the external storage device is the bottleneck of fingerprint search. Therefore, this paper focuses on the fingerprint search of data deduplication system. This paper compares the traditional eager deduplication algorithm with lazy deduplication algorithms that reduce the number of disk accesses, and studies deduplication algorithm on the emerging storage devices: Optane SSD and persistent memory, and gives optimization suggestions. In this paper, we model the fingerprint search delay of the eager deduplication algorithm and the lazy deduplication algorithm, and three conclusions under the new storage device are obtained through the modeling results: 1) The number of fingerprints for batched search should be reduced; 2) The local ring size should be reduced on faster devices, and the local loop size has an optimal value; 3) On fast devices, the eager fingerprint lookup is better than the lazy fingerprint lookup. Finally, the experimental results verify the correctness of our model on the actual HDD, Optane SSD and emulated persistent memory. The eager algorithm is better than the lazy algorithm on the emerging storage devices, and the locality ring optimal value is advanced, which basically conforms to the conclusions of the proposed model.
A Hybrid Approach for Managing Data Pages in Persistent Memory File Systems
Chen Youmin, Zhu Bohong, Han Yinjun, Tu Yaofeng, Shu Jiwu
2020, 57(2):  281-290.  doi:10.7544/issn1000-1239.2020.20190574
Asbtract ( 439 )   HTML ( 12)   PDF (1047KB) ( 235 )  
Related Articles | Metrics
Intel has officially released the Optane DC Persistent Memory based on 3D-Xpoint technology in April 2019, which provides new opportunities for building efficient persistent memory storage systems. However, existing software is far from fully exploiting the hardware performance, due to the ignorance of utilizing the byte-addressable feature of persistent memory. This paper proposes a hybrid data page management (HDPM) mechanism. It manages file data by selectively using the copy-on-write technique and log-structure, so as to fully utilize the byte-addressable feature of persistent memory. It can avoid the redundant copy overhead as in traditional approaches when processing un-aligned or small-sized writes. To guarantee the read performance unaffected, HDPM introduces reverse-scanning mechanism, which avoids the additional data copying when rebuilding data pages from the log. HDPM also introduces a multi-stage garbage collection mechanism for log cleaning. When a single log is too large, it’s automatically reclaimed by read/write system calls. When the persistent memory space is limited, a background thread asynchronously reclaims the log space with a lock-free approach, without affecting the normal read/write performance. Experiments show that HDPM provides high write performance. Compared with NOVA, a state-of-the-art persistent memory file system, HDPM exhibits 58% lower write latency at most with the small-sized and write-intensive workload, and provides comparable performance for read operations. Our evaluation with Filebench shows that HDPM outperforms NOVA by 33% at most with 10 concurrent threads.
A Cross-Datacenter Erasure Code Writing Method Based on Generator Matrix Transformation
Bao Han, Wang Yijie, Xu Fangliang
2020, 57(2):  291-305.  doi:10.7544/issn1000-1239.2020.20190542
Asbtract ( 297 )   HTML ( 5)   PDF (4535KB) ( 150 )  
Related Articles | Metrics
In cross-datacenter storage systems, existing writing methods of erasure code usually has low encoding efficiency, low transmission efficiency, and large network resource consumption. Therefore, cross-datacenters erasure code usually has a low writing rate. This paper proposes a cross-datacenter erasure code writing method based on generator matrix transformation called CREW. Specifically, we first propose a greedy strategy-based transmission topology construction algorithm called GBTC, which can construct a tree-structured transmission topology with incremental weights (the weights are set to the network distances between datacenters) from top to bottom to organize data transmission between datacenters. Then, we propose a generator matrix transformation algorithm called GMT. Without changing the linear relationship of coded blocks, GMT can transform the generator matrix so that the number of data blocks related to a coded block is negatively correlated with the network distance between the datacenter where the coded block is located and the root of the tree-structured topology. Therefore, CREW only needs to transfer a small number of data blocks through a long network distance to write data. Thus, the network resource consumption is reduced. Finally, we propose a distributed pipelined writing algorithm called DPW to distribute encoding operations to different nodes for parallel execution and limit the number of forwards of data blocks, thereby improving encoding efficiency and transmission efficiency. Experiments show that compared with writing methods of traditional erasure code, the write rate of CREW is increased by 36.3%~57.9%. And compared with the existing writing method of cross-datacenter erasure code (IncEncoding), the writing rate of CREW is increased by 32.4%.
Proactive Fault Tolerance Based on “Collection—Prediction—Migration—Feedback” Mechanism
Yang Hongzhang, Yang Yahui, Tu Yaofeng, Sun Guangyu, Wu Zhonghai
2020, 57(2):  306-317.  doi:10.7544/issn1000-1239.2020.20190549
Asbtract ( 654 )   HTML ( 24)   PDF (1574KB) ( 281 )  
Related Articles | Metrics
Hard disk fault has become the main source of failure in data centers, which seriously affects the reliability of data. The traditional data fault tolerant technology is usually realized by increasing data redundancy, which has some shortcomings. Proactive fault tolerant technology has become a research hotspot, because it can predict hard disk failures and migrate data ahead of time. However, the existing technology mostly studies hard disk fault prediction, but lacks the research of collection, migration and feedback, which causes difficulty in commercialize. This paper proposes a whole process proactive fault tolerant on “Collection—Prediction—Migration—Feedback” mechanism, which includes time-sharing hard disk information collection method, sliding window record merging and sample building method, multi-type hard disk fault prediction method, multi-disk joint data migration method, and two-level validation of prediction results with fast feedback method. The test results show that the impact of collecting hard disk information on front-end thread is only 0.96%, the recall rate of hard disk fault prediction is 94.66%, and data repair time is 55.10% less than traditional methods. This work has been used stably in ZTE’s data center, which meets the objectives of proactive fault tolerance technology, such as high-reliability, high-intelligence, low-interference, low-cost and wide-application.
A Benefit Model Based Data Reuse Mechanism for Spark SQL
Shen Yijie, Zeng Dan, Xiong Jin
2020, 57(2):  318-332.  doi:10.7544/issn1000-1239.2020.20190563
Asbtract ( 389 )   HTML ( 12)   PDF (3529KB) ( 199 )  
Related Articles | Metrics
Analyzing massive data to discover the potential values in them can bring great benefits. Spark is a widely used data analytics engine for large-scale data processing due to its good scalability and high performance. Spark SQL is the most commonly used programming interface for Spark. There are a lot of redundant computations in data analytic applications. Such redundancies not only waste system resources but also prolong the execution time of queries. However, current implementation of Spark SQL is not aware of redundant computations among data analytic queries, and hence cannot remove them. To address this issue, we present a benefit model based, fine-grained, automatic data reuse mechanism called Criss in this paper. Criss automatically identifies redundant computations among queries. Then it uses an I/O performance aware benefit model to automatically choose the operator results with the biggest benefit and cache these results using a hybrid storage consisting of both memory and HDD. Moreover, cache management and data reuse in Criss are partition-based instead of the whole result of an operator. Such fine-grained mechanism greatly improves query performance and storage utilization. We implement Criss in Spark SQL using modified TachyonFS for data caching. Our experiment results show that Criss outperforms Spark SQL by 40% to 68%.
Efficient Index and Query Algorithm Based on Geospatial Big Data
Zhao Huihui, Zhao Fan, Chen Renhai, Feng Zhiyong
2020, 57(2):  333-345.  doi:10.7544/issn1000-1239.2020.20190565
Asbtract ( 568 )   HTML ( 23)   PDF (4069KB) ( 439 )  
Related Articles | Metrics
In recent years, with the rapid development of advanced technologies such as intelligent target recognition, electronic sensors, collaborative control and computer networks, intelligent transportation systems have achieved qualitative leapfrogging. Modern intelligent transportation systems can realize intelligent transportation of vehicles, roads and clouds management platform. However, the intelligent transportation system relies on a large amount of two-dimensional geospatial information data generated every day. Therefore, how to efficiently store and query large-scale geospatial data is of great significance for the future popularization and development of the intelligent transportation system. However, due to the complexity of urban traffic information, large amount of data, and fast update speed, the current spatial indexing technology is difficult to efficiently search for two-dimensional geospatial information data. In order to optimize the storage organization structure of two-dimensional geospatial information data under spatial big data and improve retrieval efficiency, this paper proposes a spatial index tree construction algorithm for multi-layer slice recursion of two-dimensional geospatial information data (multi-layer slice recursive, MSR). The proposed algorithm first sorts and divides the first dimension of the map data to generate FD slices. Then, the second dimension of the map data in the FD slice is sorted to generate SD slices, and in the SD slice, the current slice and the adjacent slices are divided into spatial objects. Finally, data clustering operation is performed on the comparison between the length of the spatial object and the node capacity, and the MSR Tree is recursively generated from the bottom up by judging whether all the slices complete the clustering operation. Experimental results show that the query performance of the 2-dimensional space storage structure constructed by the MSR algorithm is better than the most representative spatial indexing technology based on the R-tree batch-loading algorithm (sort tile recursive, STR), STR-grid hybrid algorithm (str-grid), and efficient geometric range query (EGRQ).
Survey on Privacy-Preserving Machine Learning
Liu Junxu, Meng Xiaofeng
2020, 57(2):  346-362.  doi:10.7544/issn1000-1239.2020.20190455
Asbtract ( 3186 )   HTML ( 141)   PDF (1684KB) ( 3278 )  
Related Articles | Metrics
Large-scale data collection has vastly improved the performance of machine learning, and achieved a win-win situation for both economic and social benefits, while personal privacy preservation is facing new and greater risks and crises. In this paper, we summarize the privacy issues in machine learning and the existing work on privacy-preserving machine learning. We respectively discuss two settings of the model training process—centralized learning and federated learning. The former needs to collect all the user data before training. Although this setting is easy to deploy, it still exists enormous privacy and security hidden troubles. The latter achieves that massive devices can collaborate to train a global model while keeping their data in local. As it is currently in the early stage of the study, it also has many problems to be solved. The existing work on privacy-preserving techniques can be concluded into two main clues—the encryption method including homomorphic encryption and secure multi-party computing and the perturbation method represented by differential privacy, each having its advantages and disadvantages. In this paper, we first focus on the design of differentially-private machine learning algorithm, especially under centralized setting, and discuss the differences between traditional machine learning models and deep learning models. Then, we summarize the problems existing in the current federated learning study. Finally, we propose the main challenges in the future work and point out the connection among privacy protection, model interpretation and data transparency.
Weighted Large-Scale Social Network Data Privacy Protection Method
Huang Haiping, Zhang Dongjun, Wang Kai, Zhu Yikai, Wang Ruchuan
2020, 57(2):  363-377.  doi:10.7544/issn1000-1239.2020.20190018
Asbtract ( 746 )   HTML ( 7)   PDF (6285KB) ( 256 )  
Related Articles | Metrics
The development of various social network applications inspires the emergence of vast amounts of users, which forms a large-scale social graph structure. Amounts of private data are involved in such a graph structure, which should be preserved before being released in order to prevent privacy leakage. And meanwhile, the intricate social relationships between users are not equivalent, and the sensitivity of individual relationships may directly affect the distribution of privacy and the effect of protection. Currently, there are many privacy protection methods for social network graphs without weights, however these methods cannot be directly applied to the scenarios with weighted values (i.e. uneven sensitivity of social relations). To solve this problem, we propose a weighted social network graph perturbation method based on non-interactive differential privacy protection model, named as dp-noisy, which can achieve strong protection of edge weights and graph structure. This method adds disturbance noise based on the single-source shortest path constraint model, classifies the critical edges and non-critical edges according to the weights, and effectively reduces the edge links that need to be disturbed. The experimental results show that in a large-scale dataset (the number of nodes is 30 000), the execution efficiency of the proposed method is improved by 47.3% than that of K-MPNP(K-shortest path privacy), 41.8% than that of LWSPA(protection algorithm based on Laplace noise for weighted social networks) and 52.6% than that of DER(density-based exploration and reconstruction). With similar degree of data privacy protection, the data availability of dp-noisy is 10% higher than that of lp-noisy, superior to that of DER and slightly better than that of LWSPA. The average perturbation quality of dp-noisy is 14% higher than that of lp-noisy, 11.3% higher than that of DER and 27% higher than that of K-MPNP. In the case of the best data availability (ε=10), the average perturbation quality of dp-noisy is 6% higher than that of LWSPA. Compared with the classical methods, our proposal achieves the satisfactory execution efficiency and data availability, which can resist graph structure attack, so it is suitable for large-scale social network applications.
Survey on Density Peak Clustering Algorithm
Chen Yewang, Shen Lianlian, Zhong Caiming, Wang Tian, Chen Yi, Du Jixiang
2020, 57(2):  378-394.  doi:10.7544/issn1000-1239.2020.20190104
Asbtract ( 985 )   HTML ( 37)   PDF (1845KB) ( 545 )  
Related Articles | Metrics
DPeak(density peak) is a simple but effective clustering method. It is able to map data with arbitrary dimension onto a 2-dimensional space, and construct hierarchical relationship for all data points on the new reduction space. This makes it is easy to pick up some distinguished points (density peaks), each of which has high density and large distance from other regions of higher density. In addition, based on regarding theses density peaks as cluster centers and the hierarchical relationship, the algorithm provides two different ways to perform the final task of clustering, i.e., one is decision diagram that can interact with users, and the other is an automatic method. In this paper, we trace the development and application trends of DPeak in recent years, summarize and comb various improvements or variations of DPeak algorithm from the following aspects. Firstly, the principle of DPeak algorithm is introduced, and its position in the classification system of clustering algorithm is discussed as well. After comparing DPeak with several other main clustering algorithms, it is found that DPeak is highly similar to mean shift, and hence, we think that DPeak may be a special variant of mean shift. Secondly, some shortcomings of DPeak are discussed, such as high time complexity, lack of adaptability, low precision and inefficiency in high dimensional space etc., and then various improved algorithms are demonstrated in different categories. In addition, some applications of DPeak in different fields, such as natural language processing, biomedical analysis and optical applications etc., are presented and combed. Last but not least, we look forward to its future work based on the problems and challenges of the DPeak.
A Benchmark for Iris Segmentation
Wang Caiyong, Sun Zhenan
2020, 57(2):  395-412.  doi:10.7544/issn1000-1239.2020.20190092
Asbtract ( 607 )   HTML ( 24)   PDF (3391KB) ( 262 )  
Related Articles | Metrics
Iris recognition has been considered as one of the most stable and reliable biometric identification technologies. In the whole process of iris recognition, iris segmentation is in the preprocessing stage, so the quality of iris segmentation can directly affect the accuracy of iris recognition. Since Daugman first proposed a high-performance iris recognition system in 1993, a variety of iris segmentation algorithms have been proposed. Especially in recent years, deep learning based iris segmentation algorithms have greatly improved the accuracy of iris segmentation. However, due to the lack of unified databases and evaluation protocols, the comparisons of different iris segmentation algorithms are messy and absence of fairness, hence an open iris segmentation evaluation benchmark is proposed. First, the definition and challenges of iris segmentation are briefly introduced. Second, a comprehensive summary of three representative, open iris segmentation databases including the characteristics and challenges is given. Next, evaluation protocols of iris segmentation are defined. Then the traditional iris segmentation algorithms and deep learning based iris segmentation algorithms are elaborated on and compared by extensive experiments. The experimental results show that deep learning based iris segmentation methods outperform the traditional approaches in terms of accuracy. Finally, we make in-depth discussions about open questions of deep learning based iris segmentation algorithms.
Multichannel Spectral-Spatial Total Variation Model for Diffractive Spectral Image Restoration
Wang Xu, Chen Qiang, Sun Quansen
2020, 57(2):  413-423.  doi:10.7544/issn1000-1239.2020.20190333
Asbtract ( 344 )   HTML ( 3)   PDF (3870KB) ( 117 )  
Related Articles | Metrics
During the imaging process of the diffractive optic imaging spectrometer, infocus images are often blurred by other defocused images. The existing recovery algorithms only utilize the spatial information and show limitations on these ill-posed inverse problems. In this paper, a regularization method based on the multichannel spectral-spatial total variation prior is proposed to reconstruct diffractive spectral images. First, an observation model of the degraded spectral images is constructed carefully, relying on the principle of diffractive spectral imaging. Then, a reconstruction model is established by combining the spatial and spectral prior information under the maximum posteriori framework. The proposed model makes full use of the local spatial smoothness and local spectral smoothness of the diffractive spectral images. Meanwhile, the ADMM (alternating direction method of multipliers) is employed to efficiently optimize the model. A large number of experiments demonstrate that this new restoration model has better performance in terms of average peak signal-to-noise ratio, average structural similarity, average spectral angular distance and visual quality compared with other restoration methods. In addition, for the ill-conditioned problem with cross-channel blurs and noise interference, this model can suppress noise, preserve edge information, and reduce jagged spectral distortion while ensuring the solution speed.
Association Learning: A New Perspective of Mining Association
Qian Yuhua, Zhang Mingxing, Cheng Honghong
2020, 57(2):  424-432.  doi:10.7544/issn1000-1239.2020.20190281
Asbtract ( 668 )   HTML ( 14)   PDF (2892KB) ( 327 )  
Related Articles | Metrics
Discovering associations is an important task in big data mining and analysis. Most of the existing mining methods just summarize the associations among data statistically, and cannot learn experience from known data as well as generalize to unseen instances. This paper attempts to explore the associations from learning perspective, and some formal definitions of association learning and relative model concepts are proposed. According to the definitions, a learning data set, namely, the two-class associated image data sets (TAID) are constructed. Then three association discriminators are designed, where associated image convolutional neural network discriminator (AICNN) and associated image LeNet discriminator (AILeNet) are end-to-end learning using softmax function for discrimination, associated image K-nearest neighbor discriminator (AIKNN) based on the associated features extracted by convolutional neural network adopts the K-nearest neighbor algorithm for discrimination. Furthermore, these discriminators are tested on the TAID. The discriminant accuracy of AICNN on an image training set of 90 000 samples and 64×64 size is 0.821 7; AILeNet and AIKNN on 22 500 256×256 images are 0.845 6 and 0.866 4 respectively. These three experiments effectively demonstrate the feasibility of learning the associations in data.
A DGA Domain Name Detection Method Based on Deep Learning Models with Mixed Word Embedding
Du Peng, Ding Shifei
2020, 57(2):  433-446.  doi:10.7544/issn1000-1239.2020.20190160
Asbtract ( 498 )   HTML ( 19)   PDF (1367KB) ( 383 )  
Related Articles | Metrics
DGA domain name detection plays a key role in preventing botnet attacks. It is practically significant in generating threat intelligence, blocking botnet command and control traffic, and maintaining cyber security. In recent years, DGA domain name detection algorithms have made great progress, from the methods using manually-crafted features to the automatically extracting features generated by deep learning methods. Multiple studies have indicated that deep learning methods perform better in DGA detection. However, DGA families are various and domain name data is imbalanced in the multi-class classification of different DGA families. Many existing deep learning models can still be improved. To solve the above problems, a mixed word embedding method is designed, based on character level embedding and bigram level embedding, to improve the information utilization of domain names. The paper also designs a deep learning model using the mixed word embedding method. At the end of the paper, an experiment with multiple comparison models is conducted to test the model. The experimental results show that the model based on the mixed word embedding achieves better performance in DGA domain name detection and multi-class classification tasks compared with the models based on character level embedding, especially in the small DGA families with few samples. The results show the proposed approach is effective.
Meso-Granularity Labeled Method for Multi-Granularity Formal Concept Analysis
Li Jinhai, Li Yufei, Mi Yunlong, Wu Weizhi
2020, 57(2):  447-458.  doi:10.7544/issn1000-1239.2020.20190279
Asbtract ( 346 )   HTML ( 7)   PDF (792KB) ( 171 )  
Related Articles | Metrics
For the existing multi-granularity labeled formal contexts, the granular labeled values of all the attributes are organized by the union of some single-granularity labeled formal contexts. This may lead to the result that the subsequent related research problems mainly concern knowledge discovery on these single-granularity labeled formal contexts as well as their internal relationships. Consequently, it will not be beneficial for mining multilayer knowledge from multi-granularity labeled formal contexts. In this paper, we discuss meso-granularity labeled formal contexts in multi-granularity labeled formal contexts by restructuring granular labeled values of attributes of the original single-granularity labeled formal contexts, which makes knowledge discovery not limited to coarse and fine granular labeled data formed by data acquisition or representation, but absorbed from the combined data through cross-granularity. Firstly, we give the notion of a meso-granularity labeled formal context and its corresponding semantic interpretation. Secondly, we investigate generalization and specialization of meso-granularity labeled formal contexts, and prove that all the meso-granularity labeled formal contexts form a complete lattice under the generalization-specialization relation. Thirdly, we put forward a meso-granularity based knowledge discovery method for multi-granularity labeled formal decision contexts, and clarify the inference relationship between decision implications extracted from the coarse and fine meso-granularity labeled formal contexts. Finally, our experimental analysis demonstrates the effectiveness and advantages of the proposed meso-granularity labeled methods.