ISSN 1000-1239 CN 11-1777/TP

Table of Contents

01 February 2021, Volume 58 Issue 2
Blockchain-Based Data Transparency: Issues and Challenges
Meng Xiaofeng, Liu Lixin
2021, 58(2):  237-252.  doi:10.7544/issn1000-1239.2021.20200017
Abstract ( 491 )   PDF (1812KB) ( 499 )
With the rapid development of the Internet of Things, wearable devices, and mobile communication technology, large-scale data are continuously generated and converge to multiple data collectors, influencing people's lives in many ways. Meanwhile, this also causes increasingly severe privacy leaks. Traditional privacy-aware mechanisms such as differential privacy, encryption, and anonymization are not enough to deal with this serious situation. Moreover, data convergence leads to data monopoly, which seriously hinders the realization of the value of big data. Besides, tampered data, single points of failure in data quality management, and similar problems may lead to untrustworthy data-driven decision-making. How to use big data correctly has become an important issue. For these reasons, we propose data transparency, aiming to provide a solution for the correct use of big data. Blockchain, which originated from digital currency, has the characteristics of decentralization, transparency, and immutability, and it provides an accountable and secure solution for data transparency. In this paper, we first propose the definition and research dimensions of data transparency from the perspective of the big data life cycle, and we analyze and summarize the methods for realizing data transparency. Then, we summarize the research progress of blockchain-based data transparency. Finally, we analyze the challenges that may arise in the process of realizing blockchain-based data transparency.
Ethical Behavior Discrimination Based on Social News Dataset
Gu Tianlong, Feng Xuan, Li Long, Bao Xuguang, Li Yunhui
2021, 58(2):  253-263.  doi:10.7544/issn1000-1239.2021.20200727
Abstract ( 213 )   PDF (979KB) ( 200 )
With the broader application of artificial intelligence (AI), its ethical and moral issues have attracted more and more concern. How to develop AI systems that comply with human values and ethical norms from the perspective of technical realization, namely ethically aligned AI design, is one of the important issues that urgently needs to be solved. Ethical and moral discrimination based on machine learning is a beneficial exploration in this direction. Social news data contain rich content and knowledge about ethics and morality, which makes it possible to develop training data for machine learning. Accordingly, this paper constructs a social news dataset on the ethics and morality of human behavior, supplemented by a law and code-of-conduct dataset, for machine learning training and testing. An ethical behavior discrimination model, ERNIE-CNN, based on enhanced language representation with informative entities (ERNIE) and a convolutional neural network (CNN), is developed to discriminate the ethics of behaviors by calculating semantic similarity based on the vector representations of words. The experimental results show that the proposed model achieves better performance than the baseline models.
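To make the semantic-similarity step concrete, here is a minimal sketch (not the authors' ERNIE-CNN model): a behavior embedding is compared against labeled reference embeddings by cosine similarity. The embedding vectors, labels, and the discriminate function are hypothetical placeholders; in the paper the representations come from ERNIE and a CNN classifier.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def discriminate(behavior_vec: np.ndarray, references: dict) -> str:
    """Return the label of the most similar labeled reference behavior.
    `references` maps a label (e.g. 'ethical'/'unethical') to embedding vectors."""
    best_label, best_score = None, -1.0
    for label, vecs in references.items():
        for v in vecs:
            score = cosine_similarity(behavior_vec, v)
            if score > best_score:
                best_label, best_score = label, score
    return best_label

# Toy usage: random vectors stand in for sentence embeddings.
rng = np.random.default_rng(0)
refs = {"ethical": [rng.normal(size=128)], "unethical": [rng.normal(size=128)]}
print(discriminate(rng.normal(size=128), refs))
```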
Fairness Research on Deep Learning
Chen Jinyin, Chen Yipeng, Chen Yiming, Zheng Haibin, Ji Shouling, Shi Jie, Cheng Yao
2021, 58(2):  264-280.  doi:10.7544/issn1000-1239.2021.20200758
Abstract ( 324 )   PDF (1752KB) ( 265 )
Deep learning is an important field of machine learning research and is widely used in industry for its powerful feature extraction capabilities and advanced performance in many applications. However, due to bias in training data labeling and model design, research shows that deep learning may aggravate human bias and discrimination in some applications, which results in unfairness during decision-making and thereby causes negative impacts on both individuals and society. To improve the reliability of deep learning and promote its development in the field of fairness, we review, based on existing research work, the sources of bias in deep learning, debiasing methods for different types of biases, fairness metrics for measuring the effect of debiasing, and currently popular debiasing platforms. In the end, we explore the open issues in the fairness research field and future development trends.
A Secure Multi-Party Computation Protocol for Universal Data Privacy Protection Based on Blockchain
Liu Feng, Yang Jie, Li Zhibin, Qi Jiayin
2021, 58(2):  281-290.  doi:10.7544/issn1000-1239.2021.20200751
Abstract ( 263 )   PDF (1496KB) ( 271 )
In recent years, how to protect user privacy data on the blockchain reasonably and efficiently has been a key issue in the blockchain technology field. Motivated by this, this paper designs a secure multi-party computation protocol based on the Pedersen commitment and the Schnorr protocol (a blockchain protocol linking Pedersen commitments with the Schnorr protocol for multi-party computation, BPLSM). Through constructing the structure of the protocol and carrying out formal proofs, it is confirmed that the protocol can be integrated into a blockchain network to merge different private messages for efficient signing under anonymity. In addition, analysis of the properties and security of the protocol proves that the computational overhead of a general-purpose privacy computing scheme using the BPLSM protocol on the blockchain is low, and that it also offers strong information imperceptibility. Finally, experimental simulation results show that the verification time of the BPLSM protocol in a small-scale multi-party transaction with a fixed number of participants is about 83.5% lower than that of the currently mainstream BLS signature.
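As background for the commitment primitive named above, a toy Pedersen commitment over a small prime-order group is sketched below. The parameters are intentionally tiny and insecure, and this is only the generic primitive, not the BPLSM protocol itself.

```python
# Toy Pedersen commitment (insecure parameters, illustration only).
p, q = 23, 11          # q divides p - 1; we work in the order-q subgroup of Z_p*
g, h = 2, 4            # two generators of that subgroup (in a real deployment the
                       # discrete log of h with respect to g must be unknown)

def commit(m: int, r: int) -> int:
    """C = g^m * h^r mod p -- hides m while binding the committer to it."""
    return (pow(g, m % q, p) * pow(h, r % q, p)) % p

def verify(c: int, m: int, r: int) -> bool:
    return c == commit(m, r)

c1, c2 = commit(3, 5), commit(4, 2)
# Additive homomorphism: the product of commitments commits to the sum.
assert (c1 * c2) % p == commit(3 + 4, 5 + 2)
assert verify(c1, 3, 5) and not verify(c1, 4, 5)
print("toy Pedersen commitment checks passed")
```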
Multi-Replica Cloud Data Storage Based on Hierarchical Network Coding
Xu Guangwei, Shi Chunhong, Feng Xiangyang, Luo Xin, Shi Xiujin, Han Songhua, Li Wei
2021, 58(2):  293-304.  doi:10.7544/issn1000-1239.2021.20200340
Abstract ( 80 )   PDF (2658KB) ( 42 )
The rapid development of cloud data storage places high demands on the availability of stored data. Currently, the main technique for ensuring data availability is to use erasure coding to compute coded blocks for the stored data, and then to place multiple redundant coded blocks in the cloud storage space via distributed storage. Although this data coding technology can ensure the security of stored data and reduce extra storage space, it also incurs large computation and communication overhead when recovering corrupted data. In this paper, a multi-replica generation and corrupted data recovery algorithm is proposed based on hierarchical network coding. The algorithm improves the coding matrix of erasure coding based on hierarchical network coding to form hierarchical coding (HC). Multiple replicas are then built by cascading the hierarchical coding, forming coding relationships among one another. In the process of corrupted data recovery, the data encoding information provided by the data owner and the complete data blocks stored by the cloud server are jointly used to recover the corrupted data blocks, avoiding remote data downloads from the cloud storage space. Theoretical analysis and simulation experiments indicate that the proposed algorithm significantly reduces the communication overhead of recovering corrupted data and improves the availability of stored data under the same storage space.
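The recovery idea, rebuilding a corrupted block from coded redundancy instead of re-downloading it, can be illustrated with the simplest possible code, a single XOR parity block; the paper's hierarchical coding matrix is more elaborate than this sketch.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

data = [b"AAAA", b"BBBB", b"CCCC"]          # stored data blocks
parity = xor_blocks(data)                   # redundant coded block

# Suppose data[1] is corrupted; rebuild it from the surviving blocks + parity.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == b"BBBB"
print("corrupted block recovered:", recovered)
```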
An Ant Colony Optimization Algorithms Based Data Update Scheme for Erasure-Coded Storage Systems
Li Qian, Hu Yupeng, Ye Zhenyu, Xiao Ye, Qin Zheng
2021, 58(2):  305-318.  doi:10.7544/issn1000-1239.2021.20200383
Abstract ( 68 )   PDF (4536KB) ( 60 )
Owing to their high availability and space efficiency, erasure codes have become the de facto standard for providing data durability in large-scale distributed storage systems. Update-intensive workloads on erasure codes lead to a large amount of data transmission and I/O cost. As a result, it becomes a major challenge to reduce the amount of data transmission and optimize the use of existing network resources so that the update efficiency of erasure codes can be improved. However, very little research has been done on optimizing the update efficiency of erasure codes under multiple QoS (quality of service) metrics. In this paper, the proposed update scheme, ACOUS (ant colony optimization algorithm based multiple data nodes update scheme), employs a two-stage rendezvous data update procedure to optimize updates across multiple data nodes. Specifically, the two-stage rendezvous procedure performs data delta collection and parity delta distribution efficiently, based on a multi-objective update tree built by MACOU (multi-objective ant colony optimization update routing algorithm). Under typical data center network topologies, extensive experimental results show that, compared with the traditional TA-Update scheme, the proposed scheme achieves a 26% to 37% reduction in update delay with a convergence guarantee, at the cost of negligible computation overhead.
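As a generic illustration of pheromone-guided routing (not the MACOU algorithm or its multi-objective update tree), the sketch below runs a minimal single-objective ant colony optimization loop that converges on the cheapest path of a toy graph; the graph, costs, and parameters are hypothetical.

```python
import random

# Toy directed graph: edge -> cost (e.g., link cost between storage nodes).
graph = {"S": {"A": 2, "B": 5}, "A": {"B": 1, "T": 6}, "B": {"T": 2}, "T": {}}
pheromone = {(u, v): 1.0 for u in graph for v in graph[u]}

def walk():
    """One ant walks from S to T, picking edges with probability
    proportional to pheromone * (1 / cost)."""
    path, node = ["S"], "S"
    while node != "T":
        nbrs = [v for v in graph[node] if v not in path]     # avoid cycles
        if not nbrs:
            return None, float("inf")
        weights = [pheromone[(node, v)] / graph[node][v] for v in nbrs]
        node = random.choices(nbrs, weights=weights)[0]
        path.append(node)
    return path, sum(graph[u][v] for u, v in zip(path, path[1:]))

best_path, best_cost = None, float("inf")
for _ in range(200):
    path, cost = walk()
    if path is None:
        continue
    for edge in pheromone:                    # evaporation
        pheromone[edge] *= 0.95
    for u, v in zip(path, path[1:]):          # deposit, stronger on cheaper paths
        pheromone[(u, v)] += 1.0 / cost
    if cost < best_cost:
        best_path, best_cost = path, cost

print(best_path, best_cost)   # typically converges to S -> A -> B -> T (cost 5)
```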
Node-Constraint Store-and-Forward Scheduling Method for Inter-Datacenter Networks
Lin Xiao, Ji Shuo, Yue Shengnan, Sun Weiqiang, Hu Weisheng
2021, 58(2):  319-337.  doi:10.7544/issn1000-1239.2021.20200384
Abstract ( 68 )   PDF (6089KB) ( 54 )
Performing store-and-forward (SnF) using the abundant storage resources inside datacenters has been proven effective in overcoming the challenges faced by inter-datacenter bulk transfers. Most prior studies attempt to fully leverage the network infrastructure and maximize the flexibility of the SnF scheme. Their proposed scheduling methods hence aim at a full storage placement, where all network nodes (e.g., datacenters) are SnF-enabled and every node is taken into account in the scheduling process. However, the computational complexity of the prior methods increases exponentially with the network scale. As a result, the prior methods may become too complicated to implement for large-scale networks and online scheduling. In this work, based on the inter-datacenter optical network, SnF models are presented to quantify the impact of the number of SnF-enabled nodes on the performance and the complexity of the SnF scheduling problem. Our key findings show that taking a few SnF-enabled nodes into account in the scheduling process can provide high performance while maintaining low complexity under certain circumstances; it is unnecessary to take every node into account. Therefore, a node-constraint SnF scheduling method is proposed, whose features are twofold: 1) by taking only a portion of the nodes into account, it reduces the complexity of the SnF scheduling problem; 2) by introducing a topology abstraction, it condenses the link states between the considered nodes and hence reduces the problem size, which improves its efficiency in solving the SnF scheduling problem. Simulations demonstrate that the proposed method outperforms the prior method in terms of blocking probability and computation time.
Content Sifting Storage Mechanism for Cross-Modal Image and Text Data Based on Semantic Similarity
Liu Yu, Guo Chan, Feng Shuyao, Zhou Ke, Xiao Zhili
2021, 58(2):  338-355.  doi:10.7544/issn1000-1239.2021.20200388
Abstract ( 54 )   PDF (5307KB) ( 52 )
With the explosive growth of multimedia data, the data in the cloud have become heterogeneous and massive. Conventional storage systems serving data analysis face the challenge of long read latency due to the lack of semantic management of data. To solve this problem, a cross-modal image and text content sifting storage (CITCSS) mechanism is proposed, which saves read bandwidth by reading only relevant data. The mechanism consists of an off-line stage and an on-line stage. In the off-line stage, the system first uses a self-supervised adversarial Hash learning algorithm to learn and map the stored data to similar Hash codes. These Hash codes are then connected by Hamming distances and managed as metadata. In the implementation, we use Neo4j to construct the semantic Hash code graph and store the storage paths in the node properties to accelerate reads. In the on-line stage, our mechanism first maps the image or text representing the analysis requirement into Hash codes and sends them to the semantic Hash code graph. The relevant data are then found within the sifting radius on the graph and finally returned to the user. Benefiting from our mechanism, storage systems can perceive and manage semantic information and thus provide services for analysis in advance. Experimental results on public cross-modal datasets show that CITCSS can greatly reduce read latency, by 99.07% to 99.77%, with a recall rate of more than 98% compared with conventional semantic storage systems.
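A minimal sketch of the sifting step, assuming binary hash codes are already available: items whose Hamming distance from the query code falls within the sifting radius are returned, so only relevant objects are read. The index contents and paths below are hypothetical; the actual system learns the codes adversarially and manages them in a Neo4j graph.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two hash codes stored as integers."""
    return bin(a ^ b).count("1")

# Hypothetical index: hash code -> storage path of the underlying object.
index = {0b10110010: "/data/img_001.jpg",
         0b10110110: "/data/txt_042.txt",
         0b01001101: "/data/img_777.jpg"}

def sift(query_code: int, radius: int):
    """Return paths of items whose codes fall within the sifting radius."""
    return [path for code, path in index.items()
            if hamming(query_code, code) <= radius]

print(sift(0b10110011, radius=2))   # reads only the semantically relevant items
```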
Rethinking Index Design Based on Persistent Memory Device
Han Shukai, Xiong Ziwei, Jiang Dejun, Xiong Jin
2021, 58(2):  356-370.  doi:10.7544/issn1000-1239.2021.20200394
Abstract ( 60 )   PDF (4151KB) ( 70 )
NVM (non-volatile memory) is a new type of storage medium that has emerged in recent years. On the one hand, similar to DRAM (dynamic RAM), NVM has low access latency and byte addressability; on the other hand, it does not lose data after a power failure. Moreover, it has higher density and lower power consumption. The emergence of NVM provides new opportunities for improving indexing efficiency, and thus many works focus on building NVM-based indexes. However, these works were conducted on simulated NVM devices. In April 2019, Intel released real NVM hardware, AEP (Apache Pass), based on 3D-XPoint technology. We evaluate actual AEP devices, and the results show that the write latency of AEP is close to that of DRAM, while the read latency is 3 to 4 times that of DRAM. Based on actual NVM hardware performance, we find that many past works made biased performance assumptions about NVM, which leaves room for further optimization. We then revisit previous persistent indexing works. We propose a read-optimized hybrid index (HybridIndex++) and a hybrid-memory-based asynchronous caching approach for persistent indexes. Experimental results show that the read performance of HybridIndex++ is 1.8 times that of the existing hybrid index. The asynchronous cache-optimized indexes can reduce latency by up to 50%.
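As a generic sketch of the hybrid-memory idea, keeping a volatile DRAM cache in front of a slower persistent tier so that hot reads avoid NVM's higher read latency, the toy class below uses an in-process dict as a stand-in for NVM; it does not reproduce HybridIndex++ or its asynchronous cache maintenance, and all names are illustrative.

```python
from collections import OrderedDict

class HybridReadCache:
    """LRU cache in (volatile) DRAM in front of a persistent index.
    The 'nvm' dict stands in for an NVM-resident structure; on real
    hardware it would live in persistent memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.dram = OrderedDict()      # hot entries, lost on power failure
        self.nvm = {}                  # durable entries (stand-in)

    def put(self, key, value):
        self.nvm[key] = value          # writes go to the durable tier
        self.dram.pop(key, None)       # keep the cache consistent

    def get(self, key):
        if key in self.dram:           # fast path: DRAM hit
            self.dram.move_to_end(key)
            return self.dram[key]
        value = self.nvm[key]          # slow path: read from the NVM tier
        self.dram[key] = value         # populate the cache
        if len(self.dram) > self.capacity:
            self.dram.popitem(last=False)
        return value

idx = HybridReadCache(capacity=2)
idx.put("a", 1); idx.put("b", 2); idx.put("c", 3)
print(idx.get("a"), idx.get("a"))      # second read is served from DRAM
```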
One-Direction Shift B+-Tree Based on Persistent Memory
Yan Wei, Zhang Xingjun, Ji Zeyu, Dong Xiaoshe, Ji Chenzhao
2021, 58(2):  371-383.  doi:10.7544/issn1000-1239.2021.20200403
Abstract ( 69 )   PDF (1088KB) ( 59 )
Persistent memory (PM) fascinates many researchers with its high scalability, byte addressability, and low static energy consumption, which can contribute to the fusion of main memory and secondary storage. However, the interaction granularity between the volatile last-level cache (LLC) and the non-volatile PM is normally 64B, which is much larger than the 8B granularity of the atomic update operations provided by the memory system. If an outage takes place while data are being transferred from the LLC to PM, data integrity on the PM may be destroyed, breaking failure atomicity. Most studies solve this problem by forcing data to be persisted in a specific order using flush and fence instructions, which always incur large overhead. This overhead is amplified especially in index updates, which often trigger transformations of the index structure. In this paper, we focus on reducing the persistence overhead of ensuring failure atomicity when updating a PM-based B+-Tree. We propose a one-direction shift (ODS) algorithm based on the actual data layout in the tree node as well as the node utilization, the persistence overhead of different update modes, and the relations between different operations. The algorithm performs in-place deletion to reduce deletion overhead and takes advantage of the holes left by in-place deletion to further reduce the shifting overhead of insertion. We redesign the rebalancing process of our persistent B+-Tree according to the ODS algorithm. Experiments show that the proposed ODS tree outperforms state-of-the-art PM trees on single and synthetic workloads.
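The benefit of reusing deletion holes can be sketched on a single leaf: deletions leave None holes in place, and a later insertion shifts only the short run of keys between the insertion point and the nearest hole, so fewer slots have to be rewritten (and, on real PM, flushed). This is a simplified toy of that general idea, not the paper's ODS algorithm or its rebalancing.

```python
def ods_insert(node, key):
    """Insert `key` into a sorted leaf whose deleted slots are left as None
    'holes' (in-place deletion).  Only the keys between the insertion point
    and the nearest hole are shifted by one slot.
    Assumes the node still has at least one hole; otherwise it would split."""
    n = len(node)
    pos = n
    for i, k in enumerate(node):                 # first live key >= key
        if k is not None and k >= key:
            pos = i
            break
    right_hole = next((i for i in range(pos, n) if node[i] is None), None)
    if right_hole is not None:                   # shift [pos, hole) one slot right
        for i in range(right_hole, pos, -1):
            node[i] = node[i - 1]
        node[pos] = key
    else:                                        # shift (hole, pos) one slot left
        left_hole = max(i for i in range(pos) if node[i] is None)
        for i in range(left_hole, pos - 1):
            node[i] = node[i + 1]
        node[pos - 1] = key
    return node

leaf = [1, None, 5, 7]          # index 1 is a hole left by an in-place deletion
print(ods_insert(leaf, 3))      # -> [1, 3, 5, 7]; the hole absorbs the insert,
                                #    so no existing key has to be moved/persisted
```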
A Distributed Persistent Memory File System Based on RDMA Multicast
Chen Maotang, Zheng Sheng’an, You Litong, Wang Jingyu, Yan Tian, Tu Yaofeng, Han Yinjun, Huang Linpeng
2021, 58(2):  384-396.  doi:10.7544/issn1000-1239.2021.20200369
Abstract ( 100 )   PDF (1879KB) ( 67 )
The development of persistent memory and remote direct memory access (RDMA) provides new opportunities for designing efficient distributed systems. However, existing RDMA-based distributed systems are far from fully exploiting RDMA multicast capabilities, which makes it difficult for them to handle multi-copy file data transmission in one-to-many scenarios, degrading system performance. In this paper, a distributed file system based on persistent memory and RDMA multicast transmission (MTFS) is proposed. It efficiently transmits data to different data nodes through a low-latency multicast transmission mechanism, which makes full use of the RDMA multicast capability and thus avoids the high latency caused by multi-copy file data transmission operations. To improve the flexibility of transmission operations, a multi-mode multicast remote procedure call (RPC) mechanism is proposed, which adaptively recognizes RPC requests and moves transmission operations out of the critical path to further improve transmission efficiency. MTFS also provides a lightweight consistency guarantee mechanism. By designing a crash recovery mechanism, a data verification module, and a retransmission scheme, MTFS is able to recover quickly from crashes and achieves file system reliability and data consistency through error detection and data correction. Experimental results show that MTFS increases throughput by 10.2 to 219 times compared with GlusterFS. MTFS outperforms NOVA by 10.7% on the Redis workload, and achieves good scalability in multi-threaded workloads.
A Multicore-Friendly Persistent Memory Key-Value Store
Wang Qing, Zhu Bohong, Shu Jiwu
2021, 58(2):  397-405.  doi:10.7544/issn1000-1239.2021.20200381
Abstract ( 91 )   PDF (765KB) ( 107 )
Compared with traditional memory (i.e., DRAM), persistent memory has the characteristics of large capacity and non-volatility, which brings new opportunities for building large-scale persistent memory key-value stores. However, designing a persistent memory key-value store under a multicore server architecture faces many challenges, including CPU cache thrashing caused by concurrency control, consumption of and competition for the limited write bandwidth of persistent memory, and aggravated conflicts between threads caused by the high latency of persistent memory. In this paper, MPKV, a multicore-friendly persistent memory key-value store, is proposed. By designing efficient concurrency control methods and reducing writes to persistent memory, it fully improves multicore concurrency performance. To avoid the extra persistent memory write bandwidth consumed by lock resources, MPKV introduces a volatile lock management mechanism that separates write lock resources from the indexes and maintains them separately in DRAM. To ensure crash consistency and improve concurrent query performance, MPKV introduces a two-phase atomic write mechanism, which uses the atomic write instructions provided by the CPU to atomically switch the system from one consistent state to another and further supports lock-free queries. Based on the volatile lock management mechanism, MPKV also introduces a concurrent write elimination mechanism to improve the efficiency of concurrent update operations: when two conflicting update operations occur, one of them is allowed to return directly without any persistent memory allocation or write. Experiments show that MPKV has better performance and multicore scalability than pmemkv. Specifically, in an 18-thread environment, the throughput of MPKV reaches 1.7 to 6.2 times that of pmemkv.
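A toy sketch of the volatile-lock idea, assuming striped write locks kept purely in DRAM while only the data itself would be written to persistent memory; it does not reproduce MPKV's two-phase atomic writes or concurrent write elimination, and the class and field names are hypothetical.

```python
import threading

class VolatileLockKV:
    """Key-value store whose write locks live only in DRAM.
    The `_data` dict stands in for data placed in persistent memory;
    lock state is never written to that region."""

    def __init__(self, stripes: int = 64):
        self._locks = [threading.Lock() for _ in range(stripes)]   # DRAM only
        self._data = {}                                            # stand-in for PM

    def _lock_for(self, key):
        return self._locks[hash(key) % len(self._locks)]

    def put(self, key, value):
        with self._lock_for(key):          # serialize conflicting writers
            self._data[key] = value        # the only write that would hit PM

    def get(self, key):
        return self._data.get(key)         # reads take no lock

kv = VolatileLockKV()
kv.put("user:1", "alice")
print(kv.get("user:1"))
```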
MixStore: Back-End Storage Based on Persistent Memory and SSD
Tu Yaofeng, Chen Zhenghua, Han Yinjun, Chen Bing, Guan Donghai
2021, 58(2):  406-417.  doi:10.7544/issn1000-1239.2021.20200389
Abstract ( 76 )   PDF (1157KB) ( 68 )
Persistent memory (PMEM) combines the low-latency byte addressability of memory with the persistence of disk, which will bring revolutionary changes and far-reaching impact to existing software architectures. Distributed storage has been widely used in cloud computing and data centers, but existing back-end storage engines, represented by Ceph BlueStore, are designed for traditional mechanical disks and solid-state disks (SSDs), and their original optimization mechanisms are not suited to exploiting the advantages of PMEM. In this paper, a back-end storage engine, MixStore, based on PMEM and SSD is proposed. Using volatile range marking and a to-be-deleted list, a concurrent SkipList suitable for persistent memory is implemented and used to replace RocksDB for metadata management. While ensuring transaction consistency, it eliminates the performance jitter and other issues caused by BlueStore's compaction and improves the concurrent access performance of metadata. On top of this metadata management mechanism, an optimized data object storage design is adopted: non-aligned small blocks of data objects are stored in PMEM, while aligned large blocks are stored on the SSD, making full use of the byte addressability and durability of PMEM and the large capacity and low cost of the SSD. The data update policy is optimized with delayed writes and copy-on-write (CoW), eliminating the write amplification caused by BlueStore's WAL and improving small-write performance. The test results show that, under the same hardware environment, MixStore's write throughput is 59% higher and its write latency is 37% lower than BlueStore's, which effectively improves system performance.
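A toy sketch of the placement policy described above, assuming a single block size: one logical write is split so that block-aligned full blocks go to the SSD while the unaligned head and tail go to PMEM. The block size, device labels, and function name are placeholders, not MixStore's actual interfaces.

```python
BLOCK = 4096   # assumed SSD block size

def place(offset: int, data: bytes):
    """Split one logical write into (device, offset, bytes) placements:
    block-aligned full blocks -> SSD, unaligned head/tail -> PMEM."""
    placements = []
    end = offset + len(data)
    aligned_start = (offset + BLOCK - 1) // BLOCK * BLOCK     # round up
    aligned_end = end // BLOCK * BLOCK                        # round down
    if aligned_start >= aligned_end:                          # no full block
        return [("PMEM", offset, data)]
    if offset < aligned_start:                                # unaligned head
        placements.append(("PMEM", offset, data[:aligned_start - offset]))
    placements.append(("SSD", aligned_start,
                       data[aligned_start - offset:aligned_end - offset]))
    if aligned_end < end:                                     # unaligned tail
        placements.append(("PMEM", aligned_end, data[aligned_end - offset:]))
    return placements

for device, off, chunk in place(1000, bytes(10000)):
    print(device, off, len(chunk))
# PMEM 1000 3096  |  SSD 4096 4096  |  PMEM 8192 2808
```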
Estimating QoE for OTT Video Service Through XDR Data Analysis
Huang Lisheng, Ran Jinye, Luo Jing, Zhang Xiangyin
2021, 58(2):  418-426.  doi:10.7544/issn1000-1239.2021.20190759
Abstract ( 81 )   PDF (1119KB) ( 176 )
Over-the-top (OTT) video services have gradually become one of the most popular online services. However, due to poor network quality, overloaded service platforms, and other reasons, OTT services often encounter quality problems such as playback failure, frequent stalling, and long buffering times, which lead to a decline in quality of experience (QoE). Internet service providers need to evaluate the QoE of OTT video services accurately so as to identify quality problems in advance and further optimize networks and services. In this paper, a no-reference OTT video quality estimation method based on extended data record (XDR) data analysis is proposed. It extracts a small amount of information highly relevant to video quality from a large amount of XDR data, converting large-scale, low-value XDR data into small-scale, high-value video quality features. This method facilitates the application of subsequent artificial intelligence algorithms for OTT video quality evaluation, reduces the cost of further data mining, and improves the accuracy of machine learning models and QoE evaluation results. The data extracted by this method are used for OTT video QoE prediction, and experimental results show that the accuracy of the QoE prediction is significantly better than that of evaluation using the raw XDR data.
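A minimal sketch of the feature-extraction idea, assuming hypothetical XDR-like per-event records: raw events are aggregated into a compact per-session feature vector (stall count, buffering ratio) that a QoE prediction model could consume. The field names and event types are illustrative, not the actual XDR schema.

```python
def extract_features(xdr_records):
    """Aggregate raw per-event records into per-session QoE-relevant features."""
    sessions = {}
    for r in xdr_records:
        s = sessions.setdefault(r["session_id"],
                                {"stalls": 0, "stall_ms": 0, "play_ms": 0})
        if r["event"] == "stall":
            s["stalls"] += 1
            s["stall_ms"] += r["duration_ms"]
        elif r["event"] == "play":
            s["play_ms"] += r["duration_ms"]
    return {sid: {"stall_count": s["stalls"],
                  "buffering_ratio": s["stall_ms"] / max(s["stall_ms"] + s["play_ms"], 1)}
            for sid, s in sessions.items()}

records = [{"session_id": 1, "event": "play", "duration_ms": 58000},
           {"session_id": 1, "event": "stall", "duration_ms": 2000}]
print(extract_features(records))   # small feature vectors fed to a QoE model
```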
Visual Tracking Algorithm Based on Adaptive Spatial Regularization
Tan Jianhao, Zhang Siyuan
2021, 58(2):  427-435.  doi:10.7544/issn1000-1239.2021.20200021
Abstract ( 62 )   PDF (5845KB) ( 67 )
In correlation filter-based visual tracking algorithms, generating sample sets by cyclic shifts greatly reduces the amount of computation. However, it also introduces boundary effects, and the resulting erroneous samples weaken the discriminative ability of the classifier. To solve this problem, a visual tracking algorithm based on adaptive spatial regularization is proposed. An adaptive spatial regularization term is introduced into the classic filtering model. By establishing the correlation of regularization weights between adjacent frames, the regularization weights of the model can be adjusted adaptively. In this way, the risk of overfitting to unreal samples is reduced, thereby mitigating the boundary effect. We adopt a scale estimation strategy with an adaptive aspect ratio, which can accurately track scale changes of the target. In addition, an update strategy based on color histogram similarity is used to avoid model updates when tracking is inaccurate, thereby suppressing model drift and improving tracking accuracy and speed. Experiments show that the success rate and accuracy of our algorithm on UAV123, OTB2013, and OTB2015 are higher than those of all compared algorithms, and even in various complex scenes our algorithm maintains a high tracking success rate. In particular, in the presence of motion blur and in-plane rotation, its success rate scores are 9.72% and 9.03% higher than those of the second-best algorithm, respectively, which shows that the algorithm has good adaptability.
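For context, a representative spatially regularized correlation filter objective (in the spirit of SRDCF) is shown below; the adaptive variant in this paper additionally learns the spatial weight map from frame to frame, which this generic form does not capture.

\[
\min_{f}\;\Bigl\|\sum_{d=1}^{D} x_{d} \ast f_{d} - y\Bigr\|_{2}^{2} \;+\; \sum_{d=1}^{D}\bigl\| w \odot f_{d} \bigr\|_{2}^{2}
\]

where \(x_{d}\) is the \(d\)-th feature channel of the training sample, \(f_{d}\) the corresponding filter, \(y\) the desired response, \(\ast\) circular correlation, \(\odot\) elementwise multiplication, and \(w\) a spatial weight map that penalizes filter energy near the patch boundary, thereby suppressing the boundary effect of cyclic-shift samples.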
Infrared and Visible Image Fusion Based on Semantic Segmentation
Zhou Huabing, Hou Jilei, Wu Wei, Zhang Yanduo, Wu Yuntao, Ma Jiayi
2021, 58(2):  436-443.  doi:10.7544/issn1000-1239.2021.20200244
Abstract ( 131 )   PDF (1755KB) ( 244 )
Infrared images can distinguish targets from their backgrounds due to differences in thermal radiation, even in poor lighting conditions. By contrast, visible images can represent texture details with high spatial resolution. Meanwhile, both infrared and visible images preserve corresponding semantic information. Therefore, infrared and visible image fusion should keep both the radiation information of the infrared image and the texture details of the visible image, and it also needs to preserve the semantic information of both. Semantic segmentation can transform the source images into masks carrying semantic information. In this paper, an infrared and visible image fusion method based on semantic segmentation is proposed. It overcomes the shortcoming that existing fusion methods do not treat different regions specifically. Considering the specific information in different regions of infrared and visible images, we design two loss functions for different regions to improve the quality of the fused image under the framework of a generative adversarial network. Firstly, we obtain masks of the infrared images with semantic information via semantic segmentation, and use them to divide the infrared and visible images into an infrared target area, an infrared background area, a visible target area, and a visible background area. Secondly, we employ different methods to fuse the target and background areas, respectively. Finally, we combine the two regions to obtain the final fused image. Experiments show that the proposed method outperforms the state of the art; our results have higher contrast in the target area and richer texture details in the background area.
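A toy numpy sketch of the final combination step: a binary semantic mask selects the target region from one per-region fusion result and the background region from the other. The mask and the two per-region rules (maximum and average) are placeholders for the segmentation network and the GAN-trained fusion in the paper.

```python
import numpy as np

def fuse(ir: np.ndarray, vis: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Combine per-region fusion results using a semantic target mask.
    mask == 1 marks target pixels, mask == 0 marks background pixels."""
    target_fused = np.maximum(ir, vis)          # placeholder rule: keep hot targets
    background_fused = (ir + vis) / 2.0         # placeholder rule: blend textures
    return mask * target_fused + (1 - mask) * background_fused

ir = np.random.rand(4, 4)
vis = np.random.rand(4, 4)
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1    # toy segmentation mask
print(fuse(ir, vis, mask))
```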