    2022  Vol. 59  No. 3

    Abstract:
    The latency of solid-state drives (SSDs) has improved dramatically in recent years; an ultra-low-latency SSD can process 4 KB of data in 10 microseconds. With such low latency, reaping I/O completions efficiently becomes an important issue in modern storage systems. Traditional storage systems reap I/O completions through hardware interrupts, which introduce extra context-switch overhead and further prolong the overall I/O latency. Existing work uses polling as an alternative to hardware interrupts, thereby eliminating context switches, but at the cost of high CPU consumption. This paper proposes a CPU-efficient and low-latency storage engine, named NIO, to take full advantage of ultra-low-latency SSDs. The key idea of NIO is to separate the I/O path of short I/Os from that of long I/Os: NIO uses classic hardware interrupts for long I/Os, since polling long I/Os brings no significant improvement yet incurs heavy CPU overhead; for short I/Os, NIO introduces lazy polling, which lets the I/O thread sleep for a variable time interval before polling continuously, thereby achieving low latency with low CPU consumption. NIO further introduces a transaction-aware I/O reaping mechanism to reduce transaction latency, and a dynamic adjustment mechanism to cope with dynamic changes in the workload and in the internal activities of the device. Under dynamic workloads, NIO delivers performance comparable to a polling-based storage engine while reducing CPU consumption by at least 59%.
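    The lazy-polling idea described above can be sketched in a few lines. The sleep fraction, the queue interface, and the function names below are illustrative assumptions, not NIO's actual implementation:

```python
import time

def lazy_poll(io_queue, expected_latency_us):
    """Sketch of lazy polling: sleep through most of the expected
    device latency, then busy-poll only the short remaining window.
    `io_queue` and its methods are hypothetical."""
    # Sleep while the device is almost certainly still working,
    # freeing the CPU instead of spinning the whole time.
    time.sleep(expected_latency_us * 0.8 / 1e6)
    # Busy-poll the tail of the I/O to keep completion latency low.
    while not io_queue.completion_ready():
        pass
    return io_queue.reap_completion()
```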
    Abstract:
    The emerging non-volatile memory (NVM) offers many advantages, including byte-addressability, durability, large capacity, and low energy consumption. However, concurrent programming on NVM is difficult, because users must ensure not only crash consistency but also the correctness of concurrent execution. To reduce the development difficulty, persistent transactional memory has been proposed, but most existing persistent transactional memories scale poorly. Through testing, we find that the factors limiting scalability are the global logical clock and redundant NVM writes. To eliminate the impact of these two factors on scalability, we propose two methods: a thread logical clock, which removes the centralization of the global logical clock by giving each thread an independent clock; and a cache-line-aware dual-version method, which maintains two versions of the data and updates them alternately to ensure crash consistency, thereby eliminating redundant NVM writes. Based on these two methods, a scalable durable transactional memory (SDTM) is implemented and fully tested. The results show that under YCSB workloads, SDTM outperforms DudeTM and PMDK by up to 2.8 times and 29 times, respectively.
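    The alternating dual-version idea can be illustrated with a minimal sketch; the class structure and method names are assumptions for illustration, not SDTM's actual cache-line layout:

```python
class DualVersion:
    """Illustrative sketch: keep two versions of a value and update
    them alternately, so one version is always consistent even if a
    crash interrupts the other's update."""
    def __init__(self, value):
        self.slots = [value, value]   # two persistent copies
        self.active = 0               # index of the current version

    def update(self, new_value):
        target = 1 - self.active      # write into the inactive slot
        self.slots[target] = new_value
        # A real implementation would flush the cache line to NVM
        # here (persist barrier) before switching versions.
        self.active = target          # switch to the new version

    def read(self):
        return self.slots[self.active]
```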
    Abstract:
    Convolutional neural networks (CNNs) have already exceeded human capabilities in many fields. However, as the memory consumption and computational complexity of CNNs continue to increase, the “memory wall” problem, which constrains data exchange between the processing unit and the memory unit, impedes their deployment in resource-constrained environments such as edge computing and the Internet of Things. ReRAM (resistive RAM)-based hardware accelerators have been widely applied to accelerate matrix-vector multiplication thanks to their high density and low power, but they are ill-suited to 32-bit floating-point computation, which raises the demand for quantization to reduce data precision. Manually determining the bitwidth for each layer is time-consuming; therefore, recent studies leverage DDPG (deep deterministic policy gradient) to perform automated quantization on FPGA (field-programmable gate array) platforms, but DDPG needs to convert continuous actions into discrete ones, and resource constraints are met by manually decreasing the bitwidth of each layer. This paper proposes a PPO (proximal policy optimization)-based automated quantization for ReRAM-based hardware accelerators, which uses a discrete action space to avoid the action-space conversion step. We define a new reward function that enables the PPO agent to automatically learn an optimal quantization policy satisfying the resource constraints, and we describe software and hardware modifications to support mixed-precision computing. Experimental results show that compared with coarse-grained quantization, the proposed method reduces hardware cost by 20%~30% with negligible loss of accuracy. Compared with other automated quantization methods, it has a shorter search time and further reduces hardware cost by about 4.2% under the same resource constraints. This provides insights for the co-design of quantization algorithms and hardware accelerators.
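    The reward shaping described above might look like the following sketch; the exact reward definition, the penalty constant, and the cost model are assumptions for illustration, not the paper's formula:

```python
def reward(accuracy, hw_cost, cost_budget, penalty=1.0):
    """Hypothetical reward for a PPO quantization agent: reward
    validation accuracy, and penalize bitwidth configurations that
    exceed the ReRAM resource budget, so the constraint is learned
    by the agent rather than enforced by manual bitwidth reduction."""
    if hw_cost <= cost_budget:
        return accuracy
    # Soft penalty proportional to how far the budget is exceeded.
    return accuracy - penalty * (hw_cost - cost_budget) / cost_budget
```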
    Abstract:
    Matrix-vector multiplication (MVM) is a key computing kernel in high-performance scientific computing. Recent work by Feinberg et al. has proposed a method for deploying high-precision operands on memristive crossbars, showing great potential for accelerating scientific MVM. Since different types of scientific applications have different precision requirements, providing computation methods appropriate to a specific application is an effective way to further reduce energy consumption. This paper proposes a system with mantissa compaction and alignment optimization strategies. While implementing the basic function of high-precision floating-point memristive MVM, the proposed system can also select the number of retained mantissa bits according to the application's precision requirements. By skipping the activation of the low-order crossbars that hold the least significant mantissa bits, as well as the redundant alignment crossbars, the energy consumption of the computational crossbars and peripheral circuits is significantly reduced. The evaluation shows that when the crossbar-based in-memory solvers for sparse linear systems reach average solving residuals on the order of 0 to 10^{-3} relative to the software baseline, the average energy consumption of the computational crossbars and of the peripheral analog-to-digital converters is reduced by 5%~65% and 30%~55%, respectively, compared with existing work without these optimizations.
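    The effect of mantissa compaction can be mimicked in software by truncating the low-order mantissa bits of an IEEE-754 double, which is what skipping the low-significance crossbars amounts to numerically. The bit layout below is standard IEEE-754; the mapping to crossbars is the paper's and is not reproduced here:

```python
import struct

def compact_mantissa(x: float, kept_bits: int) -> float:
    """Truncate an IEEE-754 double's 52-bit mantissa to `kept_bits`,
    zeroing the least significant bits (illustrative analogue of
    deactivating low-order crossbars)."""
    assert 0 <= kept_bits <= 52
    raw = struct.unpack('<Q', struct.pack('<d', x))[0]
    mask = ~((1 << (52 - kept_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack('<d', struct.pack('<Q', raw & mask))[0]

# Fewer kept bits -> lower energy, larger residual in the solver.
print(compact_mantissa(3.141592653589793, 20))
```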
    Abstract:
    Persistent memory has excellent characteristics such as non-volatility, byte-addressability, fast random access, low energy consumption, and good scalability, which provide new opportunities for big data storage and processing. However, the crash-consistency problem of persistent memory systems poses a challenge to its widespread application. Existing work on crash-consistency guarantees usually comes at the cost of extra reads and writes, which affects the performance and lifetime of persistent memory systems in both the time and space dimensions. To reduce this impact, an endurance-aware out-of-place update scheme for persistent memory (EAOOP) is proposed. Through software-transparent out-of-place updates, EAOOP provides endurance-aware memory management for persistent memory, alternately flushing data to the original data region and the updated data region. EAOOP not only guarantees crash consistency but also avoids redundant data-merging operations. Meanwhile, to use memory space efficiently, lightweight garbage collection runs in the background to process old data in the updated data region, reducing extra write amplification and bandwidth occupation, and thereby further reducing the impact on the lifetime and performance of the persistent memory. Evaluations show that EAOOP achieves higher performance and lower overhead than existing work: transaction throughput is increased by 1.6 times, while critical-path latency and the number of writes are both reduced by 1.3 times.
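    The out-of-place update discipline with background garbage collection can be sketched abstractly as below; the two-region structure and method names are hypothetical simplifications, not EAOOP's actual memory layout:

```python
class OutOfPlaceHeap:
    """Illustrative sketch of out-of-place updates: writes go to an
    update region instead of overwriting in place, and background GC
    reclaims stale entries."""
    def __init__(self):
        self.original = {}   # original data region
        self.updated = {}    # updated data region (out-of-place)

    def write(self, key, value):
        # Never overwrite in place: a crash mid-write leaves the old
        # version in `original` intact, preserving crash consistency.
        self.updated[key] = value

    def read(self, key):
        return self.updated.get(key, self.original.get(key))

    def gc(self):
        # Background garbage collection: fold committed updates back
        # and free the update region, limiting write amplification.
        self.original.update(self.updated)
        self.updated.clear()
```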
    Abstract:
    When massive amounts of data access a heterogeneous memory system, memory pages frequently migrate between DRAM and NVM. However, traditional page migration strategies struggle to adapt to rapid dynamic changes between “hot” and “cold” pages: a “cold” page just migrated from DRAM to NVM may become “hot” again, resulting in a large number of redundant migrations as well as false migrations. Previous research focuses only on pages that are being migrated, paying little attention to pages in the migration waiting queue or pages that have already been migrated. Therefore, this paper proposes VC-HMM, a heterogeneous memory page migration mechanism based on a DRAM victim cache, which adds a small-capacity victim cache between DRAM and PCM. Cold pages are migrated from DRAM to the victim cache, which avoids the redundant migrations caused by main-memory pages becoming hot again within a short time. Meanwhile, some pages never need to be written back to PCM at all, which reduces the number of PCM write operations and extends the lifetime of PCM. In particular, VC-HMM automatically tunes its migration parameters for different workloads to make migration decisions more reasonable. Experimental results show that, compared with other migration strategies (CoinMigrator, MQRA, THMigrator), VC-HMM reduces the average number of PCM write operations by 62.97%, average access latency by 22.72%, re-migration count by 38.37%, and energy consumption by 3.40%.
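    A minimal sketch of the victim-cache role described above, assuming an LRU eviction order and a hypothetical `pcm` backing object (both illustrative; the paper's policies differ in detail):

```python
from collections import OrderedDict

class VictimCache:
    """Sketch of a DRAM victim cache between DRAM and PCM: demoted
    'cold' pages are parked here first, so pages that turn hot again
    are promoted back to DRAM without ever touching PCM."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> data, LRU order

    def demote(self, page_id, data, pcm):
        self.pages[page_id] = data
        self.pages.move_to_end(page_id)
        if len(self.pages) > self.capacity:
            # Only pages that stayed cold ever reach PCM.
            victim, vdata = self.pages.popitem(last=False)
            pcm.write(victim, vdata)

    def promote(self, page_id):
        # Page became hot again: return it to DRAM, no PCM write.
        return self.pages.pop(page_id, None)
```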
    Abstract:
    RS (Reed-Solomon) codes can be constructed with whatever fault tolerance the application environment requires, which gives them good flexibility, and storage systems that use RS erasure codes for fault tolerance achieve optimal storage efficiency. However, compared with XOR (exclusive-OR)-based erasure codes, RS erasure codes take too long to decode, which greatly hinders their use in distributed storage systems. To solve this problem, this paper proposes a new decoding method for RS erasure codes. The method completely discards the matrix inversion commonly used in current RS decoding methods and relies only on lower-complexity additions and multiplications: the linear combinations that express the invalid symbols in terms of the valid symbols are obtained by simple matrix transformations on a constructed decoding transform matrix, thereby reducing the computational complexity of decoding. The correctness of the method is proved theoretically. In the experiments, files of various sizes are each divided into blocks of three different sizes, and the resulting data blocks are tested. The results show that, for all tested block sizes, the new decoding method has a lower decoding time than existing methods.
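    For intuition, erasure decoding without forming an explicit inverse matrix can be done with row transformations alone. The sketch below is a generic illustration over a prime field standing in for GF(2^w); the paper's actual decoding transform matrix and field arithmetic are not reproduced here:

```python
P = 257  # a prime field keeps the arithmetic of the sketch simple

def solve_by_row_ops(A, b):
    """Solve A x = b (mod P) using Gaussian row transformations only,
    never materializing an inverse matrix."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] % P)
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], P - 2, P)          # scalar inverse only
        M[col] = [v * inv % P for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * w) % P for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

# Recover data from any k surviving symbols of a Vandermonde code:
k = 3
data = [7, 42, 19]
rows = [[pow(x, j, P) for j in range(k)] for x in range(1, 6)]  # 5x3
code = [sum(r * d for r, d in zip(row, data)) % P for row in rows]
survivors = [0, 2, 4]                      # symbols 1 and 3 were lost
rec = solve_by_row_ops([rows[i] for i in survivors],
                       [code[i] for i in survivors])
assert rec == data
```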
    Abstract:
    Large-scale unstructured data management brings unprecedented challenges to existing relational databases. Key-value stores based on the log-structured merge tree (LSM-tree) are widely used and play an essential role in data-intensive applications. The LSM-tree converts random writes into sequential ones, thereby improving write performance. However, LSM-tree key-value stores also have problems. First, they rely on compaction operations to reorganize data and balance system performance, but compaction hurts performance and causes serious write amplification. Second, the traditional compute-centric data transmission further limits overall system performance during compaction. This paper applies the data-centric near-data processing (NDP) model to the storage system and proposes CoPro, a collaborative parallel compaction optimization for LSM-tree key-value stores. Two forms of parallelism, data parallelism and pipeline parallelism, are fully exploited to improve compaction performance. When compaction is triggered, the host-side CoPro determines the partitioning ratio of the compaction tasks according to the offloading strategy and divides the tasks accordingly; the compaction subtasks are then dispatched to the host and device sides, respectively, through the semantic management module. We further design a decision component on both the host and device sides, yielding a variant denoted CoPro+, which dynamically adjusts the degree of parallelism according to changes in system resources and in the values of the key-value pairs in the workload. Extensive experimental results validate the benefits of CoPro compared with two popular NDP-based key-value stores.
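    The host/device split can be pictured with a toy partitioner; the ratio-based splitting below is a guess at the general shape of the mechanism, and the function and its arguments are hypothetical:

```python
def split_compaction(sstables, device_ratio):
    """Hypothetical sketch of CoPro-style task partitioning: split a
    compaction's input SSTables between device-side (near-data) and
    host-side workers according to an offloading ratio chosen by the
    decision policy."""
    cut = int(len(sstables) * device_ratio)
    return sstables[:cut], sstables[cut:]   # (device tasks, host tasks)

# e.g. offload 60% of the merge work to near-data processors:
device_tasks, host_tasks = split_compaction(list(range(10)), 0.6)
```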
    Abstract:
    Probabilistic generative models are important methods for knowledge representation. Exact probabilistic inference is intractable in these models, so various approximate inference methods are required. Variational inference is an important class of deterministic approximate inference methods with rapid convergence and solid theoretical foundations, and it has become a research hotspot for probabilistic generative models, especially with the development of big data. In this paper, we first present a general variational inference framework for probabilistic generative models and analyze the parameter learning process of models based on variational inference. Then, we give the framework of analytic optimization of variational inference for the conditionally conjugate exponential family, and introduce stochastic variational inference based on this framework, which scales to big data through a stochastic strategy. Furthermore, we provide the framework of black-box variational inference for general probabilistic generative models, which trains the parameters of the variational distributions with stochastic gradients, and we analyze how different variational inference algorithms are realized under this framework. Finally, we summarize structured variational inference, which improves inference accuracy by enriching the variational distributions with different strategies. In addition, we discuss future development trends of variational inference for probabilistic generative models.
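    All of the frameworks surveyed above optimize the same objective, the evidence lower bound (ELBO). Its standard textbook form, with observations x, latent variables z, and variational distribution q, is:

```latex
\log p(x) \;\ge\; \mathcal{L}(q)
          \;=\; \mathbb{E}_{q(z)}\!\left[\log p(x, z)\right]
          \;-\; \mathbb{E}_{q(z)}\!\left[\log q(z)\right]
```

    Maximizing L(q) is equivalent to minimizing KL(q(z) || p(z|x)), which is why richer variational families, as in structured variational inference, yield more accurate posteriors.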
    Abstract:
    Data stream classification is one of the most important tasks in data mining. A classifier's performance degrades under concept drift, even on seemingly stationary data, and the problem becomes more challenging in data streams. The extreme learning machine (ELM) is widely used in data stream classification. However, its parameters must be determined in advance, which is unsuitable for data streams, since fixed parameters cannot adapt to changes in the concept or in the data distribution over time. To tackle this problem, this paper proposes an adaptive online sequential extreme learning machine algorithm. It outperforms existing approaches in both classification results and adaptability to concept drift. It has an adjustable mechanism for model complexity, which improves classification performance, and it is robust to concept drift through adaptive learning based on a forgetting factor and concept drift detection. In addition, the proposed algorithm can detect anomalies to prevent classification decision boundaries from being ruined. Extensive experiments demonstrate that the proposed approach outperforms its competitors in stability, classification accuracy, and adaptive ability. Moreover, the effectiveness of the proposed mechanisms is verified through ablation experiments.
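    The forgetting-factor mechanism has a well-known recursive-least-squares form in online sequential ELMs; the update below is that standard form, shown for orientation only (the paper's adaptive mechanisms on top of it are not reproduced):

```python
import numpy as np

def fos_elm_update(P, beta, H, T, lam=0.98):
    """One online update of an OS-ELM with forgetting factor `lam`.
    P: (L,L) covariance-like matrix, beta: (L,m) output weights,
    H: (n,L) hidden-layer outputs for the new chunk, T: (n,m) targets."""
    # Gain matrix: how strongly the new chunk corrects the model.
    S = lam * np.eye(H.shape[0]) + H @ P @ H.T
    K = P @ H.T @ np.linalg.inv(S)
    beta = beta + K @ (T - H @ beta)
    # lam < 1 discounts old evidence, so drifted concepts fade away.
    P = (P - K @ H @ P) / lam
    return P, beta
```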
    Abstract:
    The popularity of mobile positioning terminals makes users' locations easily accessible, contributing huge amounts of trajectory data. Universal companion pattern mining aims to discover moving objects whose behavior paths overlap heavily in the spatio-temporal dimensions, and providing effective and efficient mining methods on large-scale trajectories is both valuable and challenging. A centralized mining strategy clearly cannot scale to the huge and growing volume of trajectory data. Existing distributed mining frameworks are weak both at providing effective input for efficient pattern mining and at processing the large number of loose connections in massive trajectories; both weaknesses must be addressed to improve mining performance. In this study, we propose DMFUCP, a distributed two-stage mining framework that embeds optimizations for data preprocessing and loose-connection analysis to deliver more efficient and effective universal companion pattern mining. In the data preprocessing stage, we design a density clustering algorithm, DBSCANCD, and a clustering balance algorithm, TCB, to feed the mining tasks high-quality trajectory data with less noise. In the mining stage, we propose a G-pruning repartition algorithm, GSPR, and a segmented enumeration algorithm, SAE. GSPR introduces a parameter G to segment long trajectories and then repartitions all segments to process loose connections more effectively. SAE guarantees mining performance through multithreading and forward closure. Compared with existing companion pattern mining frameworks on real datasets, DMFUCP reduces the time required to mine each set of universal companion patterns by 20% to 40% while providing better pattern discovery capabilities.
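    The segment-then-repartition step can be pictured with a toy sketch; the partition key, the worker count, and the data layout are illustrative guesses, not GSPR's actual scheme:

```python
def segment_and_repartition(trajectories, G):
    """Hypothetical sketch of GSPR-style preprocessing: cut each long
    trajectory into segments of at most G points, then group segments
    by a partition key so loosely connected pieces land on the same
    worker and can be pruned locally."""
    partitions = {}
    for tid, points in trajectories.items():
        for i in range(0, len(points), G):
            seg = tuple(points[i:i + G])
            key = hash(seg[0]) % 16          # toy partitioner, 16 workers
            partitions.setdefault(key, []).append((tid, seg))
    return partitions
```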
    Abstract:
    The extended belief rule-based system, constructed in a data-driven manner, can handle uncertainty problems involving both quantitative data and qualitative knowledge. It has been widely researched and applied in recent years, but has rarely been applied to incomplete data. This study investigates the performance of the extended belief rule-based system on incomplete datasets and proposes a novel reasoning approach for the case of missing data. First, a disjunctive extended rule base is constructed, and the optimal number of referential values for the antecedent attributes is discussed through validation experiments. Then, a method for generating a disjunctive belief rule base from incomplete data is proposed, and an attenuation factor is introduced to modify the weights of incomplete rules so that information is aggregated more reasonably. Finally, experiments are conducted on several commonly used UCI datasets, with various degrees and patterns of missing data, to validate the improvement of the proposed method; the performance of the improved system is analyzed and compared with several conventional mechanisms. The comparison shows that the new method not only performs well on complete datasets but also achieves better and more stable inference on datasets with different degrees and patterns of missing data.
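    One simple way to realize an attenuation factor is to discount a rule's weight by how incomplete the record that generated it was. The exponential form below is a guess for illustration; the paper's actual formula may differ:

```python
def attenuated_weight(base_weight, n_missing, alpha=0.9):
    """Illustrative attenuation: exponentially discount the weight of
    a rule generated from a record with `n_missing` missing antecedent
    attributes, so incomplete rules contribute less to aggregation."""
    return base_weight * alpha ** n_missing

# A rule from a complete record keeps its weight; one missing
# attribute scales it by 0.9, two by 0.81, and so on.
```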
    Abstract:
    Deep learning models have been widely used in healthcare prediction tasks and have achieved good results in recent years. However, they often face insufficient labeled training data, overall data distribution shift, and category-level data distribution shift, all of which reduce model accuracy. To address these problems, we propose an unsupervised domain adaptation method based on metric learning (additive margin softmax based adversarial domain adaptation, AMS-ADA). First, the method extracts features with an attention-based long short-term memory network. Second, it introduces the idea of generative adversarial networks and reduces the overall data distribution shift via adversarial domain adaptation. Third, it introduces the idea of metric learning and further reduces the category-level data distribution shift by enlarging the decision margin in angular space. Together, these improve the effect of domain adaptation and the accuracy of the model. We evaluate on the task of ICU patient mortality prediction using real-world healthcare datasets. Experimental results show that, compared with other baseline models, our method better mitigates data distribution shift and achieves higher classification accuracy.
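    The additive margin softmax underlying AMS-ADA has a standard published form (scale s, margin m, N training samples, theta_j the angle between a feature and class weight j); this is the generic formulation, not a detail unique to this paper:

```latex
\mathcal{L}_{\mathrm{AMS}} = -\frac{1}{N}\sum_{i=1}^{N}
\log \frac{e^{\,s(\cos\theta_{y_i} - m)}}
          {e^{\,s(\cos\theta_{y_i} - m)} + \sum_{j \ne y_i} e^{\,s\cos\theta_j}}
```

    Subtracting m from the target-class cosine forces classes apart by a fixed angular margin, which is what shrinks the category-level distribution shift.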
    Abstract:
    Non-standard sign language expressions in daily life are ambiguous, and the semantics of substandard gestures are easily confused. At the same time, it is difficult to obtain sufficient features to train a sign language recognition model from finite samples, and an overly complex model easily overfits, leading to low recognition accuracy. To solve this problem, we propose a representation learning method that expands the fault-tolerant features for substandard sign language recognition with finite samples. Based on human skeleton information and oriented to the spatio-temporal correlation of sign language, the method constructs an autoencoder to extract standard features from a small number of original samples in a sign language corpus; a large number of substandard samples are then generated from the standard features by generative adversarial networks, and the fault-tolerant features are extended by the autoencoder to construct a new feature set for subsequent sign language recognition tasks. Experimental results show that, under limited samples, the samples generated by this method have clear semantics, and the features of different semantics in the new feature set are easy to separate. Using this method to build a fault-tolerant feature set on the CSL dataset, the trained sign language recognition model achieves 97.5% recognition accuracy, indicating broad application prospects.
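    Structurally, the pipeline is autoencoder -> generator -> autoencoder. The sketch below shows only that wiring, with invented dimensions and no training loops; it is a shape-level illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

D_IN, D_LATENT, D_NOISE = 256, 32, 16   # illustrative sizes

# Autoencoder that distills "standard" features from few samples.
encoder = nn.Sequential(nn.Linear(D_IN, 128), nn.ReLU(),
                        nn.Linear(128, D_LATENT))
decoder = nn.Sequential(nn.Linear(D_LATENT, 128), nn.ReLU(),
                        nn.Linear(128, D_IN))

# Generator mapping noise + a standard feature to a "substandard" sample.
generator = nn.Sequential(nn.Linear(D_LATENT + D_NOISE, 128), nn.ReLU(),
                          nn.Linear(128, D_IN))

def synthesize(standard_x, n_variants=8):
    """Generate substandard variants of one standard sample, then
    re-encode them to extend the fault-tolerant feature set."""
    z = encoder(standard_x).detach()
    noise = torch.randn(n_variants, D_NOISE)
    fakes = generator(torch.cat([z.expand(n_variants, -1), noise], dim=1))
    return encoder(fakes)   # new fault-tolerant features

feats = synthesize(torch.randn(D_IN))   # (8, D_LATENT) feature set
```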
    Abstract:
    Facial information and voice cues are among the most natural and flexible channels in human-computer interaction, and recent research is paying more attention to intelligent cross-modal perception between the face and voice modalities. Nevertheless, most existing methods fail to perform well on challenging cross-modal face-voice matching tasks, mainly due to the combined difficulty of the semantic gap and modality heterogeneity. In this paper, we propose an efficient cross-modal face-voice matching network based on a double-stream architecture and a bi-quintuple loss, whose derived feature representations adapt well to four challenging cross-modal matching tasks between faces and voices. First, we introduce a novel modality-shared multi-modal weighted residual network to model the face-voice association, embedding it on the top layer of our double-stream network. Then, a bi-quintuple loss is newly proposed to significantly improve data utilization while enhancing the generalization ability of the network model. Further, we predict the identity (ID) of each person during training, which supervises the discriminative feature learning process. As a result, discriminative cross-modal representations are learned for the different matching tasks. Across the four cross-modal matching tasks, extensive experiments show that the proposed approach outperforms state-of-the-art methods by a large margin of up to 5%.
    Abstract:
    Smart devices, such as smartphones, record their geographical position and optical parameters in image files when capturing photos and videos. We can extract this information to restore the sector-shaped geographical area covered by each image, known as its field of view (FOV). We store images together with their FOVs to support spatial queries on images. Typically, a user circles a region on the map as a query, and the system returns the images that capture that region; in effect, the query finds the FOVs that intersect the query region. To make FOV queries fast, an index is needed, but existing indexes exploit the sector shape inadequately. We design a convex polygon tree that indexes pentagons approximating the sectors. Each tree node is a k*-convex polygon that encloses a set of polygons: the polygon may have at most k sides, and the difference between the polygon and its elements should be minimized. We propose a submerging algorithm to compute such k*-convex polygons. To build a convex polygon tree, we insert FOVs iteratively, each time selecting the best leaf node for insertion. The selection criteria are to keep the dead space and the space increase as small as possible, and to make the overlap area between the old node and the FOV as large as possible. We also propose an FOV query algorithm based on the convex polygon tree. Experimental results show that the convex polygon tree outperforms existing indexes.
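    One concrete way to enclose a sector in a pentagon, shown below under assumed details (the paper's exact vertex construction is not specified here): keep the apex, place four vertices along the arc, and push them out just far enough that the three connecting chords clear the arc.

```python
import math

def fov_to_pentagon(x, y, theta, h, R):
    """Enclose a sector FOV -- apex (x, y), heading theta, half-angle
    h, radius R -- in a pentagon. The four arc vertices sit at radius
    R / cos(h/3), so each chord's closest point to the apex is exactly
    at distance R, guaranteeing the pentagon contains the sector."""
    r = R / math.cos(h / 3)
    angles = [theta - h, theta - h / 3, theta + h / 3, theta + h]
    return [(x, y)] + [(x + r * math.cos(a), y + r * math.sin(a))
                       for a in angles]
```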