ISSN 1000-1239 CN 11-1777/TP



    A Low-Latency Storage Engine with Low CPU Overhead
    Liao Xiaojian, Yang Zhe, Yang Hongzhang, Tu Yaofeng, Shu Jiwu
    Journal of Computer Research and Development    2022, 59 (3): 489-498.   DOI: 10.7544/issn1000-1239.20210574
    The latency of solid-state drives (SSDs) has improved dramatically in recent years; for example, an ultra-low-latency SSD can process 4 KB of data in 10 microseconds. At such low latency, how to reap I/O completions efficiently becomes an important issue in modern storage systems. Traditional storage systems reap I/O completions through hardware interrupts, which introduce extra context-switch overhead and further prolong overall I/O latency. Existing work uses polling as an alternative to hardware interrupts, eliminating the context switches but at the cost of high CPU consumption. This paper proposes a CPU-efficient and low-latency storage engine, named NIO, to take full advantage of ultra-low-latency SSDs. The key idea of NIO is to separate the I/O path of short I/Os from that of long I/Os. NIO uses classic hardware interrupts for long I/Os, since polling long I/Os brings no significant improvement but incurs huge CPU overhead; for short I/Os, NIO introduces lazy polling, which lets the I/O thread sleep for a variable time interval before polling continuously, thereby achieving low latency with low CPU consumption. NIO further introduces a transaction-aware I/O reaping mechanism to reduce transaction latency, and a dynamic adjustment mechanism to cope with dynamic changes in the workload and in the device's internal activities. Under dynamic workloads, NIO shows performance comparable to a polling-based storage engine while reducing CPU consumption by at least 59%.
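    The lazy-polling idea can be pictured in a few lines. The Python sketch below is illustrative only: the function names, the fixed poll budget, and the sleep-then-spin structure are assumptions based on the abstract, not NIO's actual interface.

        import time

        def reap_short_io(completion_ready, sleep_us, poll_budget_us=1000):
            # Sleep first: the device will take roughly sleep_us anyway, so
            # yield the CPU instead of spinning; NIO tunes this interval
            # dynamically as the workload and device activity change.
            time.sleep(sleep_us / 1e6)
            # Then busy-poll only for the residual tail of the I/O latency.
            deadline = time.monotonic() + poll_budget_us / 1e6
            while time.monotonic() < deadline:
                if completion_ready():
                    return True
            return False  # timed out; a real engine might fall back to interrupts

    Long I/Os would bypass this path entirely and use ordinary interrupt-driven completion, since spinning on a long transfer wastes CPU for no latency gain.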
    A Scalable Timestamp-Based Durable Software Transactional Memory
    Liu Chaojie, Wang Fang, Zou Xiaomin, Feng Dan
    Journal of Computer Research and Development    2022, 59 (3): 499-517.   DOI: 10.7544/issn1000-1239.20210565
    Emerging non-volatile memory (NVM) offers many advantages, including byte-addressability, durability, large capacity, and low energy consumption. However, concurrent programming on NVM is difficult, because users must ensure not only crash consistency but also the correctness of concurrent execution. Persistent transactional memory has been proposed to reduce this development burden, but most existing designs scale poorly. Through testing, we find that the factors limiting scalability are the global logical clock and redundant NVM writes. To eliminate the impact of these two factors, a thread logical clock method is proposed, which removes the centralization of the global logical clock by giving each thread an independent clock, and a cache-line-aware dual-version method is proposed, which maintains two versions of the data and updates them alternately to ensure crash consistency, thereby eliminating redundant NVM writes. Based on these two methods, a scalable durable transactional memory (SDTM) is implemented and thoroughly evaluated. The results show that under YCSB workloads, compared with DudeTM and PMDK, its performance is up to 2.8 times and 29 times higher, respectively.
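    The thread logical clock can be sketched as follows. This is a minimal Python illustration assuming a (thread_id, count) version stamp and a per-transaction snapshot; the class and method names are hypothetical, not SDTM's API.

        class PerThreadClocks:
            def __init__(self, n_threads):
                self.counts = [0] * n_threads   # one private counter per thread

            def commit(self, tid):
                # Uncontended: only thread tid ever writes counts[tid], so
                # commits no longer serialize on one shared global counter.
                self.counts[tid] += 1
                return (tid, self.counts[tid])  # version stamp for the write set

            def begin_snapshot(self):
                return list(self.counts)        # taken once at transaction start

            @staticmethod
            def visible(version, snapshot):
                tid, n = version
                return n <= snapshot[tid]       # committed before the snapshot

    The cache-line-aware dual-version method complements this: each datum keeps two copies that are updated in turn, so the previous consistent copy doubles as the recovery version and no separate log write to NVM is needed.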
    PPO-Based Automated Quantization for ReRAM-Based Hardware Accelerator
    Wei Zheng, Zhang Xingjun, Zhuo Zhimin, Ji Zeyu, Li Yonghao
    Journal of Computer Research and Development    2022, 59 (3): 518-532.   DOI: 10.7544/issn1000-1239.20210551
    Convolutional neural networks (CNNs) have already exceeded human capabilities in many fields. However, as the memory consumption and computational complexity of CNNs continue to grow, the “memory wall” problem, which constrains data exchange between the processing unit and the memory unit, impedes their deployment in resource-constrained environments such as edge computing and the Internet of Things. ReRAM (resistive RAM)-based hardware accelerators have been widely applied to accelerate matrix-vector multiplication thanks to their high density and low power, but they are ill-suited to 32-bit floating-point computation, raising the demand for quantization to reduce data precision. Manually determining the bitwidth for each layer is time-consuming, so recent studies leverage DDPG (deep deterministic policy gradient) to perform automated quantization on FPGA (field-programmable gate array) platforms; however, DDPG must convert continuous actions into discrete ones, and resource constraints are met by manually decreasing the bitwidth of each layer. This paper proposes a PPO (proximal policy optimization)-based automated quantization method for a ReRAM-based hardware accelerator, which uses a discrete action space to avoid the action-space conversion step. We define a new reward function that lets the PPO agent automatically learn the optimal quantization policy satisfying the resource constraints, and we describe software-hardware modifications to support mixed-precision computing. Experimental results show that compared with coarse-grained quantization, the proposed method reduces hardware cost by 20%~30% with negligible loss of accuracy. Compared with other automated quantization methods, it has a shorter search time and further reduces hardware cost by about 4.2% under the same resource constraints. This provides insights for the co-design of quantization algorithms and hardware accelerators.
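    To make the constrained objective concrete, here is one plausible reward shape, sketched in Python. It is a hedged illustration only: the paper defines its own reward function, and the penalty form and candidate bitwidths below are assumptions.

        def quant_reward(accuracy, hw_cost, budget, penalty=1.0):
            # Reward accuracy directly while the mixed-precision network fits
            # the hardware budget; otherwise subtract a penalty proportional
            # to the relative overshoot, so the agent learns to satisfy the
            # resource constraint without any manual bitwidth trimming.
            if hw_cost <= budget:
                return accuracy
            return accuracy - penalty * (hw_cost / budget - 1.0)

        # Discrete action space: the agent directly picks one bitwidth per
        # layer, so no continuous-to-discrete conversion step is needed.
        CANDIDATE_BITWIDTHS = [2, 4, 6, 8]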
    Energy-Efficient Floating-Point Memristive In-Memory Processing System Based on Self-Selective Mantissa Compaction
    Ding Wenlong, Wang Chengning, Tong Wei
    Journal of Computer Research and Development    2022, 59 (3): 533-552.   DOI: 10.7544/issn1000-1239.20210580
    Matrix-vector multiplication (MVM) is a key computing kernel in high-performance scientific computing. Recent work by Feinberg et al. proposed a method for deploying high-precision operands on memristive crossbars, showing great potential for accelerating scientific MVM. Since different types of scientific applications have different precision requirements, providing computation methods tailored to a specific application is an effective way to further reduce energy consumption. This paper proposes a system with mantissa-compaction and alignment-optimization strategies. While implementing the basic function of high-precision floating-point memristive MVM, the proposed system can also select the number of compacted mantissa bits according to the application's precision requirements. By skipping, during computation, the activation of the low-order crossbars holding the least-significant mantissa bits as well as the redundant alignment crossbars, the energy consumption of the computational crossbars and peripheral circuits is significantly reduced. The evaluation shows that when crossbar-based in-memory solvers for sparse linear systems reach an average solving residual on the order of 0 to 10^{-3} relative to the software baseline, the average energy consumption of the computational crossbars and the peripheral analog-to-digital converters is reduced by 5%~65% and 30%~55%, respectively, compared with existing work without these optimizations.
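    The numerical effect of mantissa compaction can be mimicked in software. In the minimal, illustrative Python sketch below, zeroing the low mantissa bits of an IEEE-754 float64 approximates not activating the crossbars that store the least-significant mantissa slices (the real hardware's bit-slicing layout is more involved):

        import struct

        def compact_mantissa(x: float, kept_bits: int) -> float:
            # Keep only the top kept_bits of the 52-bit float64 mantissa.
            assert 0 <= kept_bits <= 52
            raw = struct.unpack('<Q', struct.pack('<d', x))[0]
            mask = ~((1 << (52 - kept_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
            return struct.unpack('<d', struct.pack('<Q', raw & mask))[0]

        # pi with the low 32 mantissa bits cleared: still ~3.14159,
        # with truncation error below 2^-19.
        print(compact_mantissa(3.141592653589793, 20))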
    Endurance Aware Out-of-Place Update for Persistent Memory
    Cai Changxing, Du Yajuan, Zhou Taiyu
    Journal of Computer Research and Development    2022, 59 (3): 553-567.   DOI: 10.7544/issn1000-1239.20210541
    Persistent memory has excellent characteristics such as non-volatility, byte-addressability, fast random read/write speed, low energy consumption, and good scalability, offering new opportunities for big data storage and processing. However, the crash-consistency problem of persistent memory systems poses challenges to their widespread application. Existing work on crash-consistency guarantees usually pays the cost of extra reads and writes, which affects the performance and lifetime of persistent memory systems in both the time and space dimensions. To reduce this impact, an endurance-aware out-of-place update scheme for persistent memory (EAOOP) is proposed. Through software-transparent out-of-place updates, EAOOP provides endurance-aware memory management, alternately flushing data to the original data region and the updated data region. EAOOP not only guarantees the system's crash consistency but also avoids redundant data-merging operations. Meanwhile, to use memory space efficiently, lightweight garbage collection runs in the background to process old data in the updated data region, reducing extra write amplification and bandwidth occupation and thereby further lessening the impact on the lifetime and performance of persistent memory. Evaluations show that EAOOP achieves higher performance and lower overhead than existing work: transaction throughput is increased by 1.6 times, while critical-path latency and the number of writes are reduced by a factor of 1.3.
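    The alternating update can be sketched as follows, assuming, purely for illustration, one pair of slots per logical block; the real scheme manages whole regions of persistent memory and adds cache-line flushes and fences.

        class OutOfPlaceBlock:
            def __init__(self, initial=None):
                self.slots = [initial, initial]  # original / updated data region
                self.active = 0                  # last durably committed slot

            def update(self, value):
                target = 1 - self.active
                self.slots[target] = value       # persist + fence would go here
                self.active = target             # single atomic flip = commit point

            def read(self):
                return self.slots[self.active]

    Because a write never overwrites the last committed copy, a crash in the middle of update() leaves the previous version intact, so no undo/redo log or data merge-back is required; background garbage collection then reclaims the stale copies.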
    DRAM-Based Victim Cache for Page Migration Mechanism on Heterogeneous Main Memory
    Pei Songwen, Qian Yihuan, Ye Xiaochun, Liu Haikun, Kong Linghe
    Journal of Computer Research and Development    2022, 59 (3): 568-581.   DOI: 10.7544/issn1000-1239.20210567
    When massive data accesses hit heterogeneous memory systems, memory pages frequently migrate between DRAM and NVM. However, traditional page migration strategies struggle to adapt to rapid dynamic changes between “hot” and “cold” memory pages: “cold” pages just migrated from DRAM to NVM may become “hot” again, resulting in a large number of redundant migrations as well as false migrations. Previous research focuses only on pages being migrated, paying little attention to pages waiting in the migration queue or pages that have already been migrated. This paper therefore proposes VC-HMM, a page migration mechanism for heterogeneous memory based on a small-capacity DRAM victim cache added between DRAM and PCM. “Cold” pages are migrated from DRAM to the victim cache, which avoids the redundant migrations caused by main-memory pages turning hot again shortly after eviction. Meanwhile, some pages never need to be written back to PCM at all, which reduces PCM write operations and extends PCM lifetime. In addition, VC-HMM can automatically tune its migration parameters for different workloads, making migration decisions more reasonable. Experimental results show that compared with other migration strategies (CoinMigrator, MQRA, THMigrator), VC-HMM reduces the average number of PCM writes by 62.97%, average access latency by 22.72%, re-migration count by 38.37%, and energy consumption by 3.40%.
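    The victim cache's role can be sketched with an LRU dictionary. The interface below is an assumption based on the abstract, not VC-HMM's implementation:

        from collections import OrderedDict

        class VictimCache:
            def __init__(self, capacity):
                self.capacity = capacity
                self.pages = OrderedDict()   # insertion order approximates LRU

            def demote(self, page_id, data):
                # A "cold" page evicted from DRAM parks here instead of going
                # straight to PCM; only pages that stay cold reach PCM.
                self.pages[page_id] = data
                self.pages.move_to_end(page_id)
                if len(self.pages) > self.capacity:
                    return self.pages.popitem(last=False)  # truly cold: to PCM
                return None

            def promote(self, page_id):
                # The page turned hot again soon after eviction: return it to
                # DRAM from here, avoiding both a PCM write and a re-migration.
                return self.pages.pop(page_id, None)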
    Decoding Method of Reed-Solomon Erasure Codes
    Tang Dan, Cai Hongliang, Geng Wei
    Journal of Computer Research and Development    2022, 59 (3): 582-596.   DOI: 10.7544/issn1000-1239.20210575
    RS (Reed-Solomon) codes can be constructed for any fault-tolerance level demanded by the application environment, giving them good flexibility, and storage systems that use RS erasure codes for fault tolerance achieve optimal storage efficiency. However, compared with XOR (exclusive-OR)-based erasure codes, RS erasure codes take far longer to decode, which greatly hinders their use in distributed storage systems. To address this problem, this paper proposes a new decoding method for RS erasure codes. The method completely discards the matrix inversion used by all current RS decoding methods and relies only on additions and multiplications, which have lower computational complexity: the linear combination that expresses each invalid symbol in terms of the valid symbols is obtained by simple matrix transformations on a constructed decoding transform matrix, thereby reducing the complexity of the decoding calculation. The correctness of the method is proved theoretically, and it is evaluated on files of various sizes, each divided into blocks of three different sizes. Experimental results show that across all file block sizes, the new decoding method has a lower decoding time cost than the alternatives.
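    Whatever transform produces the coefficients, the per-symbol recovery itself needs only finite-field additions (XOR) and multiplications, which is where the cost advantage over inversion-based decoding comes from. A minimal Python sketch over GF(2^8) follows; the decoding transform matrix is the paper's construction and is not reproduced here, so the coefficients are placeholders.

        def gf_mul(a, b, poly=0x11d):
            # Carry-less multiply in GF(2^8), reduced by the common 0x11d polynomial.
            r = 0
            while b:
                if b & 1:
                    r ^= a
                a <<= 1
                if a & 0x100:
                    a ^= poly
                b >>= 1
            return r

        def recover_symbol(coeffs, survivors):
            # Lost symbol = GF-linear combination of surviving symbols; the
            # coefficients would come from the decoding transform matrix, and
            # addition in GF(2^8) is plain XOR.
            acc = 0
            for c, s in zip(coeffs, survivors):
                acc ^= gf_mul(c, s)
            return acc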
    Near-Data Processing-Based Parallel Compaction Optimization for Key-Value Stores
    Sun Hui, Lou Bendong, Huang Jianzhong, Zhao Yuhong, Fu Song
    Journal of Computer Research and Development    2022, 59 (3): 597-616.   DOI: 10.7544/issn1000-1239.20210577
    Large-scale unstructured data management poses unprecedented challenges to existing relational databases. Key-value stores based on the log-structured merge tree (LSM-tree) are widely used and play an essential role in data-intensive applications. The LSM-tree converts random writes into sequential ones, thereby improving write performance, but LSM-tree key-value stores also have problems. First, the system relies on compaction operations to reorganize data and keep performance balanced, yet compaction itself hurts system performance and causes serious write amplification. Second, traditional compute-centric data transfer further limits overall system performance during compaction. This paper applies the data-centric near-data processing (NDP) model to the storage system and proposes CoPro, a collaborative parallel compaction optimization for LSM-tree key-value stores. Two forms of parallelism (data parallelism and pipeline parallelism) are fully exploited to improve compaction performance. When compaction is triggered, the host-side CoPro determines the partitioning ratio of the compaction tasks according to the offloading strategy and divides the tasks by that ratio; the compaction subtasks are then dispatched to the host and device sides, respectively, through the semantic management module. We further design a decision component in both the host-side and device-side CoPro, denoted CoPro+, which dynamically adjusts the degree of parallelism according to changes in system resources and in the value sizes of the key-value pairs in the workload. Extensive experimental results validate the benefits of CoPro against two popular NDP-based key-value stores.
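    The host-side split and the CoPro+ adjustment can be sketched as below. The function names, the 0.1 step, and the thresholds are made-up illustrations of the behavior the abstract describes, not CoPro's actual policy.

        def partition_compaction(subtasks, device_ratio):
            # Offload the first share of compaction subtasks to the NDP device
            # and keep the rest on the host, per the current offloading ratio.
            k = int(len(subtasks) * device_ratio)
            return subtasks[:k], subtasks[k:]   # (device-side, host-side)

        def adjust_ratio(ratio, device_busy, avg_value_size, big_value=4096):
            # CoPro+-style dynamics: back off when the device is saturated;
            # offload more when values are large, since shipping large values
            # to the host costs more bandwidth than processing them in place.
            if device_busy:
                return max(0.0, ratio - 0.1)
            if avg_value_size > big_value:
                return min(1.0, ratio + 0.1)
            return ratio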