ISSN 1000-1239 CN 11-1777/TP

    2020 Advances in Computer Architecture

    Journal of Computer Research and Development    2020, 57 (6): 1123-1124.   DOI: 10.7544/issn1000-1239.2020.qy0601
    An Energy Consumption Optimization and Evaluation for Hybrid Cache Based on Reinforcement Learning
    Fan Hao, Xu Guangping, Xue Yanbing, Gao Zan, Zhang Hua
    Journal of Computer Research and Development    2020, 57 (6): 1125-1139.   DOI: 10.7544/issn1000-1239.2020.20200010
    The emerging non-volatile memory STT-RAM offers low leakage power, high density, and fast reads, but incurs high write energy. SRAM, by contrast, has high leakage power and low density, but offers fast reads and writes and low write energy. A hybrid cache that combines SRAM and STT-RAM exploits the advantages of both memory media, providing lower leakage power and higher cell density than SRAM, and higher write speed and lower write energy than STT-RAM. The hybrid cache architecture obtains both benefits mainly by placing write-intensive data in SRAM and read-intensive data in STT-RAM. Identifying and allocating read-intensive and write-intensive data is therefore the key challenge in hybrid cache design. This paper proposes a cache management method based on reinforcement learning that uses the write intensity and reuse information of cache access requests to design a cache allocation policy and optimize energy consumption. The key idea is to use a reinforcement learning algorithm to learn, from the energy consumption of cache line sets, a weight for allocating each set to SRAM or STT-RAM. The algorithm allocates a cache line in a set to the region with the greater weight. Evaluations show that the proposed policy reduces average energy consumption by 16.9% in a single-core system and by 9.7% in a quad-core system compared with previous policies.
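The idea of learning a per-set weight for each region and allocating to the region with the greater weight can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the class name `SetAllocator`, the energy constants, and the moving-average update rule are all hypothetical stand-ins for whatever reinforcement learning formulation the authors use.

```python
from dataclasses import dataclass, field

REGIONS = ("SRAM", "STT-RAM")

# Hypothetical energy costs in arbitrary units (NOT values from the paper).
# They only encode the qualitative trade-off described in the abstract:
# SRAM leaks more, STT-RAM writes cost more.
ENERGY = {
    "SRAM": {"leak": 4.0, "read": 1.0, "write": 1.0},
    "STT-RAM": {"leak": 0.5, "read": 1.0, "write": 5.0},
}


@dataclass
class SetAllocator:
    """Per-set region weights learned from observed energy (illustrative only)."""

    learning_rate: float = 0.5
    weights: dict = field(default_factory=lambda: {r: 0.0 for r in REGIONS})

    def choose_region(self) -> str:
        # Allocate to the region with the greater weight; ties go to SRAM.
        return max(REGIONS, key=lambda r: self.weights[r])

    def update(self, region: str, reads: int, writes: int) -> None:
        # Reward is the negative energy a line consumed while it lived in `region`.
        cost = ENERGY[region]
        energy = cost["leak"] + reads * cost["read"] + writes * cost["write"]
        reward = -energy
        # Assumed update rule: move the weight a fraction of the way to the reward.
        old = self.weights[region]
        self.weights[region] = old + self.learning_rate * (reward - old)
```

Under these assumed costs, a set that observes write-heavy lines in both regions ends up preferring SRAM, while a set that observes read-only lines ends up preferring STT-RAM, which matches the placement goal stated in the abstract.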
    Optimizing Winograd-Based Fast Convolution Algorithm on Phytium Multi-Core CPUs
    Wang Qinglin, Li Dongsheng, Mei Songzhu, Lai Zhiquan, Dou Yong
    Journal of Computer Research and Development    2020, 57 (6): 1140-1151.   DOI: 10.7544/issn1000-1239.2020.20200107
    Convolutional neural networks (CNNs) are widely used in artificial intelligence fields such as computer vision and natural language processing. Winograd-based fast convolution algorithms effectively reduce the computational complexity of the convolution operations in CNNs and have therefore attracted great attention. As Phytium multi-core CPUs, independently developed by the National University of Defense Technology, are increasingly applied in artificial intelligence, there is strong demand for high-performance convolution primitives on these CPUs. After studying the architectural characteristics of Phytium multi-core CPUs and the computing characteristics of Winograd-based fast convolution algorithms, this paper proposes a new high-performance parallel Winograd-based fast convolution algorithm. The new parallel algorithm does not rely on general matrix multiplication routines and consists of four stages: kernel transformation, input feature map transformation, element-wise multiplication, and output feature map inverse transformation. Data movement in all four stages is collaboratively optimized to improve the memory access performance of the algorithm. Custom data layouts, multi-level parallel data transformation algorithms, and a multi-level parallel matrix multiplication algorithm are also proposed to support this optimization efficiently. The algorithm is tested on two Phytium multi-core CPUs. Compared with the Winograd-based fast convolution implementations in ARM Compute Library (ACL) and NNPACK, the algorithm achieves speedups of 1.05~16.11 times and 1.66~16.90 times, respectively. Applying the algorithm in the open-source framework MXNet improves the forward-propagation performance of the VGG16 network by 3.01~6.79 times.
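The four stages named in the abstract can be seen in miniature in the textbook one-dimensional Winograd F(2,3) algorithm, which produces two outputs of a 3-tap filter with four multiplications instead of six. The sketch below is only that textbook case in plain Python; it is not the paper's two-dimensional, multi-level parallel implementation for Phytium CPUs, and all function names are chosen here for illustration.

```python
def transform_kernel(g):
    """Stage 1: kernel transformation (G @ g) for a 3-tap filter."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]


def transform_input(d):
    """Stage 2: input transformation (B^T @ d) for a 4-element input tile."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def inverse_transform_output(m):
    """Stage 4: output inverse transformation (A^T @ m), giving 2 outputs."""
    m0, m1, m2, m3 = m
    return [m0 + m1 + m2, m1 - m2 - m3]


def winograd_f23(d, g):
    """Compute 2 outputs of a 3-tap filter over a 4-element tile."""
    u = transform_kernel(g)
    v = transform_input(d)
    # Stage 3: element-wise multiplication (only 4 multiplications).
    m = [ui * vi for ui, vi in zip(u, v)]
    return inverse_transform_output(m)


def direct_conv(d, g):
    """Reference: direct sliding-window computation (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

For any tile `d` and filter `g`, `winograd_f23(d, g)` should agree with `direct_conv(d, g)` up to floating-point rounding. In a CNN the kernel transformation is typically computed once and reused across all input tiles, which is why it forms a separate stage.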
    Efficient Optimization of Graph Computing on High-Throughput Computer
    Zhang Chenglong, Cao Huawei, Wang Guobo, Hao Qinfen, Zhang Yang, Ye Xiaochun, Fan Dongrui
    Journal of Computer Research and Development    2020, 57 (6): 1152-1163.   DOI: 10.7544/issn1000-1239.2020.20200115
    With the rapid development of computing technology, the scale of graphs has grown explosively, and large-scale graph computing has become a research focus in recent years. Breadth-first search (BFS) is a classic algorithm for the graph traversal problem. It is the main kernel of the Graph500 benchmark, which evaluates the performance of supercomputers and servers on data-intensive applications. The high-throughput computer (HTC) adopts an ARM-based many-core architecture characterized by high concurrency, strong real-time performance, and low power consumption. Optimization of the BFS algorithm has made considerable progress on single-node systems. In this paper, we first introduce the parallel BFS algorithm and existing optimizations. We then propose two optimization techniques for HTC that improve data access efficiency and data locality. We systematically evaluate the performance of the BFS algorithm on HTC. For the Kronecker graph with scale 30 (2^30 vertices and 2^34 edges), the average performance on HTC is 24.26 GTEPS, which is 1.18 times faster than the two-way x86 server. In terms of energy efficiency, the result on HTC is 181.04 MTEPS/W, which ranks 2nd on the June 2019 Green Graph500 Big Data list. To the best of our knowledge, this is the first work that evaluates BFS performance on the HTC platform. HTC is suitable for data-intensive applications such as large-scale graph computing.
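As a reminder of what the BFS kernel computes, here is a minimal serial, level-synchronous BFS that returns a parent array, together with the arithmetic behind the graph size quoted above. This is a generic textbook sketch, not the optimized parallel algorithm evaluated in the paper. The function names are hypothetical, and the default edge factor of 16 is an assumption chosen because it reproduces the 2^34 edges stated for scale 30.

```python
def bfs_parents(adj, root):
    """Level-synchronous BFS over adjacency lists; returns the parent of each vertex.

    Unreached vertices keep parent -1; the root is its own parent.
    """
    parent = [-1] * len(adj)
    parent[root] = root
    frontier = [root]
    while frontier:
        next_frontier = []
        # Expand every vertex in the current level before moving to the next.
        for u in frontier:
            for v in adj[u]:
                if parent[v] == -1:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent


def kronecker_size(scale, edge_factor=16):
    """Vertex and edge counts for a graph with 2**scale vertices.

    edge_factor=16 is an assumed default, not a value stated in the abstract.
    """
    num_vertices = 2 ** scale
    return num_vertices, edge_factor * num_vertices
```

With these definitions, `kronecker_size(30)` gives 2^30 vertices and 2^34 edges, matching the graph described in the abstract. In a parallel implementation the inner loops over the frontier are what gets distributed across cores, which is where data access efficiency and locality become the limiting factors.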
    Programming and Developing Environment for FPGA Graph Processing: Survey and Exploration
    Guo Jinyang, Shao Chuanming, Wang Jing, Li Chao, Zhu Haojin, Guo Minyi
    Journal of Computer Research and Development    2020, 57 (6): 1164-1178.   DOI: 10.7544/issn1000-1239.2020.20200106
    Owing to their high performance and efficiency, graph processing accelerators based on the reconfigurable field-programmable gate array (FPGA) architecture have attracted much attention, as they can serve complex graph applications with diverse basic operations and large-scale graph data. However, designing efficient code for FPGAs takes a long time, while existing functional programming environments cannot achieve desirable performance. The programming wall on FPGAs is therefore significant and has become a serious obstacle to designing dedicated accelerators. A well-designed programming environment is necessary for the wider adoption of FPGA-based graph processing accelerators. Such an environment calls for convenient application programming interfaces, scalable application programming models, efficient high-level synthesis tools, and a domain-specific language that can integrate software and hardware features and generate high-performance underlying code. In this article, we systematically explore the programming environment for FPGA graph processing. We mainly introduce and analyze programming models, high-level synthesis, programming languages, and the related hardware frameworks. We also introduce the domestic and international development of FPGA-based graph processing accelerators. Finally, we discuss the open issues and challenges in this area.
    A Cross-Layer Memory Tracing Toolkit for Big Data Application Based on Spark
    Xu Danya, Wang Jing, Wang Li, Zhang Weigong
    Journal of Computer Research and Development    2020, 57 (6): 1179-1190.   DOI: 10.7544/issn1000-1239.2020.20200109
    Spark has recently been increasingly adopted in industry for big data analytics because of its efficient in-memory distributed programming model. Most existing optimization and analysis tools for Spark operate at either the application layer or the operating system layer separately, which disconnects Spark semantics from the underlying actions. For example, without knowing how operating system parameters affect performance at the Spark layer, one cannot know how to use those parameters to tune system performance. In this paper, we propose SMTT, a new Spark memory tracing toolkit, which establishes the semantic link between the upper-level application and the underlying physical hardware across the Spark layer, JVM layer, and OS layer. Based on the characteristics of Spark memory, we design separate tracing schemes for execution memory and storage memory. We then use SMTT to analyze the Spark iterative computation process and execution/storage memory usage. An experiment on RDD memory assessment analysis shows that the toolkit can be used effectively for performance analysis and can guide optimization of the Spark memory system.
    Performance Optimization of Cache Subsystem in General Purpose Graphics Processing Units: A Survey
    Zhang Jun, Xie Jingcheng, Shen Fanfan, Tan Hai, Wang Lümeng, He Yanxiang
    Journal of Computer Research and Development    2020, 57 (6): 1191-1207.   DOI: 10.7544/issn1000-1239.2020.20200113
    With advances in process technology and improvements in architecture, the parallel computing performance of GPGPUs (general purpose graphics processing units) has improved greatly, and GPGPUs are applied more and more widely in high-performance and high-throughput fields. A GPGPU achieves high parallel computing performance because it hides the long latency of memory accesses by supporting thousands of concurrent threads. Because some applications exhibit irregular computation and memory access, the performance of the memory subsystem suffers considerably; in particular, contention for the on-chip cache can become serious, so the GPGPU cannot reach its peak performance. Alleviating this contention and optimizing the performance of the on-chip cache has become one of the main approaches to optimizing GPGPUs. At present, studies on performance optimization of the on-chip cache focus on five aspects: TLP (thread level parallelism) throttling, memory access reordering, data flux enhancement, LLC (last level cache) optimization, and new architecture designs based on NVM (non-volatile memory). This paper discusses research methods for on-chip cache performance optimization from these five aspects. Finally, some promising future research directions in on-chip cache optimization are discussed. This survey provides a useful reference for research on the cache subsystem in GPGPUs.