ISSN 1000-1239 CN 11-1777/TP

    Journal of Computer Research and Development    2021, 58 (6): 1129-1130.   DOI: 10.7544/issn1000-1239.2021.qy0601
    Agile Design of Processor Chips: Issues and Challenges
    Bao Yungang, Chang Yisong, Han Yinhe, Huang Libo, Li Huawei, Liang Yun, Luo Guojie, Shang Li, Tang Dan, Wang Ying, Xie Biwei, Yu Wenjian, Zhang Ke, Sun Ninghui
    Journal of Computer Research and Development    2021, 58 (6): 1131-1145.   DOI: 10.7544/issn1000-1239.2021.20210232
    Processor chip design currently relies on a performance-oriented method that jointly optimizes chip frequency, area, and power consumption through multi-step, repeated iterations using modern electronic design automation (EDA) techniques. This conventional methodology results in significant cost, long development cycles, and a high technical threshold. In this paper, we introduce an object-oriented architecture (OOA) paradigm, borrowing the idea from software engineering, and propose an OOA-based agile processor design methodology. Unlike the conventional performance-oriented method, the proposed OOA-based agile design method mainly aims to shorten the development cycle and to reduce cost and complexity without sacrificing performance and reliability; this is evaluated with a new metric, agile degree. OOA aims to implement a series of decomposable, composable, and extensible objects in the architectures of both general-purpose CPUs and application-specific XPUs via an object-oriented design paradigm, language, and EDA tools. We further summarize the research progress in each technical field covered by OOA and analyze the challenges that may arise in future research on OOA-based agile design methodology.
    A Proposal of Software-Hardware Decoupling Hardware Design Method for Brain-Inspired Computing
    Qu Peng, Chen Jiajie, Zhang Youhui, Zheng Weimin
    Journal of Computer Research and Development    2021, 58 (6): 1146-1154.   DOI: 10.7544/issn1000-1239.2021.20210170
    Brain-inspired computing is a novel, multidisciplinary research field that may have important implications for the development of computational neuroscience, artificial intelligence, and computer architecture. Currently, one of the key problems in this field is that brain-inspired software and hardware are usually tightly coupled. A recent study proposed the notion of neuromorphic completeness and a corresponding system hierarchy design. This completeness provides theoretical support for decoupling the hardware and software of brain-inspired computing systems, and the system hierarchy design can be viewed as a reference implementation of neuromorphic-complete software and hardware. As a position paper, this article first discusses several key concepts of neuromorphic completeness and the system hierarchy for brain-inspired computing. Then, as follow-up work, we propose a hardware design method for brain-inspired computing that decouples software from hardware, namely, an iterative optimization process consisting of execution primitive set design and hardware implementation evaluation. Finally, we report the preliminary status of our research on an FPGA-based evaluation platform. We believe that this method will contribute to the realization of extensible, neuromorphic-complete computation primitive sets and chips, which in turn helps decouple hardware and software in brain-inspired computing systems.
    Shenwei-26010: A High-Performance Many-Core Processor
    Hu Xiangdong, Ke Ximing, Yin Fei, Zhao Xin, Ma Yongfei, Yan Shiyun, Ma Chao
    Journal of Computer Research and Development    2021, 58 (6): 1155-1165.   DOI: 10.7544/issn1000-1239.2021.20201041
    Based on the multi-core processor Shenwei 1600, the high-performance many-core processor Shenwei 26010 adopts SoC (system on chip) technology and integrates 4 computing-control cores and 256 computing cores on a single chip. It uses an originally designed 64-bit RISC (reduced instruction set computer) instruction set and supports 256-bit SIMD (single instruction multiple data) integer and floating-point vector operations. Its peak double-precision floating-point performance reaches 3.168 TFLOPS. Shenwei 26010 is manufactured in a 28 nm process. The die area exceeds 500 mm², and all 260 cores can run stably at 1.5 GHz. Shenwei 26010 adopts a variety of low-power designs at the architecture, microarchitecture, and circuit levels, resulting in a peak energy efficiency of 10.559 GFLOPS/W. Notably, both the operating frequency and the energy efficiency of the chip are higher than those of contemporary processor products worldwide. Through technical innovations in high-frequency design, reliability design, and yield design, Shenwei 26010 has effectively addressed the high frequency target, the power wall, stability and reliability, and yield, all of which are issues encountered in pursuing high-performance computing. It has been deployed at large scale in the 100 PFLOPS supercomputer system "Sunway TaihuLight", and can therefore adequately meet the computing requirements of both scientific and engineering applications.
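The headline figures in this abstract can be cross-checked with simple arithmetic. The sketch below takes the core counts, SIMD width, frequency, and efficiency from the abstract; the per-core FLOP rates are assumptions (one 256-bit fused multiply-add per cycle on each computing core, and twice that rate on each computing-control core), chosen because they reproduce the quoted peak. They are not stated in the abstract.

```python
# Cross-check of the quoted Shenwei 26010 peak figures.
# Values marked "from the abstract" are quoted; everything else is an assumption.

FREQ_GHZ = 1.5            # from the abstract
COMPUTE_CORES = 256       # from the abstract
CONTROL_CORES = 4         # from the abstract
SIMD_BITS = 256           # from the abstract
DOUBLE_BITS = 64          # IEEE 754 double precision
EFFICIENCY_GFLOPS_PER_W = 10.559  # from the abstract

LANES = SIMD_BITS // DOUBLE_BITS       # 4 doubles per vector
# ASSUMPTION: one fused multiply-add (2 flops) per lane per cycle.
FLOPS_PER_CYCLE_COMPUTE = LANES * 2
# ASSUMPTION: a computing-control core sustains twice the computing-core rate.
FLOPS_PER_CYCLE_CONTROL = 2 * FLOPS_PER_CYCLE_COMPUTE


def peak_gflops() -> float:
    """Peak double-precision GFLOPS under the assumptions above."""
    per_cycle = (COMPUTE_CORES * FLOPS_PER_CYCLE_COMPUTE
                 + CONTROL_CORES * FLOPS_PER_CYCLE_CONTROL)
    return per_cycle * FREQ_GHZ


def implied_power_watts() -> float:
    """Chip power implied by the quoted peak and energy efficiency."""
    return peak_gflops() / EFFICIENCY_GFLOPS_PER_W


if __name__ == "__main__":
    print(f"peak: {peak_gflops() / 1000:.3f} TFLOPS")
    print(f"implied power at peak: {implied_power_watts():.1f} W")
```

Under these assumptions the computed peak matches the quoted 3.168 TFLOPS, and the quoted efficiency implies a power draw of roughly 300 W at peak.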
    Design and Implementation of Configurable Cache Coherence Protocol for Multi-Core Processor
    Chen Zhiqiang, Zhou Hongwei, Feng Quanyou, Deng Rangyu
    Journal of Computer Research and Development    2021, 58 (6): 1166-1175.   DOI: 10.7544/issn1000-1239.2021.20210174
    In a multi-core system, cache coherence must be maintained. Common cache coherence protocols can be divided into snoop-based and directory-based protocols. Directory-based protocols have better scalability and lower latency, and can be applied to more applications. According to the size of the directory, they can be further divided into centralized-directory and distributed-directory designs. A distributed directory takes up less space and takes less time to query. However, cache coherence based on a distributed directory is hard to design and verify. To reduce the risk in CPU design, a configurable distributed directory unit (CDDU) is proposed. It increases the flexibility and fault tolerance of the multi-core system by allowing state transitions and protocol flows to be changed. This design can protect the system from design defects that may lead to severe errors, and it performs well in handling deadlocks caused by cache coherence. The fault tolerance it provides gives the designer more freedom and opportunity. Simulation results indicate that it provides considerable scalability and prevents potential deadlocks at the cost of a slight performance loss and area overhead. The methodology described in this paper has been used in the design of a 64-core FT processor, where it ensures the correctness of the cache coherence protocol without completely reworking the initial design. Moreover, it improves the robustness of the protocol and eliminates potential deadlocks with only a slight impact on system performance.
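The idea of changing state transitions and protocol flow after the design is fixed can be illustrated with a table-driven directory. This is a minimal sketch and not the CDDU design: the MSI-style states, the request names (`GetS`, `GetM`), the action strings, and the `ConfigurableDirectory` class are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical default transitions: (state, request) -> (next_state, action).
DEFAULT_TABLE = {
    ("I", "GetS"): ("S", "fetch_from_memory"),
    ("I", "GetM"): ("M", "fetch_from_memory"),
    ("S", "GetS"): ("S", "fetch_from_memory"),
    ("S", "GetM"): ("M", "invalidate_sharers"),
    ("M", "GetS"): ("S", "forward_to_owner"),
    ("M", "GetM"): ("M", "forward_to_owner"),
}


@dataclass
class ConfigurableDirectory:
    """Directory whose transition table can be overridden at run time."""
    table: dict = field(default_factory=lambda: dict(DEFAULT_TABLE))
    state: dict = field(default_factory=dict)  # line address -> state

    def reconfigure(self, state, request, next_state, action):
        """Replace one transition, e.g. to route around a faulty flow."""
        self.table[(state, request)] = (next_state, action)

    def handle(self, addr, request):
        """Apply the configured transition and return the action taken."""
        current = self.state.get(addr, "I")  # untracked lines are Invalid
        next_state, action = self.table[(current, request)]
        self.state[addr] = next_state
        return action


if __name__ == "__main__":
    d = ConfigurableDirectory()
    # Suppose the owner-forwarding flow were found to be defective:
    # switch to a (hypothetical) write-back-then-serve flow instead.
    d.reconfigure("M", "GetS", "S", "writeback_then_fetch")
    print(d.handle(0x40, "GetM"))
    print(d.handle(0x40, "GetS"))
```

Keeping the transitions in data rather than in fixed logic is what lets a single entry be patched without touching the rest of the protocol.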
    A Real-Time Processor Model with Timing Semantics
    Wang Chao, Chen Xianglan, Zhang Bo, Li Xi, Wang Chao, Zhou Xuehai
    Journal of Computer Research and Development    2021, 58 (6): 1176-1191.   DOI: 10.7544/issn1000-1239.2021.20210157
    Real-time embedded systems (RTES) are the computing and control core of safety-critical equipment. The software and hardware of an RTES must offer timing determinism and timing predictability to ensure the correctness of its timing behavior. However, nearly every abstraction in modern computer systems fails to provide timing semantics, and therefore cannot meet the safety design requirements of hard real-time systems. In this paper, we focus on the lack of timing semantics at the instruction set architecture level and attempt to redefine the instruction set and microarchitecture of RTES. First, we propose the real-time machine (RTM), a real-time computer architecture model with timing semantics. Then, drawing on the theory of time-triggered automata, we construct TTI, a timed instruction set, as the software/hardware interface of RTM. We also discuss the completeness of the timing semantics of TTI. Finally, we design and implement the real-time processing unit (RPU), and establish its timing determinism by comparing theoretical analysis with experimental results. The LET programming model is a real-time programming paradigm widely recognized in academia. In this article, we illustrate the effectiveness of RTM and TTI with an example of running LET tasks on RPU.
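For readers unfamiliar with LET (commonly expanded as Logical Execution Time), the sketch below shows its central property: a task reads its inputs at release and its outputs become visible exactly at the end of a fixed logical interval, regardless of how long the computation actually takes. This is a generic illustration of the LET model, not the RTM or TTI mechanism from the paper; the function name and parameters are hypothetical.

```python
def let_schedule(period, let, exec_times):
    """Return (read_time, publish_time) for each job of a periodic LET task.

    period     -- release period of the task
    let        -- logical execution time (must not exceed the period)
    exec_times -- actual execution time of each job (hypothetical inputs)
    """
    if let > period:
        raise ValueError("logical execution time exceeds the period")
    schedule = []
    for job, actual in enumerate(exec_times):
        if actual > let:
            # The job would miss its logical deadline: a timing violation.
            raise ValueError(f"job {job} overruns its logical execution time")
        release = job * period
        # Inputs are sampled at release; outputs appear at release + let,
        # independent of the actual execution time.
        schedule.append((release, release + let))
    return schedule


if __name__ == "__main__":
    # Three jobs with different actual execution times.
    for read_t, publish_t in let_schedule(10, 8, [2, 7, 5]):
        print(f"read at t={read_t}, publish at t={publish_t}")
```

Because the publish times depend only on the period and the logical interval, the observable timing is the same for every run, which is the kind of determinism a timed instruction set is meant to guarantee in hardware.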
    A High Performance Accelerator Design for Ultra-Long Point Floating-Point FFT
    Wang Di, Shi Song, Wu Tiebin, Liu Liang, Tan Hongbing, Hao Ziyu, Guo Feng, Li Hongliang
    Journal of Computer Research and Development    2021, 58 (6): 1192-1203.   DOI: 10.7544/issn1000-1239.2021.20210069
    The fast Fourier transform (FFT) plays a key role in digital signal processing. As demand for high-performance ultra-long point FFT grows, digital signal processors (DSPs) find it increasingly difficult to keep up, so integrated FFT accelerators have become an important development trend. To support ultra-long point FFT, this paper extends the two-dimensional decomposition algorithm of FFT to multiple dimensions and proposes a high-performance ultra-long point FFT accelerator architecture that can be integrated into a DSP. In this architecture, the three-dimensional transposition operation is realized with a collision-free addressing method using a prime number of memory banks; efficient twiddle factor generation is realized with a recursive algorithm; and the FFT operation circuit is refined using single-precision floating-point fused dot product and fused add-subtract operations. As a result, the design supports single-precision floating-point FFT of up to 4G points. The synthesis results show that the proposed FFT accelerator can run at more than 1 GHz and reach 640 Gflop/s, a large improvement in both point count and performance over existing research.
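The two-dimensional decomposition mentioned above can be sketched in a few lines. For N = N1·N2, the input is viewed as an N1×N2 array, short DFTs are taken along one dimension, the results are scaled by twiddle factors, and short DFTs are taken along the other dimension. The sketch below is a software illustration under stated assumptions, not the accelerator's datapath: the sub-transforms use a naive DFT as a placeholder, and the twiddle factors are produced by the simple recurrence w[k+1] = w[k]·step, which stands in for the paper's recursive generator. All function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used as a placeholder for the short sub-transforms."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def twiddle_powers(n, b, count):
    """Return W^0 .. W^(count-1) for W = exp(-2*pi*i*b/n), by recurrence."""
    step = cmath.exp(-2j * cmath.pi * b / n)
    powers, w = [], 1 + 0j
    for _ in range(count):
        powers.append(w)
        w *= step  # one complex multiply per factor, no further exp() calls
    return powers


def fft_2d_decomposed(x, n1, n2):
    """DFT of length n1*n2 via the two-dimensional decomposition."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Step 1: for each residue b = index mod n2, a length-n1 DFT over
    # the elements x[b], x[b + n2], x[b + 2*n2], ...
    cols = [dft([x[n2 * a + b] for a in range(n1)]) for b in range(n2)]
    # Step 2: multiply cols[b][k1] by the twiddle factor W_n^(b*k1).
    for b in range(n2):
        tw = twiddle_powers(n, b, n1)
        cols[b] = [c * t for c, t in zip(cols[b], tw)]
    # Step 3: for each k1, a length-n2 DFT across b; the result for (k1, k2)
    # lands at output index k1 + n1*k2 (the transposition step).
    out = [0j] * n
    for k1 in range(n1):
        row = dft([cols[b][k1] for b in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out


if __name__ == "__main__":
    data = [complex(i, -0.5 * i) for i in range(12)]
    err = max(abs(a - b) for a, b in zip(dft(data), fft_2d_decomposed(data, 3, 4)))
    print(f"max deviation from direct DFT: {err:.2e}")
```

Applying the same split again to the length-n1 or length-n2 sub-transforms gives a three-dimensional (or deeper) decomposition. The recurrence accumulates rounding error over long runs, so a production generator would typically reseed it periodically.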
    Survey on Graph Neural Network Acceleration Architectures
    Li Han, Yan Mingyu, Lü Zhengyang, Li Wenming, Ye Xiaochun, Fan Dongrui, Tang Zhimin
    Journal of Computer Research and Development    2021, 58 (6): 1204-1229.   DOI: 10.7544/issn1000-1239.2021.20210166
    Recently, emerging graph neural networks (GNNs) have received extensive attention from academia and industry due to their powerful graph learning and reasoning capabilities, and are considered a core force in advancing artificial intelligence into the "cognitive intelligence" stage. Since GNNs combine the execution processes of traditional graph processing and neural networks, they naturally exhibit a hybrid execution pattern in which irregular and regular computation and memory access behaviors coexist. Traditional processors and existing graph processing and neural network acceleration architectures cannot cope with these two opposing behaviors at the same time, and therefore cannot meet the acceleration requirements of GNNs. To solve these problems, acceleration architectures tailored for GNNs continue to emerge. They customize computing hardware units and on-chip storage hierarchies for GNNs, optimize computation and memory access behaviors, and have achieved good acceleration. Based on the challenges faced in designing GNN acceleration architectures, this paper systematically analyzes and introduces the overall structure design and the key optimization techniques in this field from three aspects: computation, on-chip memory access, and off-chip memory access. Finally, future directions for GNN acceleration architecture design are discussed from different angles, in the hope of providing some inspiration to researchers in this field.
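The hybrid execution pattern described above is visible even in a toy GNN layer. In the sketch below, the aggregation phase walks each vertex's neighbor list (irregular, data-dependent memory access, as in graph processing), while the combination phase is a dense matrix multiply with an activation (regular access, as in neural networks). This is a minimal illustration using sum aggregation and ReLU; the function names and the choice of aggregator are assumptions, not taken from the survey.

```python
def aggregate(features, neighbors):
    """Irregular phase: sum each vertex's features with its neighbors'.

    features  -- list of per-vertex feature vectors
    neighbors -- adjacency list; neighbors[v] lists the vertices adjacent to v
    """
    dim = len(features[0])
    out = []
    for v, nbrs in enumerate(neighbors):
        acc = list(features[v])          # include the vertex itself
        for u in nbrs:                   # access pattern depends on the graph
            for d in range(dim):
                acc[d] += features[u][d]
        out.append(acc)
    return out


def combine(aggregated, weight):
    """Regular phase: dense (vertices x in_dim) @ (in_dim x out_dim), then ReLU."""
    in_dim, out_dim = len(weight), len(weight[0])
    return [[max(0.0, sum(row[i] * weight[i][j] for i in range(in_dim)))
             for j in range(out_dim)]
            for row in aggregated]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    adj = [[1, 2], [0], [0]]             # vertex 0 is linked to 1 and 2
    w = [[1.0, -1.0], [0.5, 0.0]]
    print(combine(aggregate(feats, adj), w))
```

An accelerator has to serve both loops well: the first is bound by scattered neighbor reads whose count varies per vertex, the second by arithmetic throughput on fixed-shape operands.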
    DMR: An Out-of-Order Superscalar General-Purpose CPU Core Based on RISC-V
    Sun Caixia, Zheng Zhong, Deng Quan, Sui Bingcai, Wang Yongwen, Ni Xiaoqiang
    Journal of Computer Research and Development    2021, 58 (6): 1230-1233.   DOI: 10.7544/issn1000-1239.2021.20210176
    DMR is a RISC-V based out-of-order superscalar general-purpose CPU core from the College of Computer Science and Technology, National University of Defense Technology. It supports all three privilege levels (user mode, supervisor mode, and machine mode) and implements the standard RISC-V RV64G instruction set. In addition, DMR is extended with custom vector instructions. The virtual-memory system supports Sv39 and Sv48, and the physical address is 44 bits wide. The pipeline for single-cycle integer instructions has 12 stages in total. All instructions are executed out of program order and committed in program order. More than four instructions can be issued per cycle. Distributed scheduling queues are used, and at most 9 instructions can be scheduled out of order for execution in one cycle. A multi-layer, multi-platform functional verification method driven by functional coverage is used, and Linux has already been booted on an FPGA prototype system. DMR reaches 5.12 CoreMark/MHz and targets a 2 GHz clock speed in 14 nm technology.
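As background on the two virtual-memory schemes named above: in the RISC-V privileged architecture, Sv39 and Sv48 both use 4 KiB pages (a 12-bit page offset) and 9-bit virtual page number (VPN) fields, with three page-table levels for Sv39 and four for Sv48. The sketch below splits a virtual address into those fields. It illustrates the address format only and says nothing about DMR's MMU implementation; the function name is hypothetical, and the check that upper bits are a sign extension of the top VPN bit is omitted for brevity.

```python
PAGE_OFFSET_BITS = 12   # 4 KiB pages
VPN_FIELD_BITS = 9      # 512 entries per page table
PHYS_ADDR_BITS = 44     # physical address width quoted in the abstract

LEVELS = {"Sv39": 3, "Sv48": 4}


def split_virtual_address(va, scheme):
    """Return ([VPN fields, highest level first], page_offset)."""
    levels = LEVELS[scheme]
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = []
    for level in range(levels):
        shift = PAGE_OFFSET_BITS + VPN_FIELD_BITS * level
        vpn.append((va >> shift) & ((1 << VPN_FIELD_BITS) - 1))
    vpn.reverse()  # walk order: the root table is indexed by the top field
    return vpn, offset


if __name__ == "__main__":
    va = (5 << 30) | (3 << 21) | (7 << 12) | 0xABC
    print(split_virtual_address(va, "Sv39"))
    # 44-bit physical addresses cover 2**44 bytes, i.e. 16 TiB.
    print(f"physical address space: {2 ** PHYS_ADDR_BITS // 2 ** 40} TiB")
```

The field widths add up to the scheme names: 3 × 9 + 12 = 39 bits and 4 × 9 + 12 = 48 bits.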
    A Self-Designed Heterogeneous Accelerator for Exascale High Performance Computing
    Liu Sheng, Lu Kai, Guo Yang, Liu Zhong, Chen Haiyan, Lei Yuanwu, Sun Haiyan, Yang Qianming, Chen Xiaowen, Chen Shenggang, Liu Biwei, Lu Jianzhuang
    Journal of Computer Research and Development    2021, 58 (6): 1234-1237.   DOI: 10.7544/issn1000-1239.2021.20210189
    High performance computing (HPC) is one of the foundational fields driving the development of science and technology. The exascale HPC era, recognized as "the next crown of supercomputer", is coming. The accelerator field for exascale HPC has gradually become the arena for the most high-end chips in the world. Internationally renowned companies such as AMD, NVIDIA, and Intel have occupied this field for several years. As one of the organizations in China that independently design processors, National University of Defense Technology (NUDT) has always been a strong competitor in the HPC accelerator field. This paper introduces an accelerator for exascale HPC designed independently by NUDT. It adopts a heterogeneous architecture with a CPU and a general-purpose digital signal processor (GPDSP). It offers high performance, high efficiency, and high programmability, and is expected to be the key computing chip of our new exascale supercomputer system.
    YHHX Agile Switching Chip for Mobile High-End Equipment
    Yang Hui, Li Tao, Liu Rulin, Lü Gaofeng, Sun Zhigang
    Journal of Computer Research and Development    2021, 58 (6): 1238-1241.   DOI: 10.7544/issn1000-1239.2021.20210169
    Centralized computing platforms integrate diverse heterogeneous resources and provide high computing power under constrained conditions, and have become a research hotspot for the new generation of electronic information systems. Existing commercial Ethernet switching chips are designed for large-scale, high-performance networks such as data centers, and cannot meet the requirements of centralized computing platforms. To provide high-efficiency, low-power agile connection capability for mobile high-end equipment, the College of Computer Science and Technology, National University of Defense Technology proposes an end-to-end agile switching solution and has independently developed the YHHX (Yin He Heng Xin)-DS40 agile switching chip, which is a powerful supplement to existing domestic Ethernet switching chips. YHHX-DS40 integrates four 10 Gigabit Ethernet interfaces and four Gigabit Ethernet interfaces with full line-rate switching capability. In addition, it supports application-layer fine-grained switching and unified connection of heterogeneous resources with a typical power consumption of only 1.6 W.
    HX-DS09: A Customized Low Power Time Sensitive Networking Chip for High-End Equipment
    Quan Wei, Fu Wenwen, Sun Zhigang, Li Tao
    Journal of Computer Research and Development    2021, 58 (6): 1242-1245.   DOI: 10.7544/issn1000-1239.2021.20210164
    As a new network technology that can provide high-bandwidth, highly deterministic transmission services, time sensitive networking (TSN) has received extensive attention and research from academia and industry in recent years. However, most domestic TSN devices currently use foreign TSN chips or hardware IP. There is no independently developed TSN chip available for upgrading the networks of core equipment. To this end, the OpenTSN team has developed HX-DS09, a low-power TSN chip for networks in high-end equipment. The chip can provide sub-microsecond synchronization accuracy, single-hop data transmission delay and jitter. HX-DS09 can work in endpoint, switching, and switching-endpoint modes, and its power consumption is less than 0.5 W. It can satisfy the diverse deterministic networking needs of high-end equipment.