ISSN 1000-1239 CN 11-1777/TP

Table of Contents

01 June 2021, Volume 58 Issue 6
Agile Design of Processor Chips: Issues and Challenges
Bao Yungang, Chang Yisong, Han Yinhe, Huang Libo, Li Huawei, Liang Yun, Luo Guojie, Shang Li, Tang Dan, Wang Ying, Xie Biwei, Yu Wenjian, Zhang Ke, Sun Ninghui
2021, 58(6):  1131-1145.  doi:10.7544/issn1000-1239.2021.20210232
Processor chip design currently relies on a performance-oriented method that focuses on joint optimization of chip frequency, area, and power consumption through multi-step, repetitive iterations with modern electronic design automation (EDA) techniques. This conventional methodology results in significant cost, long development cycles, and a high technical threshold. In this paper, we introduce an object-oriented architecture (OOA) paradigm, borrowing the idea from software engineering, and propose an OOA-based agile processor design methodology. Unlike the conventional performance-oriented method, the proposed OOA-based agile design method mainly aims to shorten the development cycle and to reduce cost and complexity without sacrificing performance and reliability; this is evaluated with a new metric, agile degree. OOA is expected to implement a series of decomposable, composable, and extensible objects in the architectures of both general-purpose CPUs and application-specific XPUs via the object-oriented design paradigm, languages, and EDA tools. We further summarize the research progress in each technical field covered by OOA, and analyze the challenges that may arise in future research on OOA-based agile design methodology.
A Proposal of Software-Hardware Decoupling Hardware Design Method for Brain-Inspired Computing
Qu Peng, Chen Jiajie, Zhang Youhui, Zheng Weimin
2021, 58(6):  1146-1154.  doi:10.7544/issn1000-1239.2021.20210170
Brain-inspired computing is a novel research field involving multiple disciplines, which may have important implications for the development of computational neuroscience, artificial intelligence, and computer architectures. Currently, one of the key problems in this field is that brain-inspired software and hardware are usually tightly coupled. A recent study has proposed the notion of neuromorphic completeness and a corresponding system hierarchy design. This completeness provides theoretical support for decoupling the hardware and software of brain-inspired computing systems, and the system hierarchy design can be viewed as a reference implementation of neuromorphic-complete software and hardware. As a position paper, this article first discusses several key concepts of neuromorphic completeness and the system hierarchy for brain-inspired computing. Then, as follow-up work, we propose a software-hardware decoupled hardware design method for brain-inspired computing, namely, an iterative optimization process consisting of execution primitive set design and hardware implementation evaluation. Finally, we show the preliminary status of our research on an FPGA-based evaluation platform. We believe that this method will contribute to the realization of extensible, neuromorphic-complete computation primitive sets and chips, which would help decouple hardware and software in brain-inspired computing systems.
Shenwei-26010: A High-Performance Many-Core Processor
Hu Xiangdong, Ke Ximing, Yin Fei, Zhao Xin, Ma Yongfei, Yan Shiyun, Ma Chao
2021, 58(6):  1155-1165.  doi:10.7544/issn1000-1239.2021.20201041
Based on the multi-core processor Shenwei 1600, the high-performance many-core processor Shenwei 26010 adopts SoC (system on chip) technology and integrates 4 computing-control cores and 256 computing cores in a single chip. It adopts a 64-bit RISC (reduced instruction set computer) instruction set of original design, and supports 256-bit SIMD (single instruction multiple data) integer and floating-point vector-acceleration operations. Its peak performance for double-precision floating-point operations reaches 3.168 TFLOPS. The Shenwei 26010 processor is manufactured using 28 nm process technology. The die area of the chip is more than 500 mm², and the 260 cores of the chip can run stably at a frequency of 1.5 GHz. The Shenwei 26010 processor adopts a variety of low-power designs at the architecture level, the microarchitecture level, and the circuit level, leading to a peak energy-efficiency ratio of 10.559 GFLOPS/W. Notably, both the operating frequency and the energy-efficiency ratio of the chip are higher than those of contemporary processor products worldwide. Through technical innovations in high-frequency design, stability and reliability design, and yield design, Shenwei 26010 has effectively solved the issues of high frequency targets, the power consumption wall, stability and reliability, and yield, all of which are encountered when pursuing the goal of high-performance computing. It has been successfully applied on a large scale in the 100 PFLOPS supercomputer system “Sunway TaihuLight”, and can therefore adequately meet the computing requirements of both scientific and engineering applications.
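The figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is only an illustration: the per-core flops-per-cycle values are assumptions that do not appear in the abstract, chosen because they are consistent with 256-bit SIMD and happen to reproduce the stated peak. The implied power figure is simply the stated peak divided by the stated efficiency, not a measured value.

```python
# Figures quoted in the abstract.
PEAK_TFLOPS = 3.168
EFFICIENCY_GFLOPS_PER_W = 10.559
FREQ_GHZ = 1.5
CONTROL_CORES = 4
COMPUTE_CORES = 256

# ASSUMPTIONS (not stated in the abstract): double-precision flops per cycle.
# 8 = 4 doubles per 256-bit vector x 2 for a fused multiply-add.
FLOPS_PER_CYCLE_COMPUTE = 8
# 16 = assumed two such vector pipelines in each computing-control core.
FLOPS_PER_CYCLE_CONTROL = 16


def peak_gflops():
    """Peak GFLOPS under the assumed per-core throughput."""
    compute = COMPUTE_CORES * FLOPS_PER_CYCLE_COMPUTE * FREQ_GHZ
    control = CONTROL_CORES * FLOPS_PER_CYCLE_CONTROL * FREQ_GHZ
    return compute + control


def implied_power_watts():
    """Power implied by dividing the stated peak by the stated efficiency."""
    return PEAK_TFLOPS * 1000.0 / EFFICIENCY_GFLOPS_PER_W


if __name__ == "__main__":
    print(f"peak under assumptions: {peak_gflops():.1f} GFLOPS")
    print(f"implied power at peak:  {implied_power_watts():.1f} W")
```

Under these assumptions the computed peak matches the quoted 3.168 TFLOPS, and the two quoted figures together imply roughly 300 W at peak.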
Design and Implementation of Configurable Cache Coherence Protocol for Multi-Core Processor
Chen Zhiqiang, Zhou Hongwei, Feng Quanyou, Deng Rangyu
2021, 58(6):  1166-1175.  doi:10.7544/issn1000-1239.2021.20210174
In a multi-core system, cache coherence must be maintained. Common cache coherence protocols can be divided into snoop-based and directory-based protocols. Directory-based protocols have better scalability and lower latency, and can be applied to more applications. According to the size of the directory, they can be divided into centralized-directory and distributed-directory designs. A distributed directory takes up less space and takes less time to query. However, cache coherence based on a distributed directory is hard to design and verify. To reduce the risk in CPU design, a configurable distributed directory unit (CDDU) is proposed. It increases the flexibility and fault tolerance of the multi-core system by allowing state transitions and protocol flows to be changed. This design can protect the system from design defects that may lead to severe errors, and it performs well in dealing with deadlock problems caused by cache coherence. It provides considerable fault tolerance, which gives the designer more freedom and opportunity. Simulation results indicate that it provides considerable scalability and prevents potential deadlocks at the cost of a slight performance loss and area overhead. The methodology described in this paper has been used in the design of a 64-core FT processor, where it ensures the correctness of the cache coherence protocol without completely modifying the initial design. Moreover, it improves the robustness of the protocol and eliminates potential deadlocks with only a slight impact on system performance.
A Real-Time Processor Model with Timing Semantics
Wang Chao, Chen Xianglan, Zhang Bo, Li Xi, Wang Chao, Zhou Xuehai
2021, 58(6):  1176-1191.  doi:10.7544/issn1000-1239.2021.20210157
Real-time embedded systems (RTES) are the computing and control core of safety-critical equipment. The software and hardware of an RTES are required to have timing determinism and timing predictability to ensure the correctness of its timing behavior. However, nearly every abstraction of modern computer systems fails to provide timing semantics, and therefore cannot meet the security design requirements of hard real-time systems. In this paper, we focus on the lack of timing semantics in the instruction set architecture and try to redefine the instruction set and microarchitecture of RTES. First, we propose the real-time machine (RTM), a real-time computer architecture model with timing semantics. Then, drawing on the theory of time-triggered automata, we construct TTI, a timed instruction set, as the software/hardware interface of RTM. We also discuss the completeness of the timing semantics of TTI. Finally, we design and implement the real-time processing unit (RPU), and establish its timing determinism by comparing theoretical analysis with experimental results. The LET programming model is a real-time programming paradigm widely recognized by academia. In this article, we illustrate the effectiveness of RTM and TTI with an example of running LET tasks on the RPU.
A High Performance Accelerator Design for Ultra-Long Point Floating-Point FFT
Wang Di, Shi Song, Wu Tiebin, Liu Liang, Tan Hongbing, Hao Ziyu, Guo Feng, Li Hongliang
2021, 58(6):  1192-1203.  doi:10.7544/issn1000-1239.2021.20210069
The fast Fourier transform (FFT) plays a key role in digital signal processing. With the increasing demand for high-performance ultra-long point FFT, digital signal processors (DSPs) find it increasingly difficult to meet this demand, so integrated FFT accelerators have become an important development trend. To support ultra-long point FFT, this paper extends the two-dimensional decomposition algorithm of FFT to multiple dimensions and proposes a high-performance ultra-long point FFT accelerator architecture that can be integrated into a DSP. In this architecture, the three-dimensional transposition operation is realized with a collision-free addressing method using a prime number of memory banks; efficient twiddle factor generation is realized by a recursive algorithm; and the FFT operation circuit is refined using single-precision floating-point fused dot product and fused add-subtract operations. Finally, this paper realizes single-precision floating-point FFT calculation for up to 4G points. The synthesis result shows that the proposed FFT accelerator can run at a frequency of more than 1 GHz and that its performance can reach 640 Gflop/s, a significant improvement in both supported point count and performance over existing research.
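The two-dimensional decomposition that this abstract starts from can be sketched in software. The following is a minimal illustration of the textbook decomposition of a length N = N1 × N2 transform into column transforms, a twiddle multiplication, and row transforms; it is not the accelerator's design. The function names are hypothetical, the sub-transforms use a naive DFT for brevity, and the twiddle recurrence is only one simple reading of "recursive" generation, assumed here for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used both as the 1-D building block and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def twiddles(n):
    """W_n^0 .. W_n^(n-1) via the recurrence w[k] = w[k-1] * W_n (an assumed scheme)."""
    step = cmath.exp(-2j * cmath.pi / n)
    w = [1 + 0j]
    for _ in range(1, n):
        w.append(w[-1] * step)
    return w


def fft_2d_decomposition(x, n1, n2):
    """Length n1*n2 DFT built from length-n1 and length-n2 DFTs.

    Input index is n2*a + b, output index is k1 + n1*k2.
    """
    n = n1 * n2
    assert len(x) == n
    w = twiddles(n)
    # Step 1: length-n1 DFT over a, for each fixed b (the "columns").
    cols = [dft([x[n2 * a + b] for a in range(n1)]) for b in range(n2)]
    out = [0j] * n
    for k1 in range(n1):
        # Step 2: multiply by the twiddle factor W_n^(b*k1).
        row = [cols[b][k1] * w[b * k1] for b in range(n2)]
        # Step 3: length-n2 DFT over b, written out in transposed order.
        spec = dft(row)
        for k2 in range(n2):
            out[k1 + n1 * k2] = spec[k2]
    return out


if __name__ == "__main__":
    data = [complex(i, (i * i) % 5) for i in range(12)]
    ref = dft(data)
    got = fft_2d_decomposition(data, 3, 4)
    print(max(abs(a - b) for a, b in zip(ref, got)))  # small rounding error
```

The write to `out[k1 + n1 * k2]` is the transposition step; in a multi-dimensional extension the same pattern repeats with one more level of sub-transforms and twiddles.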
Survey on Graph Neural Network Acceleration Architectures
Li Han, Yan Mingyu, Lü Zhengyang, Li Wenming, Ye Xiaochun, Fan Dongrui, Tang Zhimin
2021, 58(6):  1204-1229.  doi:10.7544/issn1000-1239.2021.20210166
Recently, emerging graph neural networks (GNNs) have received extensive attention from academia and industry due to their powerful graph learning and reasoning capabilities, and are considered a core force driving artificial intelligence into the “cognitive intelligence” stage. Since GNNs integrate the execution processes of both traditional graph processing and neural networks, a hybrid execution pattern naturally arises in which irregular and regular computation and memory access behaviors coexist. Because of this execution pattern, traditional processors and existing graph processing and neural network acceleration architectures cannot cope with the two opposing execution behaviors at the same time, and cannot meet the acceleration requirements of GNNs. To solve these problems, acceleration architectures tailored for GNNs continue to emerge. They customize computing hardware units and on-chip storage hierarchies for GNNs, optimize computation and memory access behaviors, and have achieved good acceleration. Based on the challenges faced in designing GNN acceleration architectures, this paper systematically analyzes and introduces the overall structure design and the key optimization technologies in this field from three aspects: computation, on-chip memory access, and off-chip memory access. Finally, future directions of GNN acceleration architecture design are discussed from different angles, in the hope of offering some inspiration to researchers in this field.
DMR: An Out-of-Order Superscalar General-Purpose CPU Core Based on RISC-V
Sun Caixia, Zheng Zhong, Deng Quan, Sui Bingcai, Wang Yongwen, Ni Xiaoqiang
2021, 58(6):  1230-1233.  doi:10.7544/issn1000-1239.2021.20210176
DMR is a RISC-V based out-of-order superscalar general-purpose CPU core from the College of Computer Science and Technology, National University of Defense Technology. All three privilege levels (user mode, supervisor mode, and machine mode) are supported, and the standard RISC-V RV64G instruction set is implemented. In addition, DMR is extended with custom vector instructions. Sv39 and Sv48 are supported for the virtual-memory system, and the physical address is 44 bits wide. The pipeline for single-cycle integer instructions has 12 stages in all. All instructions are executed out of program order and committed in program order. More than four instructions can be issued per cycle. Distributed scheduling queues are used, and at most 9 instructions can be scheduled out of order for execution in one cycle. A multi-layer, multi-platform functional verification method driven by functional coverage is used, and Linux has already been booted on an FPGA prototype system. DMR reaches 5.12 CoreMark/MHz and targets a 2 GHz clock speed in 14 nm technology.
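For readers unfamiliar with the Sv39 and Sv48 schemes mentioned in this abstract, the sketch below shows how a virtual address is split into page-table indices under the RISC-V privileged specification: a 12-bit page offset followed by 9-bit virtual page number (VPN) fields, three for Sv39 and four for Sv48. The function name is hypothetical and nothing here is specific to DMR's implementation.

```python
PAGE_OFFSET_BITS = 12  # 4 KiB base pages
VPN_BITS = 9           # 512 entries per page-table level


def split_virtual_address(va, levels):
    """Split a virtual address into ([VPN[0], VPN[1], ...], page_offset).

    Use levels=3 for Sv39 and levels=4 for Sv48.
    """
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    vpn = [
        (va >> (PAGE_OFFSET_BITS + VPN_BITS * i)) & ((1 << VPN_BITS) - 1)
        for i in range(levels)
    ]
    return vpn, offset


if __name__ == "__main__":
    va = (5 << 30) | (3 << 21) | (1 << 12) | 0xABC
    print(split_virtual_address(va, 3))
```

A page-table walk consumes the VPN fields from the highest index down, so Sv48 simply adds one more level above the three used by Sv39.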
A Self-Designed Heterogeneous Accelerator for Exascale High Performance Computing
Liu Sheng, Lu Kai, Guo Yang, Liu Zhong, Chen Haiyan, Lei Yuanwu, Sun Haiyan, Yang Qianming, Chen Xiaowen, Chen Shenggang, Liu Biwei, Lu Jianzhuang
2021, 58(6):  1234-1237.  doi:10.7544/issn1000-1239.2021.20210189
High performance computing (HPC) is one of the fundamental fields that drive the development of science and technology. The exascale HPC era, recognized as “the next crown of supercomputer”, is coming. The accelerator field for exascale HPC has gradually become an arena for the most high-end chips in the world. Internationally renowned companies such as AMD, NVIDIA, and Intel have occupied this field for several years. As one of the organizations in China that design processors independently, the National University of Defense Technology (NUDT) has always been a strong competitor in the HPC accelerator field. This paper introduces an accelerator for exascale HPC designed independently by NUDT. It adopts a heterogeneous architecture combining a CPU and a general-purpose digital signal processor (GPDSP). It features high performance, high efficiency, and high programmability, and is expected to be the key computing chip of our new exascale supercomputer system.
YHHX Agile Switching Chip for Mobile High-End Equipment
Yang Hui, Li Tao, Liu Rulin, Lü Gaofeng, Sun Zhigang
2021, 58(6):  1238-1241.  doi:10.7544/issn1000-1239.2021.20210169
Centralized computing platforms integrate diverse heterogeneous resources and provide high computing power under limited conditions, and have become a research hotspot for the new generation of electronic information systems. Existing commercial Ethernet switching chips are designed for large-scale, high-performance networks such as data centers, and cannot meet the requirements of centralized computing platforms. To provide high-efficiency, low-power agile connection capability for mobile high-end equipment, the College of Computer Science and Technology, National University of Defense Technology proposes an end-to-end agile switching solution and independently develops the YHHX (Yin He Heng Xin)-DS40 agile switching chip, which is a powerful supplement to the existing domestic Ethernet switching chips. YHHX-DS40 integrates four 10 Gigabit Ethernet interfaces and four Gigabit Ethernet interfaces with full line-speed switching capability. In addition, application-layer fine-grained switching and unified connection of heterogeneous resources are supported with a typical power consumption of only 1.6 W.
HX-DS09: A Customized Low Power Time Sensitive Networking Chip for High-End Equipment
Quan Wei, Fu Wenwen, Sun Zhigang, Li Tao
2021, 58(6):  1242-1245.  doi:10.7544/issn1000-1239.2021.20210164
As a new network technology that can provide high-bandwidth, highly deterministic transmission services, time sensitive networking (TSN) has received extensive attention and research from academia and industry in recent years. However, most domestic TSN devices currently use foreign TSN chips or hardware IP. There is no independently developed TSN chip available for upgrading the networks of core equipment. To this end, the OpenTSN team has developed HX-DS09, a low-power TSN chip for networks in high-end equipment. The chip can provide sub-microsecond synchronization accuracy, single-hop data transmission delay and jitter. HX-DS09 can work in endpoint, switching, and switching-endpoint modes, and its power consumption is less than 0.5 W. It can satisfy the diverse deterministic networking needs of high-end equipment.
Energy Efficiency Evaluation Method of Data Centers for Cloud-Network Integration
Long Saiqin, Huang Jinna, Li Zhetao, Pei Tingrui, Xia Yuanqing
2021, 58(6):  1248-1260.  doi:10.7544/issn1000-1239.2021.20201069
Cloud-network integration is developing at an accelerated pace, which not only promotes the rapid growth of data center scale but also brings huge energy consumption. How to formulate reasonable data center energy efficiency evaluation standards has become a key issue that urgently needs to be solved in order to guide the improvement of data center energy efficiency. It is difficult to evaluate the energy efficiency of data centers comprehensively based on a single metric, and different data center energy efficiency metrics have their own focuses and may even contradict each other. This paper proposes integrating multiple metrics to evaluate the energy efficiency of data centers comprehensively. The model adopts a combination of subjective and objective weighting methods to set weights for different energy efficiency metrics. A multi-metric fusion evaluation strategy is designed based on the cloud model to obtain a more scientific and comprehensive data center energy efficiency evaluation result. Finally, the gray correlation method is used to analyze the relationship between the evaluation results and the various energy efficiency metrics. The analysis results provide important guidance for improving data center energy efficiency.
Design of an Intelligent Routing Algorithm to Reduce Routing Flap
Shao Tianzhu, Wang Xiaoliang, Chen Wenlong, Tang Xiaolan, Xu Min
2021, 58(6):  1261-1274.  doi:10.7544/issn1000-1239.2021.20201073
Recently, researchers have begun to focus on data-driven network protocol design methods to replace traditional protocol design methods that rely on human experts. While the resulting intelligent routing technology is developing rapidly, some problems remain to be solved urgently. This paper studies the large-scale route flapping caused by current intelligent routing algorithms during routing updates and the resulting decrease in network forwarding efficiency. An intelligent routing algorithm for route flapping suppression, named FSR (flap suppression routing), is proposed. While pursuing uniform link load across the entire network and making full use of its forwarding resources, FSR seeks the update plan that is most similar to the existing routing strategy. This reduces route flapping in each routing update cycle, reduces route convergence time, and improves overall network forwarding efficiency. Experiments show that the FSR algorithm can significantly improve routing convergence speed, increase network throughput by about 30% compared with the control algorithms, and significantly reduce path length and the probability of congestion.
A Fine-Grained Multi-Access Edge Computing Architecture for Cloud-Network Integration
Wang Lu, Zhang Jianhao, Wang Ting, Wu Kaishun
2021, 58(6):  1275-1290.  doi:10.7544/issn1000-1239.2021.20201076
The ever-increasing number of heterogeneous terminal devices has driven a paradigm shift in mobile computing, from the centralized mobile cloud towards the mobile edge. Multi-access edge computing (MEC) has emerged as a promising ecosystem to support multi-service and multi-tenancy. It takes advantage of both mobile computing and wireless communication technologies for cloud-network integration. However, the physical hardware constraints of the terminal devices, along with the limited connection capacity of the wireless channel, pose numerous challenges for cloud-network integration. The inability to control all the available resources (e.g., computation, communication, cache) is the main hurdle to realizing delay-sensitive and real-time services. To break this stalemate, this article investigates a software-defined fine-grained multi-access architecture, which takes full control of the computation and communication resources. We further investigate a Q-learning based two-stage resource allocation strategy to better cater to heterogeneous radio environments and various user requirements. We discuss the feasibility of the proposed architecture and demonstrate its effectiveness through extensive simulations.
Software Defined Virtualized Access Network Supporting Network Slicing and Green Communication
Wang Ting
2021, 58(6):  1291-1306.  doi:10.7544/issn1000-1239.2021.20201079
Software defined networking (SDN) is disrupting the traditional networking industry by shifting network control from physical network devices to centralized software, thus improving the scalability, flexibility, and efficiency of the network. In access networks, varied access technologies and a massive number of access devices lead to dramatically increased OPEX, which forces operators to find feasible solutions to increase the revenue-expenditure ratio and achieve a sustainable business model. To deal with these challenges, this paper presents SDVAN, a new SDN-based architecture for the access network, which provides cost-efficient network control and management with high scalability and customization support. SDVAN abstracts the control plane of all physical devices into one centralized controller, which enables flexible customization of the access network in a software-defined fashion. The node design implements a simple programmable node that naturally provides elastic support for various access technologies and efficiently improves resource utilization. To automate the orchestration of network services, resource modeling and network abstraction methodologies are introduced, which expose different levels of visibility and controllability based on the trust level. Lastly, SDVAN implements a network slicing function supporting multi-tenancy and multiple versions of network appliances. Theoretical analysis and experimental results demonstrate the effectiveness and practicality of the proposed SDVAN architecture.
Algorithm of Mixed Traffic Scheduling Among Data Centers Based on Prediction
Wang Ran, Zhang Yuchao, Wang Wendong, Xu Ke, Cui Laizhong
2021, 58(6):  1307-1317.  doi:10.7544/issn1000-1239.2021.20201087
To address the low link utilization that results from mixing online and offline traffic in one data center transmission network and separating them in a fixed way on the same link, we propose an offline traffic scheduling solution based on online traffic prediction. It first predicts the online traffic that must be guaranteed preferentially on a link using an algorithm called Sliding-k, which combines EWMA with a Bayesian changepoint detection algorithm. This customized algorithm makes the prediction sensitive to sudden changes in the network environment while reducing unnecessary re-adjustments when the network environment is steady. It can therefore meet the prediction demand under different network environments. After computing the remaining capacity for offline traffic from the online traffic prediction result and implementing dynamic bandwidth allocation, it schedules offline traffic with an algorithm called SEDF, which considers both traffic deadline and size. Experimental results show that Sliding-k can meet the prediction needs both when the network changes abruptly and when it is stable, while also improving on the accuracy of the traditional EWMA algorithm. The combination of Sliding-k and SEDF can improve the utilization of data center links and make full use of link resources.
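The prediction-then-allocation idea in this abstract can be illustrated with plain EWMA, one of the two ingredients it names. The sketch below is not Sliding-k: the Bayesian changepoint detection is omitted entirely, and the function names, the smoothing factor `alpha`, and the optional `headroom` margin are all hypothetical choices made for illustration.

```python
def ewma_forecast(samples, alpha=0.5):
    """Exponentially weighted moving average of `samples`.

    The final estimate is used as the forecast for the next interval.
    """
    estimate = samples[0]
    for s in samples[1:]:
        estimate = alpha * s + (1 - alpha) * estimate
    return estimate


def offline_budget(link_capacity, online_samples, alpha=0.5, headroom=0.0):
    """Capacity left for offline traffic after reserving the online forecast.

    `headroom` (hypothetical) inflates the forecast by a safety margin.
    """
    predicted = ewma_forecast(online_samples, alpha)
    return max(0.0, link_capacity - predicted * (1 + headroom))


if __name__ == "__main__":
    online = [4.0, 8.0, 6.0]             # observed online traffic, e.g. in Gbps
    print(ewma_forecast(online))          # forecast for the next interval
    print(offline_budget(10.0, online))   # remaining share for offline traffic
```

A larger `alpha` reacts faster to change but also follows noise more closely; that trade-off is the tension the abstract says Sliding-k addresses by adding changepoint detection.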
Online Joint Optimization Mechanism of Task Offloading and Service Caching for Multi-Edge Device Collaboration
Zhang Qiuping, Sun Sheng, Liu Min, Li Zhongcheng, Zhang Zengqi
2021, 58(6):  1318-1339.  doi:10.7544/issn1000-1239.2021.20201088
By deploying communication, computing, and storage resources on edge devices, mobile edge computing (MEC) can effectively overcome the long transmission distance and high response delay of traditional cloud computing. MEC can therefore satisfy the service requirements of emerging computation-intensive and delay-sensitive applications. Nevertheless, the resources of edge devices are limited and the workload among multiple edge devices is unbalanced in MEC. To address these problems, multi-edge device collaboration has become an inevitable trend. However, multi-edge device collaboration faces two challenges. First, task offloading and service caching are mutually coupled. Second, the workload and resource state of the edge devices vary in space and time. These two challenges significantly increase the difficulty of the problem. In response, this paper proposes an online joint optimization mechanism of task offloading and service caching for multi-edge device collaboration, and decouples the joint optimization problem into two sub-problems: service caching and task offloading. For the service caching sub-problem, a collaborative service caching algorithm based on the contextual combinatorial multi-armed bandit is proposed. For the task offloading sub-problem, a preference-based double-side matching algorithm is designed. Simulation results demonstrate that the proposed algorithm can efficiently reduce the overall execution delay of tasks and achieve workload balancing among edge devices.
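The abstract does not describe how its preference-based double-side matching works. As a rough illustration of the general idea of matching two sides by preference lists, the sketch below uses the classic deferred-acceptance (Gale-Shapley) procedure as a stand-in, with tasks proposing to edge devices. It assumes one task per device and complete preference lists on the device side; all names are hypothetical and this should not be read as the paper's algorithm.

```python
def deferred_acceptance(task_prefs, device_prefs):
    """One-to-one stable matching with tasks proposing. Returns {task: device}.

    task_prefs:   {task: [devices, most preferred first]}
    device_prefs: {device: [tasks, most preferred first]} (assumed complete)
    """
    # Lower rank means the device prefers that task more.
    rank = {d: {t: i for i, t in enumerate(prefs)}
            for d, prefs in device_prefs.items()}
    next_choice = {t: 0 for t in task_prefs}  # next device each task will try
    engaged = {}                               # device -> tentatively held task
    free = list(task_prefs)
    while free:
        t = free.pop()
        if next_choice[t] >= len(task_prefs[t]):
            continue  # task has exhausted its list and stays unmatched
        d = task_prefs[t][next_choice[t]]
        next_choice[t] += 1
        current = engaged.get(d)
        if current is None:
            engaged[d] = t             # device was idle: accept tentatively
        elif rank[d][t] < rank[d][current]:
            engaged[d] = t             # device prefers the newcomer
            free.append(current)
        else:
            free.append(t)             # rejected: task will try its next choice
    return {t: d for d, t in engaged.items()}


if __name__ == "__main__":
    tasks = {"t1": ["e1", "e2"], "t2": ["e1", "e2"]}
    devices = {"e1": ["t2", "t1"], "e2": ["t1", "t2"]}
    print(deferred_acceptance(tasks, devices))
```

In an offloading setting the preference lists would typically be derived from estimated delay on the task side and from load or cached services on the device side, but how the paper builds them is not stated in the abstract.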
Reversible Data Hiding of Image Encryption Based on Prediction Error Adaptive Coding
Yang Yaolin, He Hongjie, Chen Fan, Yuan Changqi
2021, 58(6):  1340-1350.  doi:10.7544/issn1000-1239.2021.20200172
To address the security problems of existing image encryption schemes and the low compression caused by poor coding, this paper proposes a reversible data hiding algorithm for encrypted images based on prediction error adaptive coding. In the image encryption stage, an image encryption algorithm based on error maintenance is designed. First, block scrambling and pixel modulation encryption are performed on 3×3 image blocks, and then the non-center pixels are grouped and scrambled according to the central pixel value of each image block. In the data embedding stage, adaptive coding is performed based on the prediction error distribution of the image; after the pixels are marked and classified with the coding table, the coding table and the additional data are hidden together in the encrypted image to generate a marked encrypted image. The experimental results show that the group scrambling operation in the encryption phase increases the number of blocks whose eigenvalues differ between the original image and the encrypted image, makes it difficult to determine the correspondence between image blocks before and after encryption, improves the security of the encrypted image, and preserves the overall prediction error distribution of the image. Compared with state-of-the-art algorithms, the average embedding rate is improved by more than 0.49 bpp, the additional data can be extracted losslessly, and the original image can be restored.
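The block scrambling step mentioned in this abstract, and the fact that it must be exactly invertible for the original image to be restored, can be illustrated with a keyed permutation of blocks. The sketch below is an assumption-laden toy: the function names are hypothetical, the key schedule is not the paper's, and Python's `random` module is not cryptographically secure, so this shows only the reversibility property.

```python
import random


def permutation(n, key):
    """Key-dependent permutation of range(n) (illustrative, not secure)."""
    order = list(range(n))
    random.Random(key).shuffle(order)
    return order


def scramble(blocks, key):
    """Reorder blocks so that position dst receives blocks[order[dst]]."""
    order = permutation(len(blocks), key)
    return [blocks[i] for i in order]


def unscramble(scrambled, key):
    """Invert scramble() by regenerating the same permutation from the key."""
    order = permutation(len(scrambled), key)
    restored = [None] * len(scrambled)
    for dst, src in enumerate(order):
        restored[src] = scrambled[dst]
    return restored


if __name__ == "__main__":
    # Each block stands in for one 3x3 tile, flattened to 9 pixel values.
    blocks = [[b * 9 + p for p in range(9)] for b in range(6)]
    mixed = scramble(blocks, key=2021)
    assert unscramble(mixed, key=2021) == blocks
    print("round trip ok")
```

Because scrambling only moves whole blocks, the pixel values inside each block are untouched, which is consistent with the abstract's point that the overall prediction error distribution is preserved.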