ISSN 1000-1239 CN 11-1777/TP

Table of Contents

15 June 2005, Volume 42 Issue 6
Research on High Performance Computer Technology Based on InfiniBand
Xie Xianghui, Peng Longgen, Wu Zhibing, and Lu Deping
2005, 42(6):  905-912. 
Network performance has long been the key bottleneck restricting the development of high performance computing: in both computing networks and storage networks, progress in communication lags behind that of the CPU. The InfiniBand interconnection architecture can close the performance gap between the network and the CPU and bring computing and communication in high performance computing into balance. To develop high performance interconnection components for HPC, research on the InfiniBand specification began in 2000, and InfiniBand network products under the “SunWay” brand were completed in 2003. This paper discusses the components, architecture, and applications of a high performance computing system based on InfiniBand technology and presents the results of performance tests.
CC-NUMA Architecture Based IO System Design
Wu Jiqing, Liu Hengzhu, and Wang Haitao
2005, 42(6):  913-917. 
In the CC-NUMA architecture, which is widely adopted in high performance computing, IO resources are distributed among the nodes and are managed and maintained in a decentralized way. This organization has some latent problems. These problems are first analyzed, and then a new organization of IO resources in CC-NUMA systems is put forward, based on available IO bus and storage network techniques. In addition, the design details of the key modules are described. Finally, the design is verified on the system platform, and the results show that the new IO system is efficient.
gDevice: A Protocol for the Grid-Enabling of the Computer Peripherals
Zhang Yuedong, Yang Yi, Fan Jianping, and Ma Jie
2005, 42(6):  918-923. 
The grid computer is one of the trends in future computer architectures, and grid-enabled components are the key elements of a grid computer system. The main characteristics of a grid-enabled component are grid entity, functional service, and intelligent interconnection; the key issues in grid-enabling computer components include device description, interconnection, resource sharing and multiplexing, and security. The gDevice protocol is intended for the grid-enabling of computer peripherals. The protocol has been partly validated in a grid computer console system called grid console.
SEA: A High-Performance Modular Long Integer Exponentiation Coprocessor
Zhao Xuemi, Lu Hongyi, Dai Kui, Tong Yuanman, and Wang Zhiying
2005, 42(6):  924-929. 
Modular exponentiation of long integers is the primary operation of several public-key algorithms and is often the bottleneck in their implementation. A high-performance modular exponentiation coprocessor, SEA, is presented here; it employs three novel techniques. First, a parallel binary modular exponentiation algorithm (PBME) is used to reduce the cycle count, and a high radix Montgomery modular multiplication algorithm is modified into the radix based high radix Montgomery modular multiplication algorithm (RBHRMMM) to increase the clock frequency. Second, when the algorithms are mapped to a systolic array, modular squaring and modular multiplication are computed alternately to hide the dependencies between iterations in the RBHRMMM algorithm, and bypassing is used to eliminate the dependencies in the PBME algorithm. Third, the multipliers are split first, and the accumulations are then compressed as partial products to reduce the carry propagation delay on the critical path. The SEA can complete a full 1024-bit modular exponentiation in 72738 cycles. It is implemented with standard cells, and its die area is 4.2 mm × 4.2 mm, equivalent to 739933 gates. The SEA has been taped out successfully in 0.18 μm 1P6M CMOS technology; its working frequency is 133 MHz, its power consumption is 962.26 mW, and it can finish a 1024-bit RSA signature in 316.9 μs.
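To make the two building blocks named in this abstract concrete, the following is a minimal software sketch of binary modular exponentiation built on Montgomery multiplication. It is not the SEA design: the function names (montgomery_setup, mont_mul, mod_exp) are hypothetical, the reduction is a plain whole-word REDC rather than the paper's high-radix systolic RBHRMMM, and the right-to-left bit scan is only assumed to resemble PBME. It does show why the two operations can overlap: within one iteration the multiply and the square both read the same old value of x.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n with R = 2**k > n."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R); needs Python 3.8+
    r2 = (r * r) % n                 # used to convert into Montgomery form
    return r, n_prime, r2


def mont_mul(a, b, n, k, n_prime):
    """Return a * b * R^-1 mod n using Montgomery reduction (REDC)."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k             # exact division by R
    return u - n if u >= n else u


def mod_exp(base, exp, n):
    """Right-to-left binary exponentiation: base**exp mod n, n odd and > 1."""
    k = n.bit_length()
    r, n_prime, r2 = montgomery_setup(n, k)
    x = mont_mul(base % n, r2, n, k, n_prime)   # base in Montgomery form
    acc = r % n                                  # 1 in Montgomery form
    while exp:
        if exp & 1:
            acc = mont_mul(acc, x, n, k, n_prime)   # multiply step
        x = mont_mul(x, x, n, k, n_prime)           # square step (independent of acc)
        exp >>= 1
    return mont_mul(acc, 1, n, k, n_prime)          # leave Montgomery form
```

For example, mod_exp(4, 13, 497) can be checked against Python's built-in pow(4, 13, 497).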
An Implementation of Reconfigurable Computing Accelerator Card Oriented Bioinformatics
Zhang Peiheng, Liu Xinchun, and Jiang Xianyang
2005, 42(6):  930-937. 
Since the completion of human genome sequencing, biologists have required greater processing and analysis power to handle the huge volume of gene data. Computing is a basic research method in bioinformatics, and many bioinformatics programs share common features, such as huge data volumes, relatively simple algorithms, few operation types, and many repeated processes, which suggests that these programs are potentially parallelizable. When run on a general-purpose computer, these programs not only waste a lot of system resources but also need complex maintenance, and many of them still cannot produce a satisfactory result within a limited time. A general algorithm-reconfigurable hardware accelerator architecture is presented, the principle of mapping the global Smith-Waterman algorithm to the hardware is discussed, and possible applications in other fields are pointed out.
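As a reference point for the algorithm named in this abstract, here is a minimal sketch of the textbook Smith-Waterman recurrence with a linear gap penalty. It is a plain software version, not the paper's hardware mapping; the function name and the scoring values (match=2, mismatch=-1, gap=-1) are assumptions chosen for illustration. Each cell depends only on its left, upper, and upper-left neighbours, so all cells on one anti-diagonal are independent, which is the property that parallel hardware typically exploits.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)        # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)     # current row; column 0 stays 0
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + s,     # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                cur[j - 1] + gap,    # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Only two rows are kept because the sketch returns the score alone; recovering the alignment itself would require the full matrix or traceback pointers.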
Parallel Modeling for Line Speed Approximate Content-Based Packet Classification
Li Xudong, Xu Yang, Li Jing, and Liu Bin
2005, 42(6):  938-944. 
A parallel, pipelined hardware scheme is proposed for approximate content-based packet classification; it scales to large rule sets and high network rates. By employing a configurable window unit, the error level of approximate matching can be flexibly adjusted. Furthermore, various kinds of approximate matching errors (insertion, deletion, substitution, transposition) can be detected with different structures of the rule combination unit. A probability model of packet matching is also proposed for large-alphabet (Chinese character) environments, which shows that the hardware scheme is practicable.
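The four error types listed in this abstract are the edit operations of the optimal string alignment distance (a restricted form of Damerau-Levenshtein distance). The sketch below is a standard software formulation offered only to illustrate what those errors mean; it is not the paper's window-unit hardware, and the function name osa_distance is hypothetical.

```python
def osa_distance(s, t):
    """Count insertions, deletions, substitutions, and adjacent
    transpositions needed to turn s into t (optimal string alignment)."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                          # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j                          # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,             # deletion
                d[i][j - 1] + 1,             # insertion
                d[i - 1][j - 1] + cost,      # substitution or match
            )
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]
```

A pattern would then count as an approximate match when the distance stays at or below a chosen error level.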
A New Rendering Technology of GPU-Accelerated Radiosity
Hu Wei and Qin Kaihuai
2005, 42(6):  945-950. 
A new rendering technique for GPU-accelerated radiosity is presented in this paper. Exploiting the parallel computation power of current graphics hardware, the method implements the entire classical radiosity solution on the GPU without the participation of the CPU. New OpenGL extensions are used to realize texture traversal, classification, and accumulation, so that the rendering results of the hemi-cube method can be used directly on the GPU. A new Jacobi iteration solution is also proposed, based on the matrix and vector representations on GPUs.
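For readers unfamiliar with the solver mentioned here, the classical radiosity system B_i = E_i + ρ_i Σ_j F_ij B_j can be solved by Jacobi iteration, in which every patch is updated from the previous iterate only. The sketch below is a plain CPU version under that textbook formulation; it is not the paper's GPU method, and the function name, argument names, and fixed iteration count are assumptions.

```python
def jacobi_radiosity(emission, reflectance, form_factors, iterations=50):
    """Iterate B <- E + rho * (F @ B) for a fixed number of sweeps.

    emission[i]        : emitted radiosity E_i of patch i
    reflectance[i]     : diffuse reflectance rho_i (0 <= rho_i < 1)
    form_factors[i][j] : form factor F_ij from patch i to patch j
    """
    n = len(emission)
    b = list(emission)                       # initial guess: emission only
    for _ in range(iterations):
        # Every new value is computed from the old vector b, so all
        # patches can be updated independently (the parallel-friendly part).
        b = [
            emission[i]
            + reflectance[i] * sum(form_factors[i][j] * b[j] for j in range(n))
            for i in range(n)
        ]
    return b
```

For two facing patches with E = [1, 0], ρ = [0.5, 0.5], and F = [[0, 1], [1, 0]], the iteration converges to B = [4/3, 2/3].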
A Hardware-Based PATRICIA Algorithm for Fixed-Length Match
Li Xin, Hu Mingzeng, and Ji Zhenzhou
2005, 42(6):  951-957. 
The PATRICIA algorithm has become a classic method for information retrieval, but PATRICIA insertion is time-consuming. Analysis of PATRICIA shows that not keeping the order of NBTs (next bits to test) in the PATRICIA trie can improve the performance of insertion and decrease hardware design complexity. A new PATRICIA algorithm for fixed-length match is proposed, and it is proved that this algorithm is an optimal binary trie-based algorithm. An ASIC (application specific integrated circuit) implementing this algorithm is built for the state table used in stateful inspection. The theoretical and experimental results show that the algorithm works very well for state tables in gigabit networks.
Design of System Area Network Adapter
Yang Xiaojun, Zhang Peiheng, Miao Yanchao, Sun Ninghui, and Guo Lili
2005, 42(6):  958-964. 
An effective system area network (SAN) adapter is critical to achieving a high-performance cluster system. This paper proposes the design of a SAN adapter based on the Intel IOP310 I/O processor chipset, a universal embedded system. The adapter is part of DCNet, the SAN of the Dawning 4000A cluster. In the adapter architecture, the memory bus is extended to serve as a local bus for system peripheral interconnects, and a network interface unit (NIU) based on the local bus is implemented and embedded. These innovations not only compensate for the lack of a high-performance data channel in the embedded system, but also make efficient use of the memory bus bandwidth and the DMA engine to reduce the latency of data transfer between the host and the network. Furthermore, the Intel IOP310 I/O processor chipset enables the adapter to offload communication protocol processing from the host CPU. The test results show that the adapter achieves communication performance competitive with Myrinet, SCI, and QsNet, and demonstrate that designing a high-performance adapter based on an embedded system is feasible and effective.
Using Multi-Stage Switch Fabric in High Performance Router Design
Guan Jianbo, Sun Zhigang, and Lu Xicheng
2005, 42(6):  965-970. 
Traditional single-stage switch architectures do not scale well, so multi-stage architectures are widely considered in large-scale switching fabric designs. The topology and packet routing style of a multi-stage switching fabric heavily influence its performance. Based on a comparison of several popular k-ary n-cube structures used in MPP systems, it is argued that the 3D Torus network is the most suitable for implementing large switching fabrics. A novel routing algorithm, DMR, is then proposed. It achieves high throughput and high availability by balancing traffic loads across multiple paths while maintaining packet order within a flow. The performance of the DMR routing algorithm is studied by simulation and compared with two other routing algorithms, e-cube routing and random routing. The results show that the performance of the DMR algorithm is almost the same as that of random routing and much better than that of e-cube routing. Unlike random routing, the DMR algorithm also maintains packet order within a flow.
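The e-cube baseline mentioned in this abstract is dimension-order routing: a packet corrects its coordinates one dimension at a time in a fixed order. The sketch below illustrates that baseline on a torus with wraparound links; it does not implement DMR, whose details are not given here. The function name, the choice of the shorter wraparound direction, and the tie-break toward the positive direction are assumptions.

```python
def ecube_route(src, dst, dims):
    """Return the list of nodes visited from src to dst on a torus,
    correcting one dimension at a time in index order (x, then y, then z)."""
    path = [tuple(src)]
    cur = list(src)
    for axis, size in enumerate(dims):
        delta = (dst[axis] - cur[axis]) % size      # hops in the + direction
        step = 1 if delta <= size - delta else -1   # take the shorter way round
        hops = delta if step == 1 else size - delta
        for _ in range(hops):
            cur[axis] = (cur[axis] + step) % size
            path.append(tuple(cur))
    return path
```

Because the path is a deterministic function of source and destination, all packets of a flow follow the same route and stay in order, but the load cannot spread over alternative paths.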
Parallel Communication Protocol Based on Smart NICs
Lin Ji, Zhou Xiaocheng, and Meng Dan
2005, 42(6):  971-978. 
The communication system is an important part of a cluster, and its performance is one of the most critical factors determining the performance of the whole cluster system. As the computing capability of a single node increases, the communication capability of the network needs to be improved correspondingly. An important method of enhancing communication capability is to use multiple cards to handle messages at the same time. In this paper, an implementation of parallel communication based on smart NICs is presented and evaluated with both communication benchmarks and applications. The experimental results show that both communication performance and application performance are better than those of parallel communication based on the RMA mechanism.
Fully Integrated Cluster Operating System: Phoenix
Meng Dan, Zhan Jianfeng, Wang Lei, Tu Bibo, and Zhang Zhihong
2005, 42(6):  979-986. 
This paper defines, from an operating system perspective, a complete layered architecture for cluster system software named Phoenix. The architecture comprises three layers: heterogeneous resources, the cluster operating system kernel, and the user environment. According to the core requirements of different user environments, the components of the cluster operating system kernel, and especially their interrelations, are described and defined. The scalability and fault tolerance of the cluster OS are guaranteed by an improved group structure. The Phoenix system has been installed on the Dawning 4000A, where it is used for system monitoring, system administration, and job management; at present it supports the Linux, AIX, Windows, and Solaris operating systems.
Implementation of Checkpoint System Towards Large Scale Parallel Computing
Zhou Enqiang, Lu Yutong, and Shen Zhiyu
2005, 42(6):  987-992. 
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors for parallel computing. Two bottlenecks, checkpointing protocol overhead and the storage cost of checkpoint images, limit the scalability of a checkpoint system, which is critical to large-scale clusters. To address these issues, the design of the C system is presented, which provides coordinated checkpointing based on dynamic virtual connections and distributed checkpoint image storage for MPI-based parallel applications. Full use is made of certain characteristics of parallel applications and of the capacity of the local disks of the cluster system to reduce the checkpointing cost of large-scale parallel jobs. The C system is suitable for large-scale clusters, and initial experimental results on the cluster testbed show that incorporating the mechanism has a negligible impact on performance.
DCFT-Kernel: A Fault-Tolerant Cluster Middleware Based on Group Service
Huang Wei, Zhan Jianfeng, and Fan Jianpin
2005, 42(6):  993-999. 
High availability and fault tolerance are among the most important criteria for evaluating a cluster system. But as cluster systems grow larger and larger, implementing system software for fault-tolerant management in a cluster becomes a difficult technical problem. In this paper, the group services method is put forward to address the problems of scalability and availability when implementing fault-tolerant management software. The main idea of group services is to divide the cluster system into several small partitions and make every partition fault-tolerant, so that the whole system becomes fault-tolerant. Using group services technology together with real-time event service technology, the fault-tolerant management system software, named DCFT-Kernel, is implemented in the DAWNING-4000A cluster system. The emphasis of this paper is on describing the group services technology, but an introduction to DCFT-Kernel is also provided. Furthermore, some performance evaluations are given.
LUNF—A Cluster Job Scheduling Strategy Using Characterization of Nodes' Failure
Wu Linping, Meng Dan, Liang Yi, Tu Bibo, and Wang Lei
2005, 42(6):  1000-1005. 
Owing to the outstanding scalability of cluster systems, the demand for high performance can be easily met by increasing the number of nodes. But as cluster systems expand in scale, node failures become commonplace in such large-scale systems, and new ways are needed to accommodate them. As an important part of cluster operating system software, job scheduling is responsible for highly efficient resource management and reasonable scheduling of jobs. The function of job scheduling in cluster system is divided into two sub-processes: strategy of job s