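• China Outstanding Sci-Tech Journal
  • CCF-recommended Class A Chinese journal
  • T1-class high-quality sci-tech journal in the computing field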
Huang Yongqin, Jin Lifeng, and Liu Yao. Current Situation and Trend of Reliability Technology in High Performance Computers[J]. Journal of Computer Research and Development, 2010, 47(4): 589-594.

Current Situation and Trend of Reliability Technology in High Performance Computers

More Information
  • Published Date: April 14, 2010
  • As high performance computers (HPC) grow in performance and hardware scale, achieving highly reliable system operation has become a major challenge in the research and development of tera-scale and peta-scale systems. Starting from the reliability requirements of HPC, the authors survey the reliability technologies currently used in HPC hardware design: fault avoidance, static redundancy, dynamic redundancy, and online replacement. Static redundancy covers fault-masking techniques such as component redundancy, data path redundancy, and information redundancy; dynamic redundancy covers fault detection and diagnosis, reconfiguration, and recovery. Combined with online replacement, redundancy can greatly improve system RAS (reliability, availability, serviceability). The paper then analyzes in detail how these technologies are applied in typical IBM, HP, and Cray systems. Finally, it discusses future trends of reliability technology in peta-scale HPC, suggesting that development work should focus on the reliability design of multi-core processors and on comprehensive memory protection, and noting that blade architecture facilitates modular redundancy and online replacement of components.
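The abstract describes static redundancy as a fault-masking technique. As a rough illustration only, the sketch below shows one common form of component redundancy, triple modular redundancy with majority voting, in software. The function names (`majority_vote`, `run_tmr`, `healthy`, `faulty`) are hypothetical and are not taken from the paper, which concerns hardware design; this is a minimal sketch under those assumptions.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises RuntimeError when no value has a strict majority, which is the
    case where static redundancy can no longer mask the fault.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: fault cannot be masked")
    return value


def run_tmr(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


# Hypothetical replicas: two behave correctly, one has a permanent fault.
def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1


# The single faulty replica is outvoted, so the caller sees the correct value.
result = run_tmr([healthy, faulty, healthy], 7)
```

With three replicas, one faulty unit is masked without any detection or recovery step; two disagreeing faulty units would defeat the voter, which is where the dynamic techniques named in the abstract (detection, diagnosis, reconfiguration, recovery) would have to take over.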
