ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2020, Vol. 57, Issue (2): 306-317. doi: 10.7544/issn1000-1239.2020.20190549

Special Issue: 2020 Special Topic on Frontier Technologies for Big Data and Intelligent Storage Systems

Proactive Fault Tolerance Based on “Collection—Prediction—Migration—Feedback” Mechanism

Yang Hongzhang^1, Yang Yahui^1, Tu Yaofeng^2, Sun Guangyu^3, and Wu Zhonghai^1

  1. School of Software & Microelectronics, Peking University, Beijing 102600
  2. ZTE Corporation, Shenzhen, Guangdong 518057
  3. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871
  • Online: 2020-02-01
  • Supported by:
    This work was supported by the National Key Research and Development Program of China (2018YFB1003302), the National Natural Science Foundation of China (61672062), and the Jiangsu Provincial Program of Industrial & Information Industry Transformation (2018GX02517).

Abstract: Hard disk faults have become the main source of failure in data centers and seriously affect data reliability. Traditional fault-tolerance techniques usually rely on increasing data redundancy, which has some shortcomings. Proactive fault tolerance has become a research hotspot because it can predict hard disk failures and migrate data ahead of time. However, existing work mostly studies hard disk fault prediction and neglects collection, migration, and feedback, which makes commercialization difficult. This paper proposes a whole-process proactive fault tolerance mechanism covering collection, prediction, migration, and feedback. It comprises a time-sharing method for collecting hard disk information, a sliding-window method for merging records and building samples, a multi-type hard disk fault prediction method, a multi-disk joint data migration method, and a two-level method for validating prediction results with fast feedback. Test results show that collecting hard disk information affects front-end threads by only 0.96%, the recall of hard disk fault prediction is 94.66%, and data repair time is 55.10% shorter than with traditional methods. The work has been running stably in ZTE’s data center and meets the objectives of proactive fault tolerance: high reliability, high intelligence, low interference, low cost, and wide applicability.

Key words: disk failure, storage reliability, fault tolerance, artificial intelligence, operation & maintenance

CLC Number:
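
The abstract names a sliding-window step for merging records and building samples, and reports recall as the prediction metric, but gives no details of either. The Python sketch below shows one way such a step could look. The function names `build_samples` and `recall`, the attribute name `realloc`, the window length, and the last/mean/delta features are all illustrative assumptions, not the authors' method.

```python
from collections import deque
from statistics import mean
from typing import Deque, Dict, Iterable, List, Sequence


def build_samples(records: Iterable[Dict[str, float]],
                  window: int = 3) -> List[Dict[str, float]]:
    """Merge each run of `window` consecutive records into one sample.

    Assumption: `records` are per-disk readings in time order and all
    share the same attribute keys (hypothetical SMART-like counters).
    """
    buf: Deque[Dict[str, float]] = deque(maxlen=window)
    samples: List[Dict[str, float]] = []
    for rec in records:
        buf.append(rec)  # the oldest record drops out once the window is full
        if len(buf) < window:
            continue
        sample: Dict[str, float] = {}
        for key in rec:
            values = [r[key] for r in buf]
            sample[f"{key}_last"] = values[-1]               # latest reading
            sample[f"{key}_mean"] = mean(values)             # level over the window
            sample[f"{key}_delta"] = values[-1] - values[0]  # trend over the window
        samples.append(sample)
    return samples


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of truly failing disks that were predicted to fail."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    readings = [{"realloc": 0}, {"realloc": 2}, {"realloc": 4}, {"realloc": 10}]
    for s in build_samples(readings, window=3):
        print(s)
    print(recall([1, 1, 0, 1], [1, 0, 0, 1]))
```

Merging several consecutive readings into one sample lets a classifier see trends, such as a counter that is rising, instead of a single snapshot. Which features and what window length work best would have to be determined on real data.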