ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2020, Vol. 57, Issue (2): 306-317. doi: 10.7544/issn1000-1239.2020.20190549

Special topic: Frontier Technologies of Big Data and Intelligent Storage Systems (2020)

• System Architecture •



Proactive Fault Tolerance Based on “Collection—Prediction—Migration—Feedback” Mechanism

Yang Hongzhang1, Yang Yahui1, Tu Yaofeng2, Sun Guangyu3, and Wu Zhonghai1   

  1. School of Software & Microelectronics, Peking University, Beijing 102600; 2. ZTE Corporation, Shenzhen, Guangdong 518057; 3. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871
  • Online: 2020-02-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB1003302), the National Natural Science Foundation of China (61672062), and the Jiangsu Provincial Program of Industrial & Information Industry Transformation (2018GX02517).


Abstract: Hard disk failures are the main source of failures in data centers and seriously affect data reliability. Traditional fault tolerance is usually achieved by adding data redundancy, which has shortcomings. Proactive fault tolerance, which predicts hard disk failures and migrates data ahead of time, has therefore become a research hotspot. However, existing work mostly studies hard disk failure prediction and pays little attention to collection, migration, and feedback, which makes commercial deployment difficult. This paper proposes a full-process proactive fault tolerance mechanism, “Collection-Prediction-Migration-Feedback”, comprising a time-sharing hard disk information collection method, a sliding-window record merging and sample building method, a multi-type hard disk failure prediction method, a multi-disk joint data migration method, and a two-level validation method for prediction results with fast feedback. Test results show that collecting hard disk information affects front-end services by only 0.96%, the recall of hard disk failure prediction reaches 94.66%, and data repair time is 55.10% shorter than with traditional methods. The mechanism has been running stably in commercial use in ZTE’s data centers and meets the core objectives of proactive fault tolerance: high reliability, high intelligence, low interference, low cost, and wide applicability.

Key words: disk failure, storage reliability, fault tolerance, artificial intelligence, operation & maintenance
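
Illustrative note: the abstract names sliding-window record merging and reports prediction quality as recall, but gives no implementation detail. The Python sketch below is only one plausible reading, not the authors' method. It assumes each collected record is a fixed-length list of numeric attributes and that a sample is the concatenation of `window` consecutive records. The function names `sliding_window_samples` and `recall` are hypothetical.

```python
from collections import deque
from typing import Iterable, List, Sequence


def sliding_window_samples(
    records: Sequence[Sequence[float]], window: int
) -> List[List[float]]:
    """Merge every run of `window` consecutive records into one flat sample.

    Assumption: concatenation is used as the merge step; the paper may
    merge records differently (for example with statistics per window).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    samples: List[List[float]] = []
    buf: deque = deque(maxlen=window)  # holds the most recent `window` records
    for rec in records:
        buf.append(list(rec))
        if len(buf) == window:
            # Flatten the buffered records, oldest first.
            samples.append([value for r in buf for value in r])
    return samples


def recall(y_true: Iterable[int], y_pred: Iterable[int]) -> float:
    """Recall = true positives / (true positives + false negatives).

    Label 1 means "disk failed" (y_true) or "failure predicted" (y_pred).
    """
    tp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
    return tp / (tp + fn) if (tp + fn) else 0.0
```

With four two-attribute records and `window=3`, the first function yields two samples of six values each. The reported recall of 94.66% corresponds to `recall` returning about 0.9466 on the authors' test set, which is not reproduced here.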