ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2019, Vol. 56 ›› Issue (9): 1988-2000. doi: 10.7544/issn1000-1239.2019.20190048

• System Architecture •


张晓阳1, 许佳豪1, 胡燏翀1,2   

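  1. 1(School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074); 2(Shenzhen Huazhong University of Science and Technology Research Institute, Shenzhen, Guangdong 518000)
  • Published: 2019-09-10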

Proactive Locally Repairable Codes for Cloud Storage Systems

Zhang Xiaoyang1, Xu Jiahao1, Hu Yuchong1,2   

  1. 1(School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074); 2(Shenzhen Huazhong University of Science and Technology Research Institute, Shenzhen, Guangdong 518000)
  • Online: 2019-09-10
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61872414, 61502191) and Shenzhen Knowledge Innovation Program (JCYJ20170307172447622).

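Abstract (translated from Chinese): To keep customer data highly available, some cloud storage systems, such as Windows Azure and Facebook's HDFS RAID, have begun to adopt a new family of codes called locally repairable codes (LRC). Compared with Reed-Solomon codes, LRC repairs more efficiently because it divides the data blocks of each stripe into groups and generates one additional parity block per group, so a single failed block can be repaired within its group. LRC assumes that all groups have the same size, which means that repairing any failed block incurs the same amount of intra-group data transfer. However, data blocks lost from disks that are more likely to fail should be repaired more efficiently by the system. Using a decision-tree-based disk failure prediction method to dynamically adjust the group sizes in LRC, we construct a proactive LRC (pLRC) in which the group containing the data blocks stored on a soon-to-fail disk is made smaller, so that these blocks can be repaired faster within a smaller group, while keeping the same storage overhead and code construction as conventional LRC. We analyze the reliability of pLRC through MTTDL modeling, and we also implement pLRC on Facebook's Hadoop HDFS platform and evaluate its performance. The results show that, compared with LRC, pLRC improves reliability by up to 113% and improves degraded read and disk repair performance by up to 46.8% and 47.5%, respectively.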

Keywords (translated from Chinese): cloud storage, locally repairable codes, disk failure, machine learning, decision tree

Abstract: Cloud storage systems, which must give customers reliable access to their data, have begun to adopt a novel family of codes called locally repairable codes (LRC), e.g., Windows Azure Storage and Facebook's HDFS RAID. Compared with Reed-Solomon codes, LRC repairs more efficiently because it divides the data blocks of each stripe into groups, each of which has an additional local parity block, so that a failed block can be repaired locally within one group. LRC assumes that all groups are of equal size, which implies that every failed block is repaired from the same amount of data within its group. However, blocks on disks that are more likely to fail should be repaired more efficiently. In this paper, we present a proactive LRC (pLRC) that predicts disk failures and resizes the groups so that blocks on soon-to-fail disks sit in smaller groups and can be repaired faster, while maintaining the same storage overhead and code construction as LRC. We analyze pLRC through reliability modeling of mean-time-to-data-loss (MTTDL) and also implement pLRC in Facebook's HDFS. The results show that, compared with LRC, pLRC improves reliability by up to 113%, and improves degraded read and disk repair performance by up to 46.8% and 47.5%, respectively.
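
The grouping idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes plain XOR local parities, omits the global parities and the decision-tree predictor, and uses hypothetical helper names (`xor_blocks`, `local_parities`, `repair`, `proactive_groups`) along with an arbitrary choice of 6 data blocks in 2 groups. It only shows that shrinking the group of a block predicted to fail reduces the number of blocks read to repair it, while the number of local parities stays the same.

```python
from functools import reduce
from typing import List, Tuple


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def local_parities(data: List[bytes], groups: List[List[int]]) -> List[bytes]:
    """One XOR parity per group (assumption: real LRCs may use other coefficients)."""
    return [xor_blocks([data[i] for i in group]) for group in groups]


def repair(data: List[bytes], groups: List[List[int]],
           parities: List[bytes], lost: int) -> Tuple[bytes, int]:
    """Rebuild block `lost` from its group; return (block, number of blocks read)."""
    gi = next(k for k, group in enumerate(groups) if lost in group)
    helpers = [data[i] for i in groups[gi] if i != lost] + [parities[gi]]
    return xor_blocks(helpers), len(helpers)


def proactive_groups(k: int, at_risk: int, small: int) -> List[List[int]]:
    """Hypothetical regrouping: a small group holding the at-risk block, and the rest."""
    others = [i for i in range(k) if i != at_risk]
    small_group = sorted([at_risk] + others[:small - 1])
    large_group = others[small - 1:]
    return [small_group, large_group]


if __name__ == "__main__":
    data = [bytes([i] * 4) for i in range(6)]      # 6 toy data blocks
    equal = [[0, 1, 2], [3, 4, 5]]                 # conventional equal-size groups
    resized = proactive_groups(6, at_risk=0, small=2)
    for name, groups in (("equal", equal), ("resized", resized)):
        parities = local_parities(data, groups)
        _, reads = repair(data, groups, parities, lost=0)
        print(name, groups, "blocks read to repair block 0:", reads)
```

The trade-off is visible in the resized layout: blocks left in the larger group need more reads to repair, which is why the resizing is only worthwhile when the failure prediction is reasonably accurate.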

Key words: cloud storage, locally repairable codes (LRC), disk failures, machine learning, decision tree