Abstract:
Cloud storage systems, which provide customers the ability to access their data reliably, start to adopt a novel family of codes called locally reparable codes (LRC), e.g., Windows Azure Storage and Facebook’ HDFS RAID. Compared with Reed-Solomon codes, LRC is efficiently repairable since it divides the data blocks of each stripe into groups, each of which has an additional local parity block such that a failed block can be repaired locally in one group. LRC assumes that each group is equal-size which implies that each failed block is repaired from the same amount of data of a group. However, the blocks in the disks which are more likely to fail should be repaired more efficiently. In this paper, we present a proactive LRC (pLRC) via predicting disk failures and resizing the groups such that the recent failed disks can be repaired faster while maintaining the same storage overhead and code construction relative to LRC. We analyze pLRC through the reliability modeling of mean-time-to-data-loss (MTTDL) and also implement pLRC in Facebook’s HDFS. The results show that compared with LRC, pLRC’s reliability can be improved by up to 113%, and its degraded read and disk repair performance can be improved by up to 46.8% and 47.5%, respectively.