ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (11): 2419-2431. doi: 10.7544/issn1000-1239.2020.20190675

• System Architecture •




Survey on Data Updating in Erasure-Coded Storage Systems

Zhang Yao, Chu Jiajia, Weng Chuliang   

  1. (School of Data Science and Engineering, East China Normal University, Shanghai 200062)
  • Online: 2020-11-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61772204, 61732014).


Abstract: In distributed storage systems, node failures are the norm rather than the exception. To keep data highly available, such systems rely on data redundancy, mainly through one of two mechanisms: replication (multiple copies) or erasure coding. As data volumes grow, the multi-copy mechanism yields diminishing benefits, and attention has shifted to erasure codes, which offer higher storage efficiency. However, the complex encoding rules of erasure codes make read, write, and update operations more expensive than they are with multiple copies. Erasure coding is therefore usually reserved for cold or warm data, while hot data, which is accessed and updated frequently, is still stored as multiple copies. This paper focuses on data updates in erasure-coded storage systems. It surveys existing optimizations of erasure-coded updates from three perspectives (disk I/O, network transmission, and system optimization), compares the update performance of representative coding schemes, and discusses future research trends. The analysis shows that current erasure-coded update schemes still cannot match the update performance of multiple copies. How to optimize erasure-coded storage systems, in terms of both update rules and system architecture, so that they can replace the multi-copy mechanism for hot data and reduce the storage overhead of hot data remains an open problem worthy of further study.

Key words: erasure codes, distributed storage systems, data update, multiple copies, storage overhead
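
The trade-off summarized in the abstract (lower storage overhead, but costlier updates) can be illustrated with a minimal sketch. It assumes the simplest possible erasure code, a single XOR parity block over k data blocks; the abstract does not name a specific code, and all function names below are hypothetical. Under that assumption, updating one data block also requires reading the old data and old parity before writing the new data and new parity, whereas a replicated write only overwrites each copy.

```python
from functools import reduce


def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data with `copies` full replicas."""
    return float(copies)


def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a (k, m) code:
    k data blocks plus m parity blocks."""
    return (k + m) / k


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(data_blocks: list[bytes]) -> bytes:
    """Single XOR parity over all data blocks (the m = 1 case, an assumption)."""
    return reduce(xor_blocks, data_blocks)


def delta_update(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Parity-delta update: new_parity = old_parity XOR (old_data XOR new_data).
    Needs two reads (old data, old parity) before the two writes."""
    return xor_blocks(old_parity, xor_blocks(old_data, new_data))


blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # k = 4 data blocks
parity = encode_parity(blocks)

new_block = b"ZZZZ"
updated_parity = delta_update(blocks[1], new_block, parity)
blocks[1] = new_block

# The delta path must agree with re-encoding the whole stripe.
assert updated_parity == encode_parity(blocks)

print(replication_overhead(3))  # 3.0
print(erasure_overhead(4, 1))   # 1.25
```

With these illustrative parameters, three copies store 3 bytes per user byte while the (4, 1) code stores 1.25, which is the storage-efficiency gap the abstract refers to; the extra reads in `delta_update` hint at why updates are the harder case.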