ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2018, Vol. 55 ›› Issue (7): 1557-1568.doi: 10.7544/issn1000-1239.2018.20160915

• 软件技术 • 上一篇    下一篇

一种基于分布式存储系统中多节点修复的节点选择算法

刘佩1,2,蒋梓逸1,曹袖1,2   

  1. 1(复旦大学计算机科学与技术学院 上海 201203); 2(网络信息安全审计与监控教育部工程研究中心(复旦大学) 上海 200433) (13210240020@fudan.edu.cn)
  • 出版日期: 2018-07-01

Node Selection Algorithm During Multi-Nodes Repair Progress in Distributed Storage System

Liu Pei1,2, Jiang Ziyi1, Cao Xiu1,2   

  1. 1(School of Computer Science and Technology, Fudan University, Shanghai 201203); 2(Engineering Research Center of Cyber Security Auditing and Monitoring (Fudan University), Ministry of Education, Shanghai 200433)
  • Online: 2018-07-01

摘要: 在分布式存储系统中,如何优化失效数据的修复时间以保证系统的高可靠性,已引起了人们的广泛关注.近几年的研究发现修复过程中不同的节点选择机制对数据的再生时间产生很大的影响,已有工作提出了单节点失效场景下的节点选择SPSN (select provider select newcomer)算法,系统中往往存在多个节点同时修复的情况,此时,SPSN算法巨大的时空开销使得数据的再生时间不再最优.对已有真实系统的失效数据及原因进行统计;基于已有算法特点和修复模型,提出了具有更优的多节点选择B-WSJ(bandwidth based weak and strong judgement)算法.为了更好地描述算法,对带宽中节点的关系进行分类,算法利用节点关系分别实现了修复模型中目标节点的浅度和深度判断,并加入一定的预处理和剪枝策略,最终快速选择出具有较优带宽的节点集合.为了评估B-WSJ算法性能,使用Waxman算法产生网络拓扑,依据FTA(failure trace archive)网站所给的真实系统的节点失效模型进行多次实验,仿真结果表明:B-WSJ算法使得节点修复性能得到了很大的提升.

关键词: 分布式存储系统, 数据修复, 再生时间, 多节点失效, 节点选择

Abstract: In distributed storage systems, how to optimize the regeneration time of lost data so as to keep high reliability has attracted attention increasingly. Recent researches reveal that node selection mechanism during repair progress has great impact on regeneration time. SPSN (select provider select newcomer) algorithm has put forward by some studies, which suits the scenario of single node failure well. However, it is very common to repair many modes at the same time in actual system. In this scenario, SPSN algorithm will no longer be optimal taking large time and space consumption into consideration. By analyzing the data failure trace of real distributed file system, we propose a node selection algorithm B-WSJ (bandwidth based weak and strong judgement) based on the existing algorithms and repairing model with the characteristic of parallelism which is suitable for multi-failure scenario. In order to describe the algorithm better, we firstly define several concepts of node-relationship on a link. Secondly we use these concepts to realize the weak and strong judgment of target node with pre-process and pruning strategy added. Finally, the nodes with better bandwidth will be chosen. To evaluate the performance of NS algorithm, we use Waxman algorithm to generate network topology and do many experiments with node failure models in real system provided by FTA (failure trace archive). The experimental results show the performance of B-WSJ algorithm can be improved greatly compared with the existing algorithms.

Key words: distributed storage system, data repair, regeneration time, multi-node failure, node selection

中图分类号: