一种基于分布式存储系统中多节点修复的节点选择算法

刘佩; 蒋梓逸; 曹袖

doi:10.7544/issn1000-1239.2018.20160915

一种基于分布式存储系统中多节点修复的节点选择算法

Node Selection Algorithm During Multi-Nodes Repair Progress in Distributed Storage System

摘要

摘要: 在分布式存储系统中，如何优化失效数据的修复时间以保证系统的高可靠性，已引起了人们的广泛关注.近几年的研究发现修复过程中不同的节点选择机制对数据的再生时间产生很大的影响，已有工作提出了单节点失效场景下的节点选择SPSN (select provider select newcomer)算法，系统中往往存在多个节点同时修复的情况，此时，SPSN算法巨大的时空开销使得数据的再生时间不再最优.对已有真实系统的失效数据及原因进行统计；基于已有算法特点和修复模型，提出了具有更优的多节点选择B-WSJ(bandwidth based weak and strong judgement)算法.为了更好地描述算法，对带宽中节点的关系进行分类，算法利用节点关系分别实现了修复模型中目标节点的浅度和深度判断，并加入一定的预处理和剪枝策略，最终快速选择出具有较优带宽的节点集合.为了评估B-WSJ算法性能，使用Waxman算法产生网络拓扑，依据FTA(failure trace archive)网站所给的真实系统的节点失效模型进行多次实验，仿真结果表明：B-WSJ算法使得节点修复性能得到了很大的提升.

Abstract: In distributed storage systems, how to optimize the regeneration time of lost data so as to keep high reliability has attracted attention increasingly. Recent researches reveal that node selection mechanism during repair progress has great impact on regeneration time. SPSN (select provider select newcomer) algorithm has put forward by some studies, which suits the scenario of single node failure well. However, it is very common to repair many modes at the same time in actual system. In this scenario, SPSN algorithm will no longer be optimal taking large time and space consumption into consideration. By analyzing the data failure trace of real distributed file system, we propose a node selection algorithm B-WSJ (bandwidth based weak and strong judgement) based on the existing algorithms and repairing model with the characteristic of parallelism which is suitable for multi-failure scenario. In order to describe the algorithm better, we firstly define several concepts of node-relationship on a link. Secondly we use these concepts to realize the weak and strong judgment of target node with pre-process and pruning strategy added. Finally, the nodes with better bandwidth will be chosen. To evaluate the performance of NS algorithm, we use Waxman algorithm to generate network topology and do many experiments with node failure models in real system provided by FTA (failure trace archive). The experimental results show the performance of B-WSJ algorithm can be improved greatly compared with the existing algorithms.

HTML全文

参考文献(0)

施引文献

资源附件(0)