ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (11): 2530-2541.doi: 10.7544/issn1000-1239.2016.20150619

Previous Articles     Next Articles

A Multi-Level Fault Tolerance Mechanism for Disk-Resident Pregel-Like Systems

Bi Yahui1, Jiang Suyang1, Wang Zhigang1, Leng Fangling1, Bao Yubin1, Yu Ge1, Qian Ling2   

  1. 1(College of Computer Science and Engineering, Northeastern University, Shenyang 110819); 2(China Mobile (Suzhou) Software Technology Co, Ltd, Suzhou, Jiangsu 215163)
  • Online:2016-11-01

Abstract: The BSP-based distributed frameworks, such as Pregel, are becoming a powerful tool for handling large-scale graphs, especially for applications with iterative computing frequently. Distributed systems can guarantee a flexible processing capacity by adding computing nodes, however, they also increase the probability of failures. Therefore, an efficient fault-tolerance mechanism is essential. Existing work mainly focuses on the checkpoint policy, including backup and recovery. The former usually backups all graph data, which leads to the cost of writing redundant data since some data are static during iterations. The latter always loads backup data from remote machines to recovery iterations, ignoring the usage of data in the local disk in special scenarios, which incurs network costs. It proposes a multi-level fault tolerant mechanism, which distinguishes failures into computing task failures and node failures, and then designs different strategies for backup and recovery. For the latter, considering that the volume of data involved in computation varies with iterations, a complete backup policy and an adaptive log-based policy are presented to reduce the cost of writing redundant data. After that, at the stages of recovery, we utilize the local graph data and the remote message data to handle the recovery for task failures, but the remote data are used for node failures. Finally, extensive experiments on real datasets validate the efficiency of our solutions.

Key words: fault tolerance, large-scale graph, iterative computing, BSP model, checkpoint

CLC Number: