Bi Yahui, Jiang Suyang, Wang Zhigang, Leng Fangling, Bao Yubin, Yu Ge, Qian Ling. A Multi-Level Fault Tolerance Mechanism for Disk-Resident Pregel-Like Systems[J]. Journal of Computer Research and Development, 2016, 53(11): 2530-2541. DOI: 10.7544/issn1000-1239.2016.20150619

A Multi-Level Fault Tolerance Mechanism for Disk-Resident Pregel-Like Systems

  • Published Date: October 31, 2016
  • BSP-based distributed frameworks such as Pregel are becoming a powerful tool for processing large-scale graphs, especially for applications that iterate frequently. Distributed systems can scale their processing capacity flexibly by adding computing nodes, but doing so also increases the probability of failures, so an efficient fault-tolerance mechanism is essential. Existing work mainly focuses on the checkpoint policy, which covers backup and recovery. Backup usually writes all graph data, which incurs the cost of writing redundant data because some data remain static across iterations. Recovery always loads backup data from remote machines to recover iterations, ignoring the data on the local disk that can be used in special scenarios, which incurs network costs. This paper proposes a multi-level fault-tolerance mechanism that classifies failures into computing task failures and node failures, and then designs different strategies for backup and recovery. For backup, considering that the volume of data involved in computation varies across iterations, a complete backup policy and an adaptive log-based policy are presented to reduce the cost of writing redundant data. For recovery, local graph data and remote message data are used to handle task failures, whereas remote data are used for node failures. Finally, extensive experiments on real datasets validate the efficiency of the proposed solutions.
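The recovery rule stated in the abstract (task failures use local graph data plus remote message data; node failures use remote data) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `FailureType`, `Source`, `RecoveryPlan`, and `plan_recovery` are hypothetical, and the sketch assumes that a task failure leaves the machine's local disk readable while a node failure does not.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    # Assumption: a failed task leaves its machine's local disk readable.
    TASK = "task"
    # Assumption: a failed node takes its local disk offline with it.
    NODE = "node"


class Source(Enum):
    LOCAL_DISK = "local_disk"
    REMOTE_BACKUP = "remote_backup"


@dataclass(frozen=True)
class RecoveryPlan:
    """Where each kind of data is read from during recovery (hypothetical type)."""
    graph_data: Source
    message_data: Source


def plan_recovery(failure: FailureType) -> RecoveryPlan:
    """Pick data sources per failure level, following the rule in the abstract."""
    if failure is FailureType.TASK:
        # Task failure: reuse local graph data, fetch only messages remotely.
        return RecoveryPlan(graph_data=Source.LOCAL_DISK,
                            message_data=Source.REMOTE_BACKUP)
    # Node failure: nothing local is assumed to survive, so read everything remotely.
    return RecoveryPlan(graph_data=Source.REMOTE_BACKUP,
                        message_data=Source.REMOTE_BACKUP)


if __name__ == "__main__":
    for failure in FailureType:
        print(failure.name, plan_recovery(failure))
```

Under these assumptions, the saving for task failures comes from the graph data never crossing the network; only the message data does.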
