ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2017, Vol. 54 ›› Issue (12): 2858-2872.doi: 10.7544/issn1000-1239.2017.20160717

Previous Articles    

Criticality Checkpoint Management Strategy Based on RDD Characteristics in Spark

Ying Changtian1,2, Yu Jiong2, Bian Chen2, Wang Weiqing1,3, Lu Liang2, Qian Yurong2   

  1. 1(Postdoctoral Research Station of Electrical Engineering, Xinjiang University, Urumqi 830046); 2(School of Software, Xinjiang University, Urumqi 830008); 3(School of Electrical Engineering, Xinjiang University, Urumqi 830046)
  • Online:2017-12-01

Abstract: The default fault tolerance mechanism of Spark is setting the checkpoint by programmer. When facing data loss, Spark recomputes the tasks based on the RDD lineage to recovery the data. Meanwhile, in the circumstance of complicated application with multiple iterations and large amount of input data, the recovery process may cost a lot of computation time. In addition, the recompute task only considers the data locality by default regardless the computing capabilities of nodes, which increases the length of recovery time. To reduce recovery cost, we establish and demonstrate the Spark execution model, the checkpoint model and the RDD critically model. Based on the theory, the criticality checkpoint management (CCM) strategy is proposed, which includes the checkpoint algorithm, the failure recovery algorithm and the cleaning algorithm. The checkpoint algorithm is used to analyze the RDD charactersitics and its influence on the recovery time, and selects valuable RDDs as checkpoints. The failure recovery algorithm is used to choose the appropriate nodes to recompute the lost RDDs, and cleaning algorithm cleans checkpoints when the disk space becomes insufficient. Experimental results show that: the strategy can reduce the recovery overhead efficiently, select valuable RDDs as checkpoints, and increase the efficiency of disk usage on the nodes with sacrificing the execution time slightly.

Key words: memory computing, Spark, checkpoint management, failure recovery, RDD characteristics

CLC Number: