Abstract:
Distributed system provides an effective method to build high performance computing system with relative low cost. During the running of distributed computing, deadlock would happen if the resource allocation and requirement confliction occur. All the processes are waiting each other for the holding resources releasing, and then the system running stops. In an unreliable distributed system, failures may prevent deadlock detection algorithms from properly detecting deadlocks. Few of the algorithms proposed in the literatures address the issue of handing these failures. In this paper, three types of failures are identified and a fault tolerance deadlock detection and resolution algorithm is proposed. Failures are treated as different detection termination conditions in the algorithm. The algorithm is based on a generalized AND-OR model and employs diffusion computation and centralized reduction methods to detect deadlocks. The algorithm distinguishes cycles and knots and gives all members of a cycle. The upper limit of the time and space complexity is d and e+n-1 respectively if the deadlock topology is a static cycle, where e is the number of the edge and n and d are the number of the nodes in the wait-for graph. The performance of the proposed algorithm is equal to or better than that of the current algorithms.