Abstract:
The scale of parallel computer systems continues to grow, which poses great challenges to the dependability of both the system and the tasks running on it. Availability encompasses reliability and serviceability, and is therefore the core metric for describing the ability of a massively parallel computer system to deliver correct service. Quantitative evaluation of the availability of massively parallel computer systems is thus significant for system analysis and design. In this paper, user-oriented availability models of a parallel computer system, which take task characteristics and fault-tolerance strategy into account, are established with stochastic activity networks for two different examples: a capability computing application with frequent communication among nodes, and a capacity computing application without communication. The models are built from a node module and a network module, describe the running states of tasks, and use the useful work rate to measure the degree of availability. They include the main factors that influence the availability of a parallel computer system, including failures, hierarchical fault tolerance, fault detection, application characteristics, repair strategy, and fault coverage ratio. The models are then computed and analyzed with actual data. They can evaluate user-oriented availability quantitatively, especially when different tasks run on the same parallel computer system.
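To illustrate why the same machine can show different user-oriented availability for different tasks, the following is a minimal sketch and not the stochastic activity network model of this paper. It assumes independent nodes with constant mean time to failure and mean time to repair, no spare nodes, and no checkpoint or rollback cost. The function names and parameters are hypothetical. Under these assumptions, a capacity task keeps doing useful work on every node that is up, whereas a capability task makes progress only while all of its nodes are up at once.

```python
def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time that a single node is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def capacity_useful_work_rate(n_nodes: int, mttf_hours: float, mttr_hours: float) -> float:
    """Independent tasks without communication (assumed model).

    Each node does useful work whenever it is up, so the fraction of
    node-time spent on useful work equals the single-node availability,
    regardless of n_nodes.
    """
    return node_availability(mttf_hours, mttr_hours)


def capability_useful_work_rate(n_nodes: int, mttf_hours: float, mttr_hours: float) -> float:
    """Tightly coupled task with frequent communication (assumed model).

    The task advances only while all n_nodes are up simultaneously.
    With independent failures and repairs, that probability is the
    single-node availability raised to the power n_nodes.
    """
    return node_availability(mttf_hours, mttr_hours) ** n_nodes


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    n, mttf, mttr = 1000, 10_000.0, 2.0
    print("capacity  :", capacity_useful_work_rate(n, mttf, mttr))
    print("capability:", capability_useful_work_rate(n, mttf, mttr))
```

Even this simplified view shows the useful work rate of the capability task falling as the node count grows while that of the capacity task stays flat, which is the kind of task-dependent difference a user-oriented model is meant to quantify.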