petaPar: A Scalable and Fault Tolerant Petascale Free Mesh Simulation System
-
Graphical Abstract
-
Abstract
With the emergence of petaflops (10\+15 FLOPS) systems, numerical simulation has entered a new era—a times opening a possibility of using 10\+4 to 10\+6 processor cores in one single run of parallel computing. In order to take full advantages of the powerfulness of the petaflops and post-petaflops supercomputing infrastructures, two aspects of grand challenges including the scalability and the fault tolerance must be addressed in a domain application. petaPar is a highly scalable and fault tolerant meshfree/particle simulation code dedicated to petascale computing. Two popular particle methods, smoothed particle hydrodynamics (SPH) and material point method (MPM), are implemented in a unified object-oriented framework. The parallelization of both SPH and MPM consistently starts from the domain decomposition of a regular background grid. The scalability of the code is assured by fully overlapping the inter-MPI process communication with computation and a dynamic load balance strategy. petaPar supports both flat MPI and MPI+Pthreads hierarchial parallelization. Application-specific lightweight checkpointing is used in petaPar to deal with the issue of fault tolerance. petaPar is designed to be able to automatically self-restart from any number of MPI processes, allow a dynamic change of computing resources arisen in a scenario of, for example, nodal failure and connection timeout etc. Experiments are performed on the Titan petaflops supercomputer. It is shown that petaPar linearly scales up to 2.6×10\+5 CPU cores with the excellent parallel efficiency of 100% and 96% for the multithreaded SPH and the multithreaded MPM, respectively, and the performance of the multithreaded SPH is improved by up to 30% compared with the flat MPI implementation.
-
-