Title | ||
---|---|---|
rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks |
Abstract | ||
---|---|---|
Parallel scientific applications that execute on high performance computing (HPC) systems often contain large and computationally intensive parallel loops. The independent loop iterations of such applications represent independent tasks. Dynamic load balancing (DLB) is used to achieve a balanced execution of such applications. However, most of the self-scheduling-based techniques that are typically used to achieve DLB are not robust against component (e.g., processors, network) failures or perturbations that arise on large HPC systems. The self-scheduling-based techniques that tolerate failures and/or perturbations rely on the existence of fault- and/or perturbation-detection mechanisms to trigger the rescheduling of tasks scheduled onto failed and/or perturbed components. This work proposes a novel robust dynamic load balancing (rDLB) approach for the robust self-scheduling of scientific applications with independent tasks on HPC systems under failures and/or perturbations. rDLB proactively reschedules already allocated tasks and requires no detection of failures or perturbations. Moreover, rDLB is integrated into an MPI-based DLB library. An analytical model of rDLB shows that, for a fixed problem size, the fault-tolerance overhead decreases linearly with the number of processors. The experimental evaluation shows that applications using rDLB tolerate up to P-1 worker processor failures (P is the number of processors allocated to the application) and that their performance in the presence of perturbations improved by a factor of 7 compared to the case without rDLB. Moreover, the robustness of applications against perturbations (i.e., flexibility) is boosted by a factor of 30 using rDLB compared to the case without rDLB. |
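The proactive rescheduling described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the authors' MPI-based implementation: the function name `simulate_rdlb`, the chunk size of one task, and the round-based timing are all hypothetical choices made for illustration. Failed workers accept one task and then go silent; the master never detects this and simply re-issues scheduled-but-unfinished tasks to whichever workers still request work.

```python
from collections import deque


def simulate_rdlb(num_tasks, num_workers, failed):
    """Toy, round-based model of proactive rescheduling (hypothetical sketch).

    `failed` is the set of worker ids that take one task and never finish it.
    Returns (set of finished task ids, total number of task assignments).
    """
    failed = set(failed)
    alive = [w for w in range(num_workers) if w not in failed]
    if not alive:
        raise ValueError("at least one worker must survive")

    unscheduled = deque(range(num_tasks))  # tasks never handed out so far
    scheduled = []                         # tasks handed out at least once
    finished = set()
    assignments = 0

    def next_task():
        """Master policy: new work first, then re-issue unfinished work."""
        nonlocal assignments
        if unscheduled:
            task = unscheduled.popleft()
            scheduled.append(task)
        else:
            pending = [t for t in scheduled if t not in finished]
            if not pending:
                return None  # nothing left to hand out
            task = pending[0]  # duplicate of an already allocated task
        assignments += 1
        return task

    # Each failing worker grabs one task and then stops responding.
    for _ in failed:
        next_task()

    # Surviving workers keep requesting work; no failure detection is used.
    while len(finished) < num_tasks:
        current = [next_task() for _ in alive]
        finished.update(t for t in current if t is not None)

    return finished, assignments


if __name__ == "__main__":
    # P = 4 workers, P - 1 = 3 of them fail.
    done, issued = simulate_rdlb(num_tasks=10, num_workers=4, failed={1, 2, 3})
    print(len(done), "tasks finished with", issued, "assignments")
```

In this sketch the extra assignments beyond `num_tasks` play the role of the fault-tolerance overhead: they are duplicated executions of tasks that were already allocated, issued only once no unscheduled work remains.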
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/HPCS48598.2019.9188153 | 2019 International Conference on High Performance Computing & Simulation (HPCS) |
Keywords | DocType | ISBN |
independent tasks,dynamic load balancing,self-scheduling,robustness,fail-stop failures,perturbations,high performance computing | Conference | 978-1-7281-4485-6 |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Ali Mohammed | 1 | 3 | 0.74 |
Aurélien Cavelan | 2 | 0 | 0.34 |
Florina M. Ciorba | 3 | 125 | 22.96 |