Title
ParaStack: efficient hang detection for MPI programs at large scale
Abstract
While program hangs on large parallel systems can be detected via the widely used timeout mechanism, it is difficult for users to set the timeout: too small a timeout leads to high false alarm rates, and too large a timeout wastes a vast amount of valuable computing resources. To address these problems with hang detection, this paper presents ParaStack, an extremely lightweight tool that detects hangs in a timely manner with high accuracy, negligible overhead, and great scalability, without requiring the user to select a timeout value. For a detected hang, it provides direction for further analysis by telling users whether the hang is the result of an error in the computation phase or the communication phase. For a computation-error induced hang, our tool pinpoints the faulty process by excluding hundreds and thousands of other processes. We have adapted ParaStack to work with the Torque and Slurm parallel batch schedulers and validated its functionality and performance on Tianhe-2 and Stampede, which are respectively the world's current 2nd and 12th fastest supercomputers. Experimental results demonstrate that ParaStack detects hangs in a timely manner at negligible overhead with over 99% accuracy. No false alarms were observed in correct runs taking 66 hours at a scale of 256 processes and 39.7 hours at a scale of 1024 processes. ParaStack accurately reports the faulty process for computation-error induced hangs.
Year
2017
DOI
10.1145/3126908.3126938
Venue
SC
Keywords
MPI programs, timeout mechanism, computation phase, communication phase, computation-error induced hang, efficient hang detection, large parallel systems, ParaStack, Torque and Slurm parallel batch schedulers
Field
Torque, False alarm, Computer science, Parallel computing, Real-time computing, Timeout, Hang, Scalability, Computation, Distributed computing
DocType
Conference
ISSN
2167-4329
ISBN
978-1-4503-5114-0
Citations
3
PageRank
0.41
References
20
Authors
3
Name            Order  Citations  PageRank
Hongbo Li       1      204        30.18
Zizhong Chen    2      924        69.93
Rajiv Gupta     3      4301       364.53