Title
The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems
Abstract
Node downtime and failed jobs in a computing cluster translate into wasted resources and user dissatisfaction, so understanding why nodes and jobs fail in HPC clusters is essential. This paper analyzes node and job failures in two university-wide computing clusters at two Tier I US research universities. We analyzed approximately 3.0M job executions on System A and 2.2M on System B, drawing on accounting logs, resource usage data for all primary local and remote resources (memory, IO, network), and node failure data. We observe different kinds of correlations between failures and resource usage, and propose a job failure prediction model that triggers event-driven checkpointing to avoid wasted work. Additionally, we present resource usage and runtime prediction models based on user history. If their predictions are used to make more informed scheduling decisions, these models have the potential to avoid system-related issues such as contention and to improve quality of service, for example by lowering mean queue time. As a proof of concept, we simulate an EASY backfill scheduler that uses the predictions of one of these models, the runtime model, and show the improvement in terms of lower mean queue time. From these observations, we derive generalizable insights for cluster management to improve reliability; for example, in some execution environments local contention dominates, while in others system-wide contention dominates.
Year: 2020
DOI: 10.1109/DSN48063.2020.00034
Venue: 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Keywords: HPC, Production failure data, Data analytics, Compute clusters
DocType: Conference
ISSN: 1530-0889
ISBN: 978-1-7281-5810-5
Citations: 0
PageRank: 0.34
References: 55
Authors: 10
Name | Order | Citations | PageRank
Rakesh Kumar | 1 | 0 | 0.34
Saurabh Jha | 2 | 13 | 2.61
Ashraf Mahgoub | 3 | 1 | 1.03
Rajesh Kalyanam | 4 | 4 | 4.75
Stephen L. Harrell | 5 | 32 | 4.40
Xiaohui Carol Song | 6 | 0 | 0.34
?zg???ner | 7 | 33 | 18.65
William T. Kramer | 8 | 0 | 1.01
Ravishankar K. Iyer | 9 | 3489 | 504.32
Saurabh Bagchi | 10 | 2022 | 144.72