Title
Delay sensitivity-driven congestion mitigation for HPC systems
Abstract
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. The delay sensitivity of an application quantifies the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism that selectively throttles applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, using common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime, with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving the median system utility by 12%.
Year: 2021
DOI: 10.1145/3447818.3460362
Venue: ICS
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 8
Name | Order | Citations | PageRank
Archit Patke | 1 | 0 | 1.35
Saurabh Jha | 2 | 9 | 2.94
Haoran Qiu | 3 | 0 | 1.35
Jim M. Brandt | 4 | 70 | 10.20
Ann C. Gentile | 5 | 37 | 7.91
Joe Greenseid | 6 | 0 | 0.34
?zg???ner | 7 | 33 | 18.65
Ravishankar K. Iyer | 8 | 3489 | 504.32