Abstract | ||
---|---|---|
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. Delay sensitivity of an application is used to quantify the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving the median system utility by 12%. |
Year | DOI | Venue |
---|---|---|
2021 | 10.1145/3447818.3460362 | ICS |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Archit Patke | 1 | 0 | 1.35 |
Saurabh Jha | 2 | 9 | 2.94 |
Haoran Qiu | 3 | 0 | 1.35 |
Jim M. Brandt | 4 | 70 | 10.20 |
Ann C. Gentile | 5 | 37 | 7.91 |
Joe Greenseid | 6 | 0 | 0.34 |
?zg???ner | 7 | 33 | 18.65 |
Ravishankar K. Iyer | 8 | 3489 | 504.32 |