Title
Quantifying the impact of network congestion on application performance and network metrics
Abstract
In modern high-performance computing (HPC) systems, network congestion is an important contributor to performance degradation. However, how network congestion affects application performance is not fully understood. Because the Aries network, a recent HPC network architecture featuring a dragonfly topology, is equipped with network counters that measure packet transmission statistics on each router, these network metrics can potentially be used to understand network performance. In this work, through experiments on a large HPC system, we quantify the impact of network congestion on the performance of various applications in terms of execution time, and we correlate application performance with network metrics. Our results demonstrate diverse impacts of network congestion: while applications with intensive MPI operations (such as HACC and MILC) see their execution times extended by more than 40% under network congestion, applications with less intensive MPI operations (such as Graph500 and HPCG) are mostly unaffected. We also demonstrate that a stall-to-flit ratio metric derived from Aries network counters is positively correlated with performance degradation and can therefore serve as an indicator of network congestion in HPC systems.
Year
2020
DOI
10.1109/CLUSTER49012.2020.00026
Venue
2020 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords
HPC, network congestion, network counters
DocType
Conference
ISSN
1552-5244
ISBN
978-1-7281-6678-0
Citations
1
PageRank
0.35
References
14
Authors
5
Name | Order | Citations | PageRank
Yijia Zhang | 1 | 113 | 14.67
Taylor L. Groves | 2 | 26 | 7.20
Brandon Cook | 3 | 1 | 0.35
Nicholas J. Wright | 4 | 408 | 27.79
Ayse K. Coskun | 5 | 573 | 33.55