Title
Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters.
Abstract
Availability of the interconnection network in high-performance computing (HPC) systems is fundamental to sustaining the continuous execution of applications at scale. When failures occur, interconnect recovery mechanisms orchestrate complex operations to recover network connectivity between the nodes. As the scale and design complexity of HPC systems increase, so does the system's susceptibility ...
Year: 2018
DOI: 10.1109/TDSC.2017.2737537
Venue: IEEE Transactions on Dependable and Secure Computing
Keywords: Data security, Network security, Fault tolerance, Fault diagnosis, Multiprocessor interconnection, Data analysis
Field: Psychological resilience, Network connectivity, Supercomputer, Computer science, Real-time computing, Fault tolerance, Interconnection, Multiprocessor interconnection, Blue Waters, Distributed computing
DocType: Journal
Volume: 15
Issue: 6
ISSN: 1545-5971
Citations: 2
PageRank: 0.36
References: 0
Authors: 7
Name                    Order  Citations  PageRank
Saurabh Jha             1      3          0.72
Valerio Formicola       2      60         7.90
Catello Di Martino      3      219        14.78
Mark Dalton             4      2          0.36
William T. C. Kramer    5      156        11.36
Zbigniew Kalbarczyk     6      1896       159.48
Ravishankar K. Iyer     7      3489       504.32