Title
Improving Congestion Control through Fine-Grain Monitoring of InfiniBand Networks
Abstract
Congestion is a serious threat to the performance of the interconnection networks of High-Performance Computing and Data-Center systems. Hence, the specifications of the main interconnect technologies, such as InfiniBand, define mechanisms to deal with congestion and its effects. However, these standard mechanisms may not accurately detect or track the actual status of network congestion, because congestion dynamics can be complex and varied. Moreover, finding an optimal configuration of the parameters that drive the congestion-control mechanisms is often difficult, as a configuration suitable for some traffic scenarios may not suit others. In this paper, we propose combining an existing light-weight platform monitoring tool (LIMITLESS) with the InfiniBand control software (OpenSM), so that the metrics about communication volumes in the network provided by the former give the latter a more precise picture of congestion status, allowing it to react more efficiently in these situations. The main contributions of this paper are the methodology to link the monitor and OpenSM, and modifications to the InfiniBand standard congestion-control mechanism so that its reaction is modulated by the enhanced knowledge about congestion provided by the monitor. These improvements are ready to be integrated into any InfiniBand-based system. According to the results of our experiments (performed on a real InfiniBand-based cluster running a widely used benchmark), the proposed approach significantly reduces the number of wrong detections of congestion, and therefore the number of times that the congestion-control mechanisms react unnecessarily, improving system performance by up to 74%. The overhead of the monitoring tool is 0.1% in our experiments, with data collected every 200 ms.
Year
2022
DOI
10.1109/HOTI55740.2022.00020
Venue
2022 IEEE Symposium on High-Performance Interconnects (HOTI)
Keywords
Interconnection networks, cluster, congestion control, traffic monitoring, InfiniBand
DocType
Conference
ISSN
1550-4794
ISBN
978-1-6654-8680-4
Citations
0
PageRank
0.34
References
9
Authors
8