Title
Minimizing Thermal Variation in Heterogeneous HPC Systems with FPGA Nodes
Abstract
The presence of FPGAs in data centers has been growing because of their superior performance as accelerators. Thermal management, particularly containing cooling costs in these high-performance systems, is a primary concern. Introducing new heterogeneous components further complicates thermal modeling and management, yet the thermal behavior of multi-FPGA systems deployed within large compute clusters remains largely unexplored. In this paper, we first show that FPGAs of the same generation running the same tasks can exhibit different thermal behavior because of their physical locations in the rack and process variation. We present a machine-learning-based model that captures the thermal behavior of a multi-node FPGA cluster. We then propose to mitigate thermal variation and hotspots across the cluster through proactive task placement guided by this thermal model. Our experiments show that by properly placing tasks on the multi-FPGA system, we can reduce the peak temperature by up to 11.50°C with no impact on performance.
Year: 2018
DOI: 10.1109/ICCD.2018.00086
Venue: 2018 IEEE 36th International Conference on Computer Design (ICCD)
Keywords: Thermal Modeling, HPC, Task Placement
Field: Thermal model, Cluster (physics), Thermal, Rack, Hotspot (geology), Computer science, Parallel computing, Thermal variation, Field-programmable gate array, Process variation, Distributed computing
DocType: Conference
ISSN: 1063-6404
ISBN: 978-1-5386-8478-8
Citations: 1
PageRank: 0.35
References: 9
Authors: 6
Name | Order | Citations | PageRank
Yingyi Luo | 1 | 1 | 0.69
Xiaoyang Wang | 2 | 1 | 0.35
Seda Öǧrenci Memik | 3 | 488 | 42.57
Gokhan Memik | 4 | 1694 | 111.88
Kazutomo Yoshii | 5 | 249 | 18.53
Pete Beckman | 6 | 822 | 48.04