Abstract
---
Training state-of-the-art artificial intelligence (AI) models requires scaling to many compute nodes and relies heavily on collective communication operations, such as all-reduce, to exchange the weight gradients between nodes. The overhead of these operations can bottleneck training performance as the number of nodes increases. In this paper, we first characterize the all-reduce operation overhead. Then, we propose a new smart network interface card (NIC) for distributed AI training using field-programmable gate arrays (FPGAs) to accelerate all-reduce operations and optimize bandwidth utilization via data compression. The AI smart NIC frees up the system's compute resources to perform the more compute-intensive tensor operations and increases the overall node-to-node communication efficiency. We build a prototype 6-node AI training system and show that our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6×, with an estimated 2.5× performance improvement at 32 nodes.
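The abstract's central primitive, all-reduce, sums per-node gradient vectors so that every node ends up holding the same combined result. A minimal single-process sketch of the common bandwidth-optimal ring variant is shown below; the function name and simulation structure are illustrative assumptions, not code from the paper.

```python
# Minimal single-process sketch of ring all-reduce, the collective the
# paper accelerates. Each "node" is modeled as a list of gradient
# values; after the operation every node holds the element-wise sum.
# This simulation and its names are illustrative, not the paper's code.

def ring_all_reduce(node_grads):
    """node_grads: one equal-length gradient list per node."""
    n = len(node_grads)
    assert len(node_grads[0]) % n == 0, "vector must split into n chunks"
    c = len(node_grads[0]) // n
    buf = [list(g) for g in node_grads]
    sl = lambda k: slice(k * c, (k + 1) * c)  # indices of chunk k

    # Phase 1, reduce-scatter: in step s, node i sends chunk (i-s) mod n
    # to its ring neighbor, which accumulates it. After n-1 steps, node i
    # holds the fully reduced chunk (i+1) mod n.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, buf[i][sl((i - s) % n)]) for i in range(n)]
        for i, k, data in sends:
            dst = buf[(i + 1) % n]
            for t, v in zip(range(k * c, (k + 1) * c), data):
                dst[t] += v

    # Phase 2, all-gather: circulate the reduced chunks around the ring
    # so every node ends up with the complete summed vector.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, buf[i][sl((i + 1 - s) % n)])
                 for i in range(n)]
        for i, k, data in sends:
            buf[(i + 1) % n][sl(k)] = data
    return buf

# 4 nodes, 4-element gradients: every node ends with the sums.
grads = ring_all_reduce([[1, 2, 3, 4], [5, 6, 7, 8],
                         [9, 10, 11, 12], [13, 14, 15, 16]])
# each node now holds [28, 32, 36, 40]
```

Each ring step moves only 1/n of the gradient vector per link, so per-link traffic, and hence the benefit of the smart NIC's in-line compression, scales directly with gradient size rather than node count.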
Year | DOI | Venue
---|---|---
2022 | 10.1109/LCA.2022.3189207 | IEEE Computer Architecture Letters

Keywords | DocType | Volume
---|---|---
AI training, all-reduce, smart NIC, FPGA | Journal | 21

Issue | ISSN | Citations
---|---|---
2 | 1556-6056 | 0

PageRank | References | Authors
---|---|---
0.34 | 4 | 6

Name | Order | Citations | PageRank |
---|---|---|---|
Rui Ma | 1 | 100 | 20.94 |
Evangelos Georganas | 2 | 2 | 1.04 |
Alexander Heinecke | 3 | 0 | 0.34 |
Sergey Gribok | 4 | 0 | 0.34 |
Andrew Boutros | 5 | 8 | 3.02 |
Eriko Nurvitadhi | 6 | 399 | 33.08 |