Title
ZeRO-Offload: Democratizing Billion-Scale Model Training
Abstract
Large-scale model training has been a playing ground for a limited few, because it often requires complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to the CPU. To preserve compute efficiency, it is designed to minimize data movement to/from the GPU and reduce CPU compute time while maximizing memory savings on the GPU. As a result, ZeRO-Offload can achieve 40 TFlops/GPU on a single NVIDIA V100 GPU for a 10B-parameter model, compared to 30 TFlops using PyTorch alone for a 1.4B-parameter model, the largest that can be trained without running out of GPU memory. ZeRO-Offload is also designed to scale on multiple GPUs when available, offering near-linear speedup on up to 128 GPUs. Additionally, it can work together with model parallelism to train models with over 70 billion parameters on a single DGX-2 box, a 4.5x increase in model size compared to using model parallelism alone. By combining compute and memory efficiency with ease of use, ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU.
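As a rough illustration of the offload pattern the abstract describes (not ZeRO-Offload's actual implementation, which ships with DeepSpeed), the following PyTorch-style sketch keeps the fp16 forward/backward pass on the GPU while fp32 master weights, optimizer states, and the Adam update live in CPU memory; the helper and variable names here are illustrative assumptions.

```python
import torch

# Sketch of the CPU-offload idea (assumed helper, not ZeRO-Offload's API):
# fp16 compute stays on the GPU; fp32 master weights, optimizer states, and
# the optimizer step live on the CPU, so GPU memory holds only parameters,
# activations, and gradients.
def offloaded_step(model, cpu_master_params, cpu_optimizer, loss):
    loss.backward()  # backward pass runs on the GPU

    # Stream gradients to CPU as fp32 (ZeRO-Offload overlaps such transfers
    # with GPU computation).
    for gpu_p, cpu_p in zip(model.parameters(), cpu_master_params):
        cpu_p.grad = gpu_p.grad.detach().to("cpu", dtype=torch.float32)

    cpu_optimizer.step()          # parameter update executes on the CPU
    cpu_optimizer.zero_grad()
    model.zero_grad(set_to_none=True)

    # Copy the updated fp32 masters back into the GPU fp16 parameters.
    with torch.no_grad():
        for gpu_p, cpu_p in zip(model.parameters(), cpu_master_params):
            gpu_p.copy_(cpu_p.to(gpu_p.device, dtype=gpu_p.dtype))

# Hypothetical setup: fp16 model on the GPU, fp32 master copies and Adam on CPU.
model = torch.nn.Linear(1024, 1024).half().cuda()
cpu_master_params = [p.detach().float().cpu().requires_grad_(True)
                     for p in model.parameters()]
cpu_optimizer = torch.optim.Adam(cpu_master_params, lr=1e-4)

x = torch.randn(8, 1024, device="cuda", dtype=torch.float16)
loss = model(x).float().pow(2).mean()
offloaded_step(model, cpu_master_params, cpu_optimizer, loss)
```

In the actual system, the CPU-GPU transfers are overlapped with GPU computation and the CPU-side update uses an optimized Adam implementation, which is how ZeRO-Offload preserves throughput without requiring any change to the model itself.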
Year: 2021
Venue: Proceedings of the 2021 USENIX Annual Technical Conference
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 8

Name                      Order  Citations  PageRank
Jie Ren                   1      51         17.62
Samyam Rajbhandari        2      23         3.79
Reza Yazdani Aminabadi    3      0          0.68
Olatunji Ruwase           4      167        14.40
Shuangyan Yang            5      0          0.34
Minjia Zhang              6      2          4.08
Dong Li                   7      764        48.56
Yuxiong He                8      666        40.52