Title
CuWide: Towards Efficient Flow-Based Training for Sparse Wide Models on GPUs
Abstract
Wide models such as generalized linear models and factorization-based models have been used extensively in predictive applications, e.g., recommendation, CTR prediction, and image recognition. Because these models are memory bound, performance improvement on CPUs is reaching its limit. GPUs offer many computation units and high memory bandwidth, making them a promising platform for training machine learning models. However, GPU training of wide models is far from optimal due to their sparsity and irregularity; existing GPU-based implementations are even slower than their CPU counterparts. The classical training schema for wide models is not optimized for the GPU architecture and suffers from a large number of random memory accesses and redundant reads/writes of intermediate values. In this paper, we propose cuWide, an efficient GPU training framework for large-scale wide models. To fully exploit the GPU memory hierarchy, cuWide applies a new flow-based training schema that leverages the spatial and temporal locality of wide models to drastically reduce communication with GPU global memory. To this end, we adopt a bigraph computation model that efficiently realizes the flow-based schema and exposes three flexible programming interfaces. Further, we use a 2D partition of each mini-batch (along the sample and feature dimensions) together with the proposed graph abstraction to optimize GPU memory access for sparse data, and apply several spatial-temporal caching mechanisms (importance-based model caching and cross-stage accumulation caching) to achieve a high-performance kernel. To implement cuWide efficiently, we also propose several GPU-oriented optimizations, including a feature-oriented data layout to enhance data locality, a replication mechanism to reduce update conflicts in shared memory, and multi-stream scheduling to overlap data transfer with kernel computation. We show that cuWide can be more than 20× faster than state-of-the-art GPU and multi-core CPU solutions.
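The abstract mentions multi-stream scheduling to overlap data transfer with kernel computation. Below is a minimal CUDA sketch of that general idea, not cuWide's actual code: mini-batches are double-buffered across two streams so that the host-to-device copy of the next batch overlaps with the kernel running on the current one. The toy kernel, batch sizes, and all identifiers are illustrative assumptions.

```cuda
// Sketch only: double-buffered mini-batch transfer/compute overlap with two CUDA streams.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

__global__ void scaleKernel(float* data, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;   // stand-in for per-batch gradient work
}

int main() {
    const int kBatches = 4, kBatchSize = 1 << 20;          // illustrative sizes
    std::vector<float> host(kBatches * kBatchSize, 1.0f);

    float* dev[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&dev[s], kBatchSize * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    // Pinned host memory is required for truly asynchronous copies.
    float* pinned;
    cudaMallocHost(&pinned, kBatches * kBatchSize * sizeof(float));
    std::copy(host.begin(), host.end(), pinned);

    for (int b = 0; b < kBatches; ++b) {
        int s = b % 2;                                      // alternate between two streams
        cudaMemcpyAsync(dev[s], pinned + b * kBatchSize,
                        kBatchSize * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        scaleKernel<<<(kBatchSize + 255) / 256, 256, 0, stream[s]>>>(dev[s], kBatchSize, 0.5f);
        // The copy issued for the next batch on the other stream can overlap with this kernel.
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) { cudaFree(dev[s]); cudaStreamDestroy(stream[s]); }
    cudaFreeHost(pinned);
    printf("done\n");
    return 0;
}
```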
Year
2022
DOI
10.1109/TKDE.2020.3038109
Venue
IEEE Transactions on Knowledge and Data Engineering
Keywords
Machine learning, wide model, linear model, GPU acceleration, parallel computation, shared memory architecture
DocType
Journal
Volume
34
Issue
9
ISSN
1041-4347
Citations
0
PageRank
0.34
References
31
Authors
7
Name | Order | Citations | PageRank
Xupeng Miao | 1 | 0 | 0.68
Lingxiao Ma | 2 | 11 | 2.86
Zhi Yang | 3 | 371 | 41.32
Yingxia Shao | 4 | 213 | 24.25
Bin Cui | 5 | 1843 | 124.59
Lele Yu | 6 | 70 | 6.93
Jiawei Jiang | 7 | 89 | 14.60