Title
CuWide: Towards Efficient Flow-Based Training for Sparse Wide Models on GPUs
Abstract
Wide models such as generalized linear models and factorization-based models have been used extensively in predictive applications, e.g., recommendation, CTR prediction, and image recognition. Because these models are memory bound, performance improvement on CPUs is reaching its limit. GPUs offer many computation units and high memory bandwidth, making them a promising platform for training machine learning models. However, GPU training of wide models is far from optimal due to their sparsity and irregularity; existing GPU-based implementations are even slower than their CPU counterparts. The classical training schema for wide models is not optimized for the GPU architecture and suffers from a large number of random memory accesses and redundant reads/writes of intermediate values. In this paper, we propose cuWide, an efficient GPU training framework for large-scale wide models. To fully exploit the GPU memory hierarchy, cuWide applies a new flow-based training schema that leverages the spatial and temporal locality of wide models to drastically reduce communication with GPU global memory. To this end, we adopt a bigraph computation model that efficiently realizes the flow-based schema and exposes three flexible programming interfaces. Further, we use a 2D partition of each mini-batch (along the sample and feature dimensions) together with the proposed graph abstraction to optimize GPU memory access for sparse data, and apply several spatial-temporal caching mechanisms (importance-based model caching and cross-stage accumulation caching) to achieve a high-performance kernel. To implement cuWide efficiently, we also propose several GPU-oriented optimizations, including a feature-oriented data layout to enhance data locality, a replication mechanism to reduce update conflicts in shared memory, and multi-stream scheduling to overlap data transfer with kernel computation. We show that cuWide can be more than 20× faster than state-of-the-art GPU and multi-core CPU solutions.
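The abstract mentions multi-stream scheduling to overlap data transfer with kernel computation. Below is a minimal CUDA sketch of that general idea, not cuWide's actual code: mini-batches are double-buffered across two streams so that the host-to-device copy of the next batch overlaps with the kernel running on the current one. The toy kernel, batch sizes, and all identifiers are illustrative assumptions.

```cuda
// Sketch only: double-buffered mini-batch transfer/compute overlap with two CUDA streams.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

__global__ void scaleKernel(float* data, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;   // stand-in for per-batch gradient work
}

int main() {
    const int kBatches = 4, kBatchSize = 1 << 20;          // illustrative sizes
    std::vector<float> host(kBatches * kBatchSize, 1.0f);

    float* dev[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&dev[s], kBatchSize * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    // Pinned host memory is required for truly asynchronous copies.
    float* pinned;
    cudaMallocHost(&pinned, kBatches * kBatchSize * sizeof(float));
    std::copy(host.begin(), host.end(), pinned);

    for (int b = 0; b < kBatches; ++b) {
        int s = b % 2;                                      // alternate between two streams
        cudaMemcpyAsync(dev[s], pinned + b * kBatchSize,
                        kBatchSize * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        scaleKernel<<<(kBatchSize + 255) / 256, 256, 0, stream[s]>>>(dev[s], kBatchSize, 0.5f);
        // The copy issued for the next batch on the other stream can overlap with this kernel.
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) { cudaFree(dev[s]); cudaStreamDestroy(stream[s]); }
    cudaFreeHost(pinned);
    printf("done\n");
    return 0;
}
```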
Year
2022
DOI
10.1109/TKDE.2020.3038109
Venue
IEEE Transactions on Knowledge and Data Engineering
Keywords
Machine learning, wide model, linear model, GPU acceleration, parallel computation, shared memory architecture
DocType
Journal
Volume
34
Issue
9
ISSN
1041-4347
Citations
0
PageRank
0.34
References
31
Authors
7
Name | Order | Citations | PageRank
Xupeng Miao | 1 | 0 | 0.68
Lingxiao Ma | 2 | 11 | 2.86
Zhi Yang | 3 | 371 | 41.32
Yingxia Shao | 4 | 213 | 24.25
Bin Cui | 5 | 1843 | 124.59
Lele Yu | 6 | 70 | 6.93
Jiawei Jiang | 7 | 89 | 14.60