Title
In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle
Abstract
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study on existing data shuffling strategies, which reveals that all existing strategies have room for improvement: they suffer in terms of either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that obtained when a full shuffle is performed. We provide a non-trivial theoretical analysis of CorgiPile on its convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile can achieve a convergence rate comparable to that of full-shuffle-based SGD, and is 1.6x-12.8x faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
Year
2022
DOI
10.1145/3514221.3526150
Venue
Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22)
Keywords
In-database machine learning, Stochastic Gradient Descent, Shuffle
DocType
Conference
ISSN
0730-8078
Citations
0
PageRank
0.34
References
0
Authors
12
Name            Order  Citations  PageRank
Lijie Xu        1      0          0.34
Shuang Qiu      2      0          0.34
Binhang Yuan    3      0          1.69
Jiawei Jiang    4      89         14.60
Cèdric Renggli  5      9          4.23
Shaoduo Gan     6      0          0.34
Kaan Kara       7      0          0.34
Guoliang Li     8      3077       154.70
Ji Liu          9      1352       77.54
Wentao Wu       10     394        30.53
Jieping Ye      11     0          0.34
Ce Zhang        12     803        83.39