Title
Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs
Abstract
Nowadays, it is common to train deep learning (DL) models on cloud-native platforms that actively leverage containerization and orchestration technologies for high elasticity, low and flexible operating cost, and many other benefits. However, this setting also brings new challenges, and our work focuses on those related to I/O throughput for training: complex data access paths that require complicated performance tuning, insufficient cache capacity on specialized hardware to match the high and dynamic I/O demand, and inefficient sharing of I/O resources across training jobs. We propose Fluid, a cloud-native platform that provides DL training jobs with a data abstraction, called Fluid Dataset, for accessing training data from heterogeneous sources in a unified manner, with transparent and elastic data acceleration powered by auto-tuned cache runtimes. Fluid also includes an on-the-fly cache autoscaler that intelligently scales cache capacity up and down to match the online training speed of each individual DL job. To improve the overall performance of multiple DL jobs, Fluid can co-orchestrate data caches and DL jobs by scheduling jobs in an appropriate order. Our experimental results show significant performance improvements for individual DL jobs running with dynamic computing resources under Fluid. In addition, when scheduling multiple DL jobs that share the same datasets, Fluid delivers around a 2x speedup when integrated with existing widely used and cutting-edge scheduling solutions. Fluid is now an open-source project hosted by the Cloud Native Computing Foundation (CNCF), with production adopters including Alibaba Cloud, Tencent Cloud, Weibo.com, and China Telecom.
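As a concrete illustration of the elastic cache idea summarized above, the sketch below shows a simplified scale-up/scale-down rule that sizes a cache tier to a training job's observed I/O demand. This is only a minimal sketch under our own assumptions; the function name, parameters, and thresholds are hypothetical and do not represent Fluid's actual autoscaling algorithm or API.

```python
import math

# Illustrative sketch only: a simplified cache autoscaling rule in the spirit of
# the on-the-fly cache autoscaler described in the abstract. All names and
# thresholds here are assumptions for illustration, not the paper's method.

def desired_cache_workers(observed_io_gbps: float,
                          required_io_gbps: float,
                          per_worker_gbps: float,
                          current_workers: int,
                          max_workers: int = 16) -> int:
    """Pick a cache-worker count so aggregate cache bandwidth roughly matches
    the training job's current I/O demand (grow when starved, shrink when idle)."""
    if required_io_gbps > observed_io_gbps:
        # Training is I/O bound: grow the cache tier to cover the demand.
        target = math.ceil(required_io_gbps / per_worker_gbps)
    elif required_io_gbps <= 0.8 * observed_io_gbps:
        # Training consumes well below what the cache can serve: shrink it.
        target = math.ceil(required_io_gbps / per_worker_gbps)
    else:
        # Demand and supply are balanced: keep the current size.
        target = current_workers
    return max(1, min(max_workers, target))


if __name__ == "__main__":
    # Example: a job needing 6 GB/s while each cache worker serves ~1.5 GB/s.
    print(desired_cache_workers(observed_io_gbps=3.0, required_io_gbps=6.0,
                                per_worker_gbps=1.5, current_workers=2))  # -> 4
```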
Year
2022
DOI
10.1109/ICDE53745.2022.00209
Venue
2022 IEEE 38th International Conference on Data Engineering (ICDE 2022)
Keywords
dataset abstraction, elastic data cache, job scheduling, cloud native
DocType
Conference
ISSN
1084-4627
Citations
0
PageRank
0.34
References
0
Authors
11
Name          Order  Citations  PageRank
Rong Gu       1      110        17.77
Kai Zhang     2      0          0.34
Zhihao Xu     3      0          0.68
Yang Che      4      0          0.34
Bin Fan       5      0          0.34
Haojun Hou    6      0          0.34
Haipeng Dai   7      419        55.44
Li Yi         8      0          0.34
Yu Ding       9      0          0.34
Guihai Chen   10     3537       317.28
Yihua Huang   11     167        22.07