Abstract | ||
---|---|---|
I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing this I/O bottleneck necessitates careful optimization, as optimal data ingestion pipelines differ between systems, and require a delicate balance between access to local storage, external filesystems, and remote nodes. We introduce NoPFS, a machine learning I/O middleware, which provides a scalable, flexible, and easy-to-use solution to the I/O bottleneck. NoPFS uses clairvoyance: Given the seed generating the random access pattern for training with SGD, it can exactly predict when and where a sample will be accessed. We combine this with an analysis of access patterns and a performance model to provide distributed caching policies that adapt to different datasets and storage hierarchies. NoPFS reduces I/O times and improves end-to-end training by up to 5.4× on the ImageNet-1k, ImageNet-22k, and CosmoFlow datasets. |
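
The clairvoyance idea in the abstract can be illustrated with a minimal sketch. This is not NoPFS code: the function names `access_schedule` and `samples_for_worker`, the round-robin assignment of shuffled positions to workers, and the use of Python's `random` module are all assumptions made for illustration. The sketch only shows that, once the seed is fixed, the full per-epoch access order (which worker reads which sample, and at which step) can be computed before training starts.

```python
import random


def access_schedule(seed, num_samples, num_workers, num_epochs):
    """Return {sample_id: [(epoch, worker, step), ...]} for a seeded shuffle.

    Hypothetical helper: assumes one full reshuffle per epoch and
    round-robin sharding of the shuffled order across workers.
    """
    rng = random.Random(seed)  # same seed -> same sequence of shuffles
    schedule = {sample: [] for sample in range(num_samples)}
    for epoch in range(num_epochs):
        order = list(range(num_samples))
        rng.shuffle(order)  # one fresh permutation per epoch
        for position, sample in enumerate(order):
            worker = position % num_workers  # assumed round-robin sharding
            step = position // num_workers   # worker-local step in this epoch
            schedule[sample].append((epoch, worker, step))
    return schedule


def samples_for_worker(schedule, worker):
    """Count how often a worker will read each sample (caching candidates)."""
    counts = {}
    for sample, accesses in schedule.items():
        n = sum(1 for _, w, _ in accesses if w == worker)
        if n:
            counts[sample] = n
    return counts


if __name__ == "__main__":
    plan = access_schedule(seed=42, num_samples=10, num_workers=2, num_epochs=3)
    # Every worker can derive the same plan independently from the seed.
    print(samples_for_worker(plan, worker=0))
```

Because every worker can recompute the same schedule from the seed alone, a caching layer could in principle decide ahead of time which samples to keep in local storage and which to fetch from remote nodes; the policies NoPFS actually uses are described in the paper itself.
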
Year | DOI | Venue |
---|---|---|
2021 | 10.1145/3458817.3476181 | SC |
Keywords | DocType | ISSN |
Deep learning,high-performance computing,I/O | Conference | 2167-4329 |
ISBN | Citations | PageRank |
978-1-6654-8390-2 | 2 | 0.37 |
References | Authors | |
30 | 4 | |
Name | Order | Citations | PageRank |
---|---|---|---|
Nikoli Dryden | 1 | 2 | 0.71 |
Roman Böhringer | 2 | 2 | 0.37 |
Tal Ben-Nun | 3 | 116 | 14.21 |
Torsten Hoefler | 4 | 3 | 1.07 |