Title
CDBB: an NVRAM-based burst buffer coordination system for parallel file systems.
Abstract
For modern HPC systems, failures are treated as the norm instead of exceptions. To avoid rerunning applications from scratch, checkpoint/restart techniques are employed to periodically checkpoint intermediate data to parallel file systems. To increase HPC checkpointing speed, distributed burst buffers (DBB) have been proposed to use node-local NVRAM to absorb the bursty checkpoint data. However, without proper coordination, DBB is prone to suffer from low resource utilization. To solve this problem, we propose an NVRAM-based burst buffer coordination system, named collaborative distributed burst buffer (CDBB). CDBB coordinates all the available burst buffers, based on their priorities and states, to help overburdened burst buffers and maximize resource utilization. We built a proof-of-concept prototype and tested CDBB at the Minnesota Supercomputing Institute. Compared with a traditional DBB system, CDBB can speed up checkpointing by up to 8.4x under medium and heavy workloads and only introduces negligible overhead.
Year
2018
Venue
Simulation Series
Keywords
burst buffer, non-volatile memory, parallel file system, coordination system
Field
Supercomputer, Non-volatile random-access memory, Computer science, Non-volatile memory, Burst buffer, Operating system, Speedup
DocType
Conference
Volume
50
Issue
4
ISSN
0735-9276
Citations
0
PageRank
0.34
References
0
Authors
5
Name                   Order  Citations  PageRank
Ziqi Fan               1      22         3.12
Fenggang Wu            2      16         4.08
Jim Diehl              3      13         2.35
David Hung-Chang Du    4      634        74.40
Doug Voigt             5      11         3.25