Title
Modeling the Impact of Checkpoints on Next-Generation Systems
Abstract
The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands of processors. For application-driven, periodic checkpoint operations, the state of the art does not provide a solution that scales to next-generation systems. We demonstrate this by using mathematical modeling to compute a lower bound on the impact of these approaches on the performance of applications executed on three massive-scale, in-production DOE systems and a theoretical petaflop system. We also adapt the model to investigate a proposed optimization that makes use of "lightweight" storage architectures and overlay networks to overcome the storage system bottleneck. Our results indicate that (1) as we approach the scale of next-generation systems, traditional checkpoint/restart approaches will increasingly impact application performance, accounting for over 50% of total application execution time; (2) although our alternative approach improves performance, it has limitations of its own; and (3) there is a critical need for new approaches to fault tolerance that allow continuous computing with minimal impact on application scalability.
Year
2007
DOI
10.1109/MSST.2007.24
Venue
MSST
Keywords
mathematical modeling, mathematical model, massively parallel processing, mathematical models, fault tolerant, overlay network, parallel processing, information systems, overlay networks, storage system, lower bound
Field
Information system, Bottleneck, Computer science, Computer data storage, Massively parallel, Fault tolerance, Overlay network, Memory architecture, Distributed computing, Scalability
DocType
Conference
ISBN
0-7695-3025-7
Citations
52
PageRank
2.49
References
27
Authors
7
Name                 Order  Citations  PageRank
Ron Oldfield         1      408        18.71
Sarala Arunagiri     2      60         4.32
Patricia J. Teller   3      290        27.72
Seetharami Seelam    4      115        12.71
Maria Ruiz Varela    5      58         3.29
Rolf Riesen          6      636        52.64
Philip C. Roth       7      741        49.60