Title
SciSpot: Scientific Computing On Temporally Constrained Cloud Preemptible VMs
Abstract
Scientific computing applications are increasingly being deployed on cloud computing platforms. Transient servers, such as EC2 spot instances and Google Preemptible VMs, can lower the cost of running applications on the cloud by up to $10\times$. However, the frequent preemptions and resource heterogeneity of these transient servers introduce many challenges to using them effectively and efficiently. In this paper, we develop techniques for modeling and mitigating preemptions of transient servers, and present SciSpot, a software framework that enables low-cost scientific computing on the cloud. SciSpot deploys applications on Google Cloud Preemptible Virtual Machines, which exhibit temporally constrained preemptions: VMs are always preempted within a 24-hour interval. Our empirical analysis shows that the preemption rate is generally bathtub shaped, which raises multiple fundamental challenges in performance modeling and policy design. We develop a new reliability model for temporally constrained preemptions, and use statistical mechanics to show why the bathtub shape is generally exhibited. SciSpot’s design is guided by our observation that many emerging scientific computing applications that integrate machine learning with simulations can be deployed as “bags” of jobs, which represent multiple instantiations of the same computation with different physical model parameters. For a bag of jobs, SciSpot finds the optimal transient server on the fly by taking into account the price, performance, and preemption rates of different servers. SciSpot reduces costs by $5\times$ compared to conventional cloud deployments, and reduces makespans by up to $10\times$ compared to conventional high performance computing clusters.
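The abstract says that server selection weighs price, performance, and preemption rate. The sketch below is one simple way such a trade-off could be scored; it is not SciSpot's actual policy. All names (ServerType, expected_cost, cheapest) and all numbers are hypothetical, and the model assumes that a preempted job restarts from scratch, that attempts are independent, and that every attempt is billed for the full job length.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ServerType:
    """Hypothetical description of one candidate transient server type."""
    name: str
    price_per_hour: float  # assumed price in USD per hour
    job_hours: float       # assumed running time of one job on this server
    preempt_prob: float    # assumed probability that a job is preempted before it finishes


def expected_cost(server: ServerType) -> float:
    """Expected cost of completing one job under a restart-from-scratch model.

    With independent attempts, the number of attempts is geometric with
    mean 1 / (1 - p). Each attempt is charged for the full job length,
    which is pessimistic for attempts that are preempted early.
    """
    if not 0.0 <= server.preempt_prob < 1.0:
        raise ValueError("preempt_prob must be in [0, 1)")
    expected_attempts = 1.0 / (1.0 - server.preempt_prob)
    return server.price_per_hour * server.job_hours * expected_attempts


def cheapest(servers: Iterable[ServerType]) -> ServerType:
    """Return the candidate with the lowest expected cost per job."""
    return min(servers, key=expected_cost)


# Illustrative values only; they are not taken from the paper.
candidates = [
    ServerType("small", price_per_hour=0.10, job_hours=4.0, preempt_prob=0.05),
    ServerType("large", price_per_hour=0.40, job_hours=1.0, preempt_prob=0.01),
    ServerType("flaky", price_per_hour=0.05, job_hours=6.0, preempt_prob=0.50),
]
best = cheapest(candidates)
print(best.name, round(expected_cost(best), 4))
```

In this toy setting the server with the lowest hourly price is not the cheapest choice, because long running times and frequent preemptions inflate the expected number of billed hours.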
Year
2022
DOI
10.1109/TPDS.2022.3157272
Venue
IEEE Transactions on Parallel and Distributed Systems
Keywords
Distributed systems, cloud computing, scientific computing
DocType
Journal
Volume
33
Issue
12
ISSN
1045-9219
Citations
0
PageRank
0.34
References
28
Authors
3
Name                   Order   Citations   PageRank
J. C. S. Kadupitiya    1       0           0.34
Vikram Jadhao          2       0           0.34
Prateek Sharma         3       11          3.23