Title
Failure recovery: when the cure is worse than the disease
Abstract
Cloud services inevitably fail: machines lose power, networks become disconnected, pesky software bugs cause sporadic crashes, and so on. Unfortunately, failure recovery itself is often faulty; e.g. recovery can accidentally recursively replicate small failures to other machines until the entire cloud service fails in a catastrophic outage, amplifying a small cold into a contagious deadly plague! We propose that failure recovery should be engineered foremost according to the maxim of primum non nocere, that it "does no harm." Accordingly, we must consider the system holistically when failure occurs and recover only when observed activity safely allows for it.
Year
2013
Venue
HotOS
Keywords
failure recovery, pesky software bug, small failure, primum non nocere, sporadic crash, catastrophic outage, entire cloud service, cloud service, observed activity, small cold
Field
Computer science, Computer security, Harm, Software bug, Maxim, Primum non nocere, Recursion, Cloud computing
DocType
Conference
Citations
18
PageRank
0.73
References
14
Authors
11
Name             Order  Citations  PageRank
Zhenyu Guo       1      512        39.61
Sean McDirmid    2      175        13.55
Mao Yang         3      496        30.94
Li Zhuang        4      238        10.65
Pu Zhang         5      18         0.73
Yingwei Luo      6      315        41.30
Tom Bergan       7      18         0.73
Peter Bodík      8      1182       51.66
Madan Musuvathi  9      116        7.62
Zheng Zhang      10     1193       73.82
Lidong Zhou      11     2136       147.82