Title
Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure
Abstract
Maintaining the highest levels of availability for content providers is challenging in the face of scale, network evolution and complexity. Little, however, is known about failures large content providers are susceptible to, and what mechanisms they employ to ensure high availability. From a detailed analysis of over 100 high-impact failure events in a global-scale content provider encompassing several data centers and two WANs, we quantify several dimensions of availability failures. We find that failures are evenly distributed across different network types and planes, but that a large number of failures happen when a management operation is in progress within the network. We discuss some of these failures in detail, and also describe our design principles for high availability motivated by these failures, including using defense in depth, maintaining consistency across planes, failing open on large failures, carefully preventing and avoiding failures, and assessing root cause quickly. Our findings suggest that, as networks become more complicated, failures lurk everywhere, and, counter-intuitively, continuous incremental evolution of the network can, when applied together with our design principles, result in a more robust network.
Year
2016
DOI
10.1145/2934872.2934891
Venue
SIGCOMM
Keywords
Availability, Control Plane, Management Plane
Field
Routing control plane, Design elements and principles, Incremental evolution, Computer science, Computer network, High availability, Root cause, Management plane, Distributed computing
DocType
Conference
Citations
41
PageRank
1.46
References
14
Authors
5
Name              Order  Citations  PageRank
Ramesh Govindan   1      154302144.86
Ina Minei         2      41         1.46
Mahesh Kallahalla 3      41         1.46
Bikash Koley      4      134        11.14
Amin Vahdat       5      10369842.39