Title
Gandalf - An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure.
Abstract
Modern cloud systems have a vast number of components that continuously undergo changes. Deploying these frequent updates quickly without breaking the system is challenging. In this paper, we present Gandalf, an end-to-end analytics service for safe deployment in a large-scale system infrastructure. Gandalf enables rapid and robust impact assessment of software rollouts to catch bad rollouts before they cause widespread outages. Gandalf monitors and analyzes various fault signals and correlates each signal against all the ongoing rollouts using a spatial and temporal correlation algorithm. Its core decision logic includes an ensemble ranking algorithm that determines which rollout caused the fault signals and a binary classifier that assesses the impact of the fault signals. The analysis result determines whether a rollout is safe to proceed or should be stopped. By using a lambda architecture, Gandalf provides both real-time and long-term deployment monitoring with automated decisions and notifications. Gandalf has been running in production in Microsoft Azure for more than 18 months, serving both data-plane and control-plane components. It achieves 92.4% precision and 100% recall (no high-impact service outages in Azure Compute were caused by bad rollouts) for data-plane rollouts. For control-plane rollouts, Gandalf achieves 94.87% precision and 99.84% recall.
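As a rough illustration of the spatial and temporal correlation idea mentioned in the abstract, the toy Python sketch below counts, for each rollout, the fault signals reported on a node shortly after that rollout reached the same node, and then ranks rollouts by that count. It is not the algorithm from the paper: the `Deployment` and `Fault` records, the `rank_rollouts` function, the 60-minute window, and the sample data are all assumptions made for this example, and the simple vote count stands in for the ensemble ranking and classifier the abstract describes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    rollout: str   # hypothetical rollout identifier
    node: str      # node that received the rollout
    time: float    # deployment time, in minutes


@dataclass(frozen=True)
class Fault:
    node: str      # node that reported the fault signal
    time: float    # fault time, in minutes


def rank_rollouts(deployments, faults, window=60.0):
    """Rank rollouts by how many faults follow them on the same node.

    A fault votes for a rollout when it occurs on a node that received
    the rollout (spatial match) no more than `window` minutes after the
    deployment (temporal match). The window length is an assumption.
    """
    votes = Counter()
    for fault in faults:
        for dep in deployments:
            same_node = dep.node == fault.node
            in_window = 0.0 <= fault.time - dep.time <= window
            if same_node and in_window:
                votes[dep.rollout] += 1
    # Rollouts with the most correlated faults come first.
    return votes.most_common()


# Made-up sample data, for illustration only.
deployments = [
    Deployment("rollout-A", "node-1", 0.0),
    Deployment("rollout-A", "node-2", 5.0),
    Deployment("rollout-B", "node-3", 10.0),
]
faults = [Fault("node-1", 12.0), Fault("node-2", 20.0), Fault("node-3", 500.0)]

ranking = rank_rollouts(deployments, faults)
print(ranking)
```

With this sample data, only the two faults that closely follow rollout-A on its own nodes are counted; the late fault on node-3 falls outside the window and does not implicate rollout-B. A production system would also need to weigh fault severity and baseline fault rates before deciding to stop a rollout.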
Year    Venue    DocType
2020    NSDI     Conference
Citations    PageRank    References
0            0.34        0
Authors (11)
Name                   Order    Citations    PageRank
Ze Li                  1        184          20.82
Qian Cheng             2        20           1.07
Ken Hsieh              3        10           0.88
Yingnong Dang          4        537          26.92
Peng Huang             5        0            1.01
Pankaj Singh           6        0            0.34
Xinsheng Yang          7        21           1.43
Qingwei Lin            8        285          27.76
Youjiang Wu            9        10           1.56
Sebastien Levy         10       0            0.68
Murali Chintalapati    11       33           3.40