Title
Outage Prediction and Diagnosis for Cloud Service Systems
Abstract
With the rapid growth of cloud service systems and their increasing complexity, service failures have become unavoidable. Outages, which are critical service failures, can dramatically degrade system availability and harm user experience. To minimize service downtime and ensure high system availability, we develop an intelligent outage management approach, called AirAlert, which can forecast outages before they happen and diagnose their root causes after they occur. AirAlert works as a global watcher for the entire cloud system: it collects all alerting signals, detects dependencies among signals, and proactively predicts outages that may happen anywhere in the cloud system. We analyze the relationships between outages and alerting signals using a Bayesian network and predict outages using a robust gradient boosting tree-based classification method. The proposed approach is evaluated on an outage dataset collected from a Microsoft cloud system, and the results confirm its effectiveness.
Year: 2019
DOI: 10.1145/3308558.3313501
Venue: WWW '19: The World Wide Web Conference
Keywords: Outage prediction, cloud system, outage diagnosis, service availability, system of systems
Field: Data mining, User experience design, Cloud systems, Computer science, System of systems, Bayesian network, Root cause, Downtime, Distributed computing, Gradient boosting, Cloud computing
DocType: Conference
ISBN: 978-1-4503-6674-8
Citations: 4
PageRank: 0.44
References: 0
Authors: 12
Name | Order | Citations | PageRank
Yujun Chen | 1 | 6 | 1.81
Xian Yang | 2 | 9 | 3.92
Qingwei Lin | 3 | 285 | 27.76
Hongyu Zhang | 4 | 864 | 50.03
Feng Gao | 5 | 53 | 17.81
Zhangwei Xu | 6 | 11 | 2.59
Yingnong Dang | 7 | 537 | 26.92
Dongmei Zhang | 8 | 1439 | 132.94
Hang Dong | 9 | 4 | 1.11
Yong Xu | 10 | 41 | 3.21
Hao Li | 11 | 9 | 3.25
Yu Kang | 12 | 10 | 3.24