Title
Fighting The Fog Of War: Automated Incident Detection For Cloud Systems
Abstract
Incidents and outages dramatically degrade the availability of large-scale cloud computing systems such as AWS, Azure, and GCP. In current incident response practice, each team has only a partial view of the entire system, which makes detecting incidents like fighting in the "fog of war". As a result, mitigation takes longer and financial losses grow. In this work, we propose an automatic incident detection system, Warden, as part of the Incident Management (IcM) platform. Warden collects alerts from different services and detects the occurrence of incidents from a global perspective. For each potential incident it detects, Warden notifies the relevant on-call engineers so that they can properly prioritize their tasks and initiate cross-team collaboration. We implemented and deployed Warden in the IcM platform of Azure. Our evaluation, based on data collected over an 18-month period from 26 major services, shows that Warden is effective and outperforms the baseline methods. For the majority of successfully detected incidents (about 68%), Warden is faster than humans, particularly for incidents that take a long time to detect manually.
Year
2021
Venue
PROCEEDINGS OF THE 2021 USENIX ANNUAL TECHNICAL CONFERENCE
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
16
Name | Order | Citations and PageRank
Liqun Li | 1 | 00.34
Xu Zhang | 2 | 252.85
Xin Zhao | 3 | 4712.60
Hongyu Zhang | 4 | 86450.03
Yu Kang | 5 | 103.24
Pu Zhao | 6 | 87.23
Bo Qiao | 7 | 339.09
Shilin He | 8 | 10.69
Pochian Lee | 9 | 00.34
Jeffrey Sun | 10 | 00.34
Feng Gao | 11 | 5317.81
Li Yang | 12 | 41.43
Qingwei Lin | 13 | 00.34
Saravanakumar Rajmohan | 14 | 13.39
Zhangwei Xu | 15 | 112.59
Dongmei Zhang | 16 | 1439132.94