Abstract | ||
---|---|---|
Incidents and outages dramatically degrade the availability of large-scale cloud computing systems such as AWS, Azure, and GCP. In current incident response practice, each team has only a partial view of the entire system, which makes detecting incidents like fighting in the "fog of war". As a result, mitigation takes longer and financial losses grow. In this work, we propose an automatic incident detection system, named Warden, as part of the Incident Management (IcM) platform. Warden collects alerts from different services and detects the occurrence of incidents from a global perspective. For each detected potential incident, Warden notifies the relevant on-call engineers so that they can properly prioritize their tasks and initiate cross-team collaboration. We implemented and deployed Warden in the IcM platform of Azure. Our evaluation, based on data collected over an 18-month period from 26 major services, shows that Warden is effective and outperforms the baseline methods. For the majority of successfully detected incidents (~68%), Warden is faster than humans, particularly for incidents that take a long time to detect manually. |
Year | Venue | DocType |
---|---|---|
2021 | Proceedings of the 2021 USENIX Annual Technical Conference | Conference |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
16 |
Name | Order | Citations | PageRank |
---|---|---|---|
Liqun Li | 1 | 0 | 0.34 |
Xu Zhang | 2 | 25 | 2.85 |
Xin Zhao | 3 | 47 | 12.60 |
Hongyu Zhang | 4 | 864 | 50.03 |
Yu Kang | 5 | 10 | 3.24 |
Pu Zhao | 6 | 8 | 7.23 |
Bo Qiao | 7 | 33 | 9.09 |
Shilin He | 8 | 1 | 0.69 |
Pochian Lee | 9 | 0 | 0.34 |
Jeffrey Sun | 10 | 0 | 0.34 |
Feng Gao | 11 | 53 | 17.81 |
Li Yang | 12 | 4 | 1.43 |
Qingwei Lin | 13 | 0 | 0.34 |
Saravanakumar Rajmohan | 14 | 1 | 3.39 |
Zhangwei Xu | 15 | 11 | 2.59 |
Dongmei Zhang | 16 | 1439 | 132.94 |