Title
Towards intelligent incident management: why we need it and how we make it
Abstract
The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises are able to resolve most incidents automatically and in a timely manner. In practice, however, we still observe critical service incidents that occur unexpectedly and that orchestrated diagnosis workflows fail to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ the strategy of AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and to understand modern incident management systems, we conduct a comprehensive empirical study spanning more than two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate their underlying causes from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show the practical benefits it brings to the cloud services of Microsoft.
Year
2020
DOI
10.1145/3368089.3417055
Venue
ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 2020
DocType
Conference
ISBN
978-1-4503-7043-1
Citations
3
PageRank
0.39
References
0
Authors
17
Name
Order
Citations
PageRank
Zhuangbin Chen1212.06
Yu Kang2103.24
Liqun Li330713.67
Xu Zhang4252.85
Hongyu Zhang586450.03
Hui Xu621229.73
Yangfan Zhou723229.72
Li Yang841.43
Jeffrey Sun930.73
Zhangwei Xu10112.59
Yingnong Dang1153726.92
feng gao125317.81
Pu Zhao1387.23
Bo Qiao14339.09
Qingwei Lin1528527.76
Dongmei Zhang161439132.94
Michael R. Lyu1763.15