Title
Identifying linked incidents in large-scale online service systems
Abstract
In large-scale online service systems, incidents occur frequently due to a variety of causes, from updates of software and hardware to changes in the operating environment. These incidents can significantly degrade system availability and customer satisfaction. Some incidents are linked because they are duplicates or are inter-related. Linked incidents can greatly help on-call engineers find mitigation solutions and identify root causes. In this work, we investigate incidents and their links in a representative real-world incident management (IcM) system. Based on the identified indicators of linked incidents, we further propose LiDAR (Linked Incident identification with DAta-driven Representation), a deep-learning-based approach to incident linking. More specifically, we incorporate the textual descriptions of incidents and structural information extracted from historical linked incidents to identify possible links among a large number of incidents. To show the effectiveness of our method, we apply it to a real-world IcM system and find that it outperforms other state-of-the-art methods.
Year: 2020
DOI: 10.1145/3368089.3409768
Venue: ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 2020
DocType: Conference
ISBN: 978-1-4503-7043-1
Citations: 0
PageRank: 0.34
References: 0
Authors: 12
Name            Order  Citations  PageRank
Yujun Chen      1      0          0.34
Xian Yang       2      0          0.68
Hang Dong       3      4          1.11
Xiaoting He     4      36         2.32
Hongyu Zhang    5      864        50.03
Qingwei Lin     6      285        27.76
Junjie Chen     7      83         14.71
Pu Zhao         8      8          7.23
Yu Kang         9      10         3.24
Feng Gao        10     53         17.81
Zhangwei Xu     11     11         2.59
Dongmei Zhang   12     1439       132.94