Title
Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment
Abstract
Root cause analysis in a large-scale production environment is challenging due to the complexity and scale of the services running across global data centers. It is often difficult to review the logs jointly for understanding production issues given the distributed nature of the system. Additionally, there could easily be millions of entities, each described by hundreds of features. In this paper, we present a fast dimensional analysis framework that automates the root cause analysis on structured logs with improved scalability. We first explore item-sets, i.e., combinations of feature values, that could identify groups of samples with sufficient support for the target failures using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them on structured logs, we select the item-sets that are most unique to the target failures based on lift. We propose pre-processing steps with the use of a large-scale real-time database and post-processing techniques and parallelism to further speed up the analysis and improve interpretability, and demonstrate that such optimization is necessary for handling large-scale production datasets. We have successfully rolled out this approach for root cause investigation purposes within Facebook's infrastructure. We also present the setup and results from multiple production use cases in this paper.
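The support and lift filtering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a brute-force enumeration of item-sets stands in for Apriori and FP-Growth, the standard association-rule definitions of support, confidence, and lift are assumed, and the function name, thresholds, and sample log rows are all hypothetical.

```python
from itertools import combinations


def lift_ranked_itemsets(rows, failed, min_support=0.2, max_len=2):
    """Rank item-sets (combinations of feature values) by lift w.r.t. failure.

    rows   : list of dicts mapping feature name -> value (structured log rows)
    failed : list of bools, True where the row is a target failure
    Brute-force counting is used here instead of Apriori/FP-Growth.
    """
    n = len(rows)
    p_fail = sum(failed) / n  # baseline failure rate, supp(failure)

    # For each item-set, count rows containing it and how many of them failed.
    counts = {}
    for row, is_fail in zip(rows, failed):
        items = sorted(row.items())
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                total, bad = counts.get(itemset, (0, 0))
                counts[itemset] = (total + 1, bad + int(is_fail))

    results = []
    for itemset, (total, bad) in counts.items():
        support = bad / n  # fraction of all rows that match X and failed
        if support < min_support:
            continue
        confidence = bad / total      # P(failure | X)
        lift = confidence / p_fail    # how much X raises the failure rate
        results.append((itemset, support, lift))

    # Highest lift first; shorter item-sets first among ties.
    results.sort(key=lambda r: (-r[2], len(r[0])))
    return results


# Hypothetical structured log: two features, five rows, two failures.
rows = [
    {"host": "a", "kernel": "5.4"},
    {"host": "a", "kernel": "5.4"},
    {"host": "b", "kernel": "5.4"},
    {"host": "b", "kernel": "4.9"},
    {"host": "a", "kernel": "4.9"},
]
failed = [True, True, False, False, False]

results = lift_ranked_itemsets(rows, failed)
for itemset, support, lift in results:
    print(itemset, round(support, 2), round(lift, 2))
```

In this toy data, the combination host=a with kernel=5.4 ranks first because every row matching it is a failure, while each feature value alone also matches a healthy row and therefore has lower lift.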
Year
2020
DOI
10.1145/3393691.3394185
Venue
SIGMETRICS '20: ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, Boston, MA, USA, June 2020
DocType
Conference
Volume
4
Issue
2
ISBN
978-1-4503-7985-4
Citations
0
PageRank
0.34
References
0
Authors
6
Name                    Order  Citations  PageRank
Fan Lin                 1      0          0.34
Keyur Muzumdar          2      0          0.34
Nikolay Laptev          3      163        11.07
Mihai-Valentin Curelea  4      0          0.34
Seunghak Lee            5      0          0.34
Sriram Sankar           6      1          1.70