Title
Near-Realtime Server Reboot Monitoring and Root Cause Analysis in a Large-Scale System
Abstract
Large-scale Internet services run on a fleet of distributed servers, and the continuous availability of the hardware is key to the robustness of the services. Unplanned reboots disrupt the services running on the hardware and lower the fleet availability. Server reboots are also important signals that could indicate underlying issues such as memory leaks from the services, catastrophic hardware fa...
Year
2021
DOI
10.1109/DSN-S52858.2021.00027
Venue
2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S)
Keywords
server reboots, datacenter, availability, near realtime, large scale production system, data engineering
DocType
Conference
ISSN
1530-0889
ISBN
978-1-6654-3566-6
Citations
0
PageRank
0.34
References
0
Authors
7
Name            Order  Citations  PageRank
Fred Lin        1      0          0.34
Bhargav Bolla   2      0          0.34
Eric Pinkham    3      0          0.34
Neil Kodner     4      0          0.34
Daniel Moore    5      0          0.34
Amol Desai      6      0          0.68
Sriram Sankar   7      1          1.70