Title
OS-level hang detection in complex software systems
Abstract
Many critical services are now provided by large and complex software systems. However, the increasing complexity introduces several sources of non-determinism, which may lead to hang failures: the system appears to be running, but part of its services is perceived as unresponsive. Online monitoring is the only way to detect and promptly react to such failures. However, when dealing with off-the-shelf-based systems, online detection can be tricky, since instrumentation and log data collection may not be feasible in practice. In this paper, a detection framework to cope with software hangs is proposed. The framework enables non-intrusive monitoring of complex systems, based on multiple sources of data gathered at the operating system (OS) level. The collected data are then combined to reveal hang failures. The framework is evaluated through a fault injection campaign on two complex systems from the air traffic management (ATM) domain. Results show that combining several monitors at the OS level is effective at detecting hang failures, in terms of coverage and false positives, with a negligible impact on performance.
Year
2011
DOI
10.1504/IJCCBS.2011.042333
Venue
IJCCBS
Keywords
off-the-shelf-based system, online monitoring, detection framework, complex system, complex software system, non-intrusive monitoring, os level, software hang, online detection, log data collection, operating systems
Field
Air traffic management, Computer science, Software system, Real-time computing, Software, Hang, Atmosphere (unit), Fault injection, Distributed computing, Data collection, Operating system, False positive paradox, Embedded system
DocType
Journal
Volume
2
Issue
3/4
Citations
2
PageRank
0.36
References
20
Authors
5
Name                 Order  Citations  PageRank
Antonio Bovenzi      1      54         4.37
Marcello Cinque      2      286        33.58
Domenico Cotroneo    3      974        79.93
Roberto Natella      4      458        33.90
Gabriella Carrozza   5      84         11.39