Title
Toward achieving operational excellence in a cloud
Abstract
A cloud pools resources such as compute, network, and storage and delivers them quickly and automatically on demand through software. In addition, it provides automatic, policy-driven management of resources through software. Such a system comprises many components whose states change rapidly. To manage it effectively, cloud service providers need to clearly understand the behavior of operations across components and be able to fix errors as early as possible. The task of building such capabilities (referred to as operational excellence) in a cloud system is challenging because components maintain internal state and interact in non-intuitive ways to perform automated operations. In this paper, we discuss the concept of operational excellence for a cloud system, examine the challenges in achieving it, and describe our vision. Toward our vision, we present a set of techniques to determine the causal sequences of system events across distributed components. We also model configured system states using causal sequences of system events, gather observed system states, and continuously verify the configured and observed states across system components. We apply these techniques to study OpenStack®, an open source infrastructure-as-a-service platform.
Year
2014
DOI
10.1147/JRD.2014.2298927
Venue
IBM Journal of Research and Development
Field
Operational excellence,Systems engineering,Cloud systems,Computer science,Software,Cloud service provider,Casual,Cloud computing
DocType
Journal
Volume
58
Issue
2-3
ISSN
0018-8646
Citations
0
PageRank
0.34
References
8
Authors
5
Name, Order, Citations, PageRank
Salman Baset, 1, 69, 9.66
Long Wang, 2, 6, 1.23
Byung Chul Tak, 3, 192, 13.69
Cuong Pham, 4, 0, 0.34
Chunqiang Tang, 5, 40, 4.64