Self healing in System-S - Citegraph

Paper Info

Title
Self healing in System-S

Abstract
Faults in a cluster are inevitable. The larger the cluster, the more likely the occurrence of some failure in hardware, in software, or by human error. System-S software must detect and self-repair failures while carrying out its prime directive--enabling stream processing program fragments to be distributed and connected to form complex applications. Depending on the type of failure, System-S may be able to continue with little or no disruption to potentially tens of thousands of interdependent and heterogeneous program fragments running across thousands of nodes. We extend the work we previously presented on the self healing nature of the job manager component in System-S by presenting how it can handle failures of other system components, applications and network infrastructure. We also evaluate the recoverability of the job management orchestrator component of System-S, considering crash failures with and without error propagation.

Year	DOI	Venue
2008	10.1007/s10586-008-0057-8	Cluster Computing
Keywords	Field	DocType
Fault-tolerance,Stream processing systems,Distributed recovery	Prime (order theory),Interdependence,Crash,Computer science,Real-time computing,Human error,Orchestration,Software,Fault tolerance,Stream processing,Distributed computing	Journal
Volume	Issue	ISSN
11	3	1386-7857
Citations	PageRank	References
2	0.39	15
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Gabriela Jacques-Silva	1	171	11.81
Jim Challenger	2	345	48.04
Lou Degenaro	3	41	4.36
James Giles	4	31	2.26
Rohit Wagle	5	145	9.13
