Title
Exploiting the behavior of the failed job in high performance computing system
Abstract
As demand for high-performance computing power increases, operation management technologies such as checkpointing, failure-aware task scheduling, and system simulation are becoming more important for stable system operation. Maintaining and managing a stable system requires a detailed analysis of failed tasks. To that end, this paper analyzes the characteristics of failed jobs in a high-performance computing system. Our contributions are threefold. First, we present detailed analysis results for failed jobs based on the job logs of a supercomputer currently in operation. Second, beyond an overall statistical analysis, we identify the distribution of the inter-arrival time of failed job submissions. Third, we analyze the occurrence probability of failure events using the hazard rate.
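The two quantities named in the abstract, inter-arrival time and hazard rate, can be illustrated with a minimal sketch. This is not the authors' method: the record does not describe their estimator, so the code below assumes the standard definition of the hazard rate, h(t) = f(t) / S(t), and approximates it per time bin as the share of still-open intervals that end inside the bin. The function names, the bin width, and the sample timestamps are all hypothetical.

```python
from bisect import bisect_left


def inter_arrival_times(submit_times):
    """Return the gaps between consecutive submission timestamps."""
    ordered = sorted(submit_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def empirical_hazard(gaps, bin_width):
    """Estimate a discrete hazard rate per bin of width bin_width.

    For the bin starting at t, the estimate is
    (gaps ending in [t, t + bin_width)) / (gaps that are >= t).
    Assumes non-negative gaps; binning starts at t = 0.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    ordered = sorted(gaps)
    total = len(ordered)
    rates = []
    k = 0
    while True:
        start = k * bin_width
        already_ended = bisect_left(ordered, start)
        at_risk = total - already_ended
        if at_risk == 0:
            break  # every gap has ended before this bin
        ended_here = bisect_left(ordered, start + bin_width) - already_ended
        rates.append(ended_here / at_risk)
        k += 1
    return rates


# Hypothetical submission times of failed jobs, in seconds.
failed_job_submit_times = [0, 10, 15, 45, 50, 110]
gaps = inter_arrival_times(failed_job_submit_times)
print(gaps)                        # [10, 5, 30, 5, 60]
print(empirical_hazard(gaps, 20))  # [0.6, 0.5, 0.0, 1.0]
```

A hazard estimate that falls as t grows suggests that failed submissions cluster in bursts, while a flat estimate is consistent with exponentially distributed gaps. With only five gaps, the sample above is too small to support either reading.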
Year
2018
DOI
10.1109/ICCSA.2018.8439570
Venue
2018 18th International Conference on Computational Science and Applications (ICCSA)
Keywords
Supercomputer, failed job behavior, stochastic analysis
Field
Industrial engineering, Supercomputer, Scheduling (computing), Computer science, Stochastic process, Stable system, Statistical analysis, Distributed computing
DocType
Conference
ISBN
978-1-5386-7215-0
Citations
0
PageRank
0.34
References
3
Authors
2
Name
Order
Citations
PageRank
Ju-Won Park1195.09
Eun-Hye Kim21910.40