Title
Lessons learned from development and operation of the K computer.
Abstract
Operational experiences of one of the most powerful supercomputers are reported. Failure rates of major components and MTBF are evaluated. The causes of severe failures that seriously impacted operation are analyzed. We report operational experiences of the K computer, which is one of the most powerful supercomputers in the world. The K computer achieved excellent results for system availability, job-filling rate, and failure rate. On the other hand, approximately 70% of the unscheduled system stop time was caused by file system failures. We analyzed the reasons for the failures and found that the massive and complex system configuration of the file system is one of the crucial factors. This configuration exposed many latent bugs in the file system software, and those bugs caused many failures that severely impacted operation.
Year
2017
DOI
10.1016/j.parco.2017.03.001
Venue
Parallel Computing
Keywords
The K computer, Operation improvement, Failure analysis, Parallel file system
Field
File system, Computer science, Parallel computing, Failure rate, Software, Operating system, Embedded system
DocType
Journal
Volume
64
Issue
C
ISSN
0167-8191
Citations
1
PageRank
0.36
References
5
Authors
1
Name
Order
Citations
PageRank
Fumiyoshi Shoji1527.36