Title
Implementing Reliable Data Structures for MPI Services in High Component Count Systems
Abstract
High-performance computing systems continue to grow: currently deployed systems exceed 160,000 cores, and systems exceeding 1,000,000 cores are planned. Without significant improvements in component reliability, partial system failures could become an unacceptably regular occurrence, limiting the usability of advanced computing infrastructures. In this work, we aim to ease the development of survivable systems and applications by implementing a reliable key/value data store based on a distributed hash table (DHT). Borrowing from techniques developed for unreliable wide-area systems, we implemented a distributed data service built with MPI [1] that enables user data structures to survive partial system failure. The service is based on a new implementation of the Kademlia [2] distributed hash table.
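As a rough illustration of how a Kademlia-style DHT places and recovers data, the following Python sketch simulates the XOR distance metric, k-bucket indexing, and k-way replication within a single process. It is not the authors' MPI implementation. The class `ToyDHT`, its helper functions, the replication factor `k=3`, and the use of SHA-1 to derive 160-bit identifiers from hypothetical rank names are assumptions made for illustration; only the 160-bit ID space and the XOR metric come from Kademlia itself.

```python
import hashlib


def key_id(key: str) -> int:
    """Map a key (or node name) to a 160-bit identifier via SHA-1 (assumed hash)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: bitwise XOR interpreted as an integer."""
    return a ^ b


def bucket_index(self_id: int, other_id: int) -> int:
    """k-bucket that would hold other_id: position of the highest differing bit."""
    d = xor_distance(self_id, other_id)
    if d == 0:
        raise ValueError("a node does not place itself in its own routing table")
    return d.bit_length() - 1


def k_closest(node_ids, target: int, k: int):
    """Return the k node IDs closest to target under the XOR metric."""
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]


class ToyDHT:
    """Hypothetical single-process toy: each node owns a dict, values go to k nodes."""

    def __init__(self, node_ids, k=3):
        self.k = k
        self.stores = {n: {} for n in node_ids}
        self.alive = set(node_ids)

    def put(self, key, value):
        # Replicate the value on the k live nodes closest to the key's ID.
        for n in k_closest(self.alive, key_id(key), self.k):
            self.stores[n][key] = value

    def fail(self, node):
        # Simulate a partial system failure: the node and its store vanish.
        self.alive.discard(node)

    def get(self, key):
        # Ask surviving nodes in order of closeness; any surviving replica answers.
        target = key_id(key)
        for n in sorted(self.alive, key=lambda n: xor_distance(n, target)):
            if key in self.stores[n]:
                return self.stores[n][key]
        return None


if __name__ == "__main__":
    nodes = [key_id(f"rank-{r}") for r in range(16)]
    dht = ToyDHT(nodes, k=3)
    dht.put("checkpoint/step", 42)
    for n in k_closest(nodes, key_id("checkpoint/step"), 2):
        dht.fail(n)
    print(dht.get("checkpoint/step"))  # third replica still answers
```

Because each value is replicated on the k nodes closest to its key, a lookup succeeds as long as one replica holder survives; if node failures were independent with probability p each, a value would be lost with probability p^k. A real service would also route lookups through each node's k-buckets using MPI messages rather than scanning a global node list as this toy does.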
Year
2009
DOI
10.1007/978-3-642-03770-2_39
Venue
PVM/MPI
Keywords
user data structure,data service,partial system failure,advanced computing infrastructure,implementing reliable data structures,new implementation,partial system failure mode,hash table,mpi services,high performance computing system,value data,high component count systems,survivable system,failure mode,distributed hash table,data structure
Field
Data structure,Supercomputer,Computer science,Usability,Data as a service,Kademlia,Operating system,Limiting,Distributed computing,Distributed hash table
DocType
Conference
Volume
5759
ISSN
0302-9743
Citations
2
PageRank
0.38
References
8
Authors
6
Name               Order  Citations  PageRank
Justin M. Wozniak  1      464        35.32
Bryan Jacobs       2      15         1.36
Robert Latham      3      134        8.57
Sam Lang           4      55         2.85
Seung Woo Son      5      296        31.43
Robert B. Ross     6      6          1.92