Title
Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System
Abstract
Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single-node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors, where our runtime manages all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead by offloading communication to dedicated communication workers, and that we can recover from fail-stop errors transparently, thereby enhancing productivity.
Year: 2020
DOI: 10.1109/ExaMPI52011.2020.00010
Venue: 2020 Workshop on Exascale MPI (ExaMPI)
Keywords: Resilience, AMT Runtimes, Habanero C/C++, MPI communication, Fenix, MPI-ULFM
DocType: Conference
ISBN: 978-1-6654-1562-0
Citations: 1
PageRank: 0.36
References: 0
Authors (8)
Name               Order  Citations  PageRank
Sri Raj Paul       1      3          1.75
Akihiro Hayashi    2      1          1.04
Matthew Whitlock   3      1          0.36
Seonmyeong Bak     4      1          0.36
Keita Teranishi    5      4          1.09
Jackson Mayo       6      43         7.97
Max Grossman       7      91         10.48
Vivek Sarkar       8      4318       409.41