Abstract |
---|
Pilot-Job systems play an important role in supporting distributed scientific computing. They are used to execute millions of jobs on several cyberinfrastructures worldwide, consuming billions of CPU hours a year. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also being adopted beyond their traditional domains. Notwithstanding their growing impact on scientific research, there is no agreement on the definition of a Pilot-Job system and no clear understanding of the underlying abstraction and paradigm. Pilot-Job implementations have proliferated with no shared best practices or open interfaces and little interoperability. Ultimately, this is hindering the realization of the full impact of Pilot-Jobs by limiting their robustness, portability, and maintainability. This article offers a comprehensive analysis of Pilot-Job systems, critically assessing their motivations, evolution, properties, and implementation. The three main contributions of this article are as follows: (1) an analysis of the motivations and evolution of Pilot-Job systems; (2) an outline of the Pilot abstraction, its distinguishing logical components and functionalities, its terminology, and its architecture pattern; and (3) a description of the core and auxiliary properties of Pilot-Job systems and an analysis of six exemplar Pilot-Job implementations. Together, these contributions illustrate the Pilot paradigm, its generality, and how it helps to address some challenges in distributed scientific computing.
Field | Value |
---|---|
Year | 2018 |
DOI | 10.1145/3177851 |
Venue | ACM Computing Surveys (CSUR) |
Keywords | Distributed applications, Pilot-Jobs, distributed systems |
DocType | Journal |
Volume | 51 |
Issue | 2 |
ISSN | 0360-0300 |
Citations | 5 |
PageRank | 0.50 |
References | 0 |
Authors | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Matteo Turilli | 1 | 84 | 16.21 |
Mark Santcroos | 2 | 70 | 8.11 |
Shantenu Jha | 3 | 188 | 32.40 |