Title
Balancing job performance with system performance via locality-aware scheduling on torus-connected systems
Abstract
Torus-connected networks are widely used in modern supercomputers because of their linear per-node cost scaling and competitive overall performance. The job scheduling system plays a critical role in the efficient use of supercomputers. As supercomputers continue to grow in size, a fundamental problem arises: how can job performance be effectively balanced with system performance on torus-connected machines? In this work, we present a new scheduling design named window-based locality-aware scheduling. Our design contains three novel features. First, rather than scheduling jobs one by one, our design considers a “window” of jobs, i.e., multiple jobs, for job prioritizing and resource allocation. Second, our design maintains a list of slots to preserve node contiguity information for resource allocation. Finally, we formulate our scheduling decision making as a 0-1 Multiple Knapsack Problem and present two algorithms to solve it. A series of trace-based simulations using job logs collected from production supercomputers indicates that this new scheduling design has real potential and can effectively balance job performance and system performance.
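To make the knapsack formulation concrete, here is a minimal sketch under stated assumptions; it is not the paper's two algorithms. It assumes that each slot is a maximal run of contiguous free nodes along a 1-D ordering of the torus (wrap-around ignored), that each job in the window is a knapsack item whose weight is its requested node count, and that each job carries a hypothetical priority `value`. The names `Job`, `free_slots`, and `schedule_window` are illustrative only. The solver enumerates every assignment of window jobs to slots, which is exact but exponential in the window size.

```python
from itertools import product
from typing import NamedTuple


class Job(NamedTuple):
    name: str
    nodes: int      # requested node count (knapsack item weight)
    value: float    # hypothetical priority score (knapsack item value)


def free_slots(busy: list[bool]) -> list[tuple[int, int]]:
    """Return (start, length) for each maximal run of free nodes.

    Assumes nodes follow a 1-D ordering of the torus (a simplification);
    wrap-around between the last and first node is ignored here.
    """
    slots, start = [], None
    for i, b in enumerate(busy + [True]):  # sentinel closes the final run
        if not b and start is None:
            start = i
        elif b and start is not None:
            slots.append((start, i - start))
            start = None
    return slots


def schedule_window(window: list[Job], slots: list[tuple[int, int]]):
    """Solve a 0-1 multiple knapsack problem by exhaustive search.

    Each job goes to at most one slot (-1 means it stays queued), and the
    jobs placed in a slot must fit within that slot's length. The search
    visits (len(slots) + 1) ** len(window) assignments: tiny windows only.
    """
    best_value, best_assign = 0.0, [-1] * len(window)
    for assign in product(range(-1, len(slots)), repeat=len(window)):
        used = [0] * len(slots)
        for job, s in zip(window, assign):
            if s >= 0:
                used[s] += job.nodes
        if any(u > cap for u, (_, cap) in zip(used, slots)):
            continue  # some slot is over capacity: infeasible assignment
        value = sum(j.value for j, s in zip(window, assign) if s >= 0)
        if value > best_value:
            best_value, best_assign = value, list(assign)
    return best_value, best_assign
```

For example, with free runs of 4, 3, and 2 nodes and a window of four jobs, the 4-node job B takes the first slot, A takes the second, C takes the third, and D waits, whereas placing jobs one by one in arrival order would put A into the 4-node slot and leave B queued:

```python
busy = [True, False, False, False, False, True, True,
        False, False, False, True, False, False]
slots = free_slots(busy)  # [(1, 4), (7, 3), (11, 2)]
window = [Job("A", 3, 3.0), Job("B", 4, 5.0), Job("C", 2, 1.5), Job("D", 3, 2.0)]
print(schedule_window(window, slots))  # (9.5, [1, 0, 2, -1])
```

Exhaustive search only shows what the objective is; a practical scheduler would need a faster exact or heuristic solver.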
Year
2014
DOI
10.1109/CLUSTER.2014.6968751
Venue
CLUSTER
Keywords
competitive overall performance, processor scheduling, supercomputers, torus-connected network, scheduling decision making, trace-based simulations, job prioritizing, node contiguity information, torus-connected machines, system performance, resource allocation, scheduling design, 0-1 multiple knapsack problem, job scheduling system, job performance, knapsack problems, performance evaluation, window-based locality-aware scheduling, mobile computing, parallel machines
DocType
Conference
ISSN
1552-5244
Citations
2
PageRank
0.38
References
0
Authors
6
Name            Order   Citations   PageRank
Xu Yang         1       87          6.95
Zhou Zhou       2       85          6.02
Wei Tang        3       152         10.65
Xingwu Zheng    4       2           0.38
Jia Wang        5       148         12.47
Zhiling Lan     6       818         54.25