Title
Spread-n-share: improving application performance and cluster throughput with resource-aware job placement
Abstract
Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often causes self-contention among tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory bandwidth usage is still under-investigated. In this work, we propose Spread-n-Share (SNS): a new batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottleneck, and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler to validate SNS, considering shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.
Year
2019
DOI
10.1145/3295500.3356152
Venue
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Field
Bottleneck, Memory bandwidth, Cache, CPU time, Computer science, Batch processing, Job scheduler, Throughput, Distributed computing, Speedup
DocType
Conference
ISBN
978-1-4503-6229-0
Citations
0
PageRank
0.34
References
0
Authors
7
Name               Order  Citations  PageRank
Xiongchao Tang     1      56         6.06
Haojie Wang        2      2          3.75
Xiaosong Ma        3      1117       68.36
Nosayba El-Sayed   4      133        9.64
Jidong Zhai        5      340        36.27
Wenguang Chen      6      1014       70.57
Ashraf Aboulnaga   7      1289       91.33