Title
Extended Task Queuing: Active Messages for Heterogeneous Systems
Abstract
Accelerators have emerged as an important component of modern cloud, datacenter, and HPC computing environments. However, launching tasks on remote accelerators across a network remains unwieldy, forcing programmers to send data in large chunks to amortize the transfer and launch overhead. By combining advances in intra-node accelerator unification with one-sided Remote Direct Memory Access (RDMA) communication primitives, it is possible to efficiently implement lightweight tasking across distributed-memory systems. This paper introduces Extended Task Queuing (XTQ), an RDMA-based active messaging mechanism for accelerators in distributed-memory systems. XTQ's direct NIC-to-accelerator communication decreases inter-node GPU task launch latency by 10-15% for small-to-medium sized messages and ameliorates CPU message servicing overheads. These benefits are shown in the context of MPI accumulate, reduce, and allreduce operations with up to 64 nodes. Finally, we illustrate how XTQ can improve the performance of popular deep learning workloads implemented in the Computational Network Toolkit (CNTK).
Year
2016
DOI
10.1109/SC.2016.79
Venue
SC
Keywords
Accelerator architectures, Computer architecture, Computer networks, Distributed computing, Network interfaces
Field
Latency (engineering), Computer science, Computer network, Queueing theory, Artificial intelligence, Remote direct memory access, Deep learning, Distributed computing, Network interface, Central processing unit, Parallel computing, Schedule, Cloud computing
DocType
Conference
ISBN
978-1-4673-8815-3
Citations
3
PageRank
0.39
References
21
Authors
16