SLOAVx: Scalable LOgarithmic AlltoallV Algorithm for Hierarchical Multicore Systems - Citegraph

Paper Info

Title
SLOAVx: Scalable LOgarithmic AlltoallV Algorithm for Hierarchical Multicore Systems

Abstract
Scientific applications use collective communication operations in the Message Passing Interface (MPI) for global synchronization and data exchange. Alltoall and AlltoallV are two important collective operations that MPI jobs use to exchange messages among all MPI processes. AlltoallV is a generalization of Alltoall that supports messages of varying sizes. However, the existing MPI AlltoallV implementation has linear complexity, i.e., each process has to send messages to all other processes in the job. Such linear complexity can result in suboptimal scalability of MPI applications when they are deployed on millions of cores. To address this challenge, in this paper we introduce a new Scalable LOgarithmic AlltoallV algorithm, named SLOAV, for the MPI AlltoallV collective operation. SLOAV aims to achieve global exchange of small messages of different sizes in a logarithmic number of rounds. Furthermore, given the prevalence of multicore systems with shared memory, we design a hierarchical AlltoallV algorithm based on SLOAV, referred to as SLOAVx, that leverages the advantages of shared memory. Compared to SLOAV, SLOAVx significantly reduces inter-node communication, thus improving overall system performance and mitigating the impact of message latency. We have implemented and embedded both algorithms in Open MPI. Our evaluation on large-scale computer systems shows that for the 8-byte, 1024-process MPI AlltoallV operation, SLOAV can reduce latency by as much as 86.4% compared to the state of the art, and SLOAVx can further improve on SLOAV by up to 83.1% in terms of message latency on multicore systems. In addition, experiments with the NAS Parallel Benchmark (NPB) demonstrate that our algorithms are very effective for real-world applications.
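
The abstract does not describe SLOAV's individual rounds, so the following is only an illustrative sketch of how an all-to-all exchange of variable-sized messages can finish in a logarithmic number of rounds. It simulates the classic Bruck-style store-and-forward scheme in a single Python process. It is not the authors' implementation and uses no MPI calls. The function name `log_alltoallv` and the in-memory `send` matrix are assumptions made for this example.

```python
from math import ceil, log2


def log_alltoallv(send):
    """Simulate a Bruck-style logarithmic all-to-all (illustrative only).

    send[i][j] is the payload (bytes of any length) that rank i sends to rank j.
    Returns (recv, rounds) where recv[j][i] == send[i][j].
    """
    p = len(send)

    # Phase 1: local rotation. Slot k on rank i holds the block that must
    # travel k ranks forward, i.e. the block destined for rank (i + k) % p.
    slots = [[send[i][(i + k) % p] for k in range(p)] for i in range(p)]

    rounds = ceil(log2(p)) if p > 1 else 0
    for r in range(rounds):
        step = 1 << r
        # Each rank packs every slot whose index has bit r set into ONE
        # combined message. Snapshot first so all ranks exchange "at once".
        outgoing = [
            {k: slots[i][k] for k in range(p) if k & step} for i in range(p)
        ]
        for i in range(p):
            peer = (i + step) % p  # the single destination in this round
            for k, block in outgoing[i].items():
                slots[peer][k] = block

    # Phase 2: inverse rotation. Slot k on rank j now holds the block that
    # originated on rank (j - k) % p.
    recv = [[None] * p for _ in range(p)]
    for j in range(p):
        for k in range(p):
            recv[j][(j - k) % p] = slots[j][k]
    return recv, rounds


if __name__ == "__main__":
    p = 5
    # Variable-sized payloads: the length depends on the (source, destination) pair.
    send = [[f"{i}->{j}".encode() * ((i + j) % 3 + 1) for j in range(p)]
            for i in range(p)]
    recv, rounds = log_alltoallv(send)
    print("rounds:", rounds)
    print("rank 2 received from rank 4:", recv[2][4])
```

With this scheme each rank sends one combined message per round, so 1024 ranks need 10 rounds instead of 1023 direct sends each. The cost is that blocks are forwarded through intermediate ranks, which is why such schemes target small messages. A real AlltoallV would also have to transmit block sizes alongside the data, since an intermediate rank does not know the lengths of the blocks it forwards; the simulation sidesteps this because Python `bytes` objects carry their own length.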

Year	DOI	Venue
2013	10.1109/CCGrid.2013.22	CCGrid
Keywords	Field	DocType
nas parallel benchmark,scientific application,latency reduction,shared memory,scalability,linear complexity,data exchange,mpi application,message size,hierarchical multicore system,suboptimal scalability,system performance,natural sciences computing,large-scale computer system,internode communication,message passing interface,message exchange,message sending,mpi process,computational complexity,shared memory systems,alltoallv algorithm,sloavx,message latency,npb,mpi,message passing,scalable logarithmic alltoallv algorithm,global synchronization,collective communication operation,collectives,mpi job,mpi alltoallv collective operation,clustering algorithms,multicore processing,algorithm design and analysis,benchmark testing,optimization	Synchronization,Shared memory,Computer science,Latency (engineering),Parallel computing,Algorithm,Message Passing Interface,Logarithm,Message passing,Computational complexity theory,Scalability,Distributed computing	Conference
ISSN	ISBN	Citations
2376-4414	978-1-4673-6465-2	5
PageRank	References	Authors
0.49	9	6

Authors (6 rows)

Name	Order	Citations	PageRank
Cong Xu	1	50	4.38
Manjunath Gorentla Venkata	2	86	20.02
Richard L. Graham	3	954	73.91
Yandong Wang	4	342	18.88
Zhuo Liu	5	118	16.03
Weikuan Yu	6	1042	77.40
