Title
Non-intrusively Avoiding Scaling Problems in and out of MPI Collectives
Abstract
Scaling problems are highly likely to manifest when MPI applications are launched at large scale, where scale is characterized by the degree of parallelism and the problem size. Because the complexity of MPI collectives depends directly on both the degree of parallelism and the problem size, their use often triggers scaling problems. The root cause of a scaling problem can lie outside the MPI library and be exposed by the dynamic interaction between user code and the MPI library as the scale goes up. Irregular collectives suffer the most, since their C int displacement arrays can easily be corrupted by integer overflow. Scaling problems can also result from bugs inside released MPI libraries, owing to the lack of systematic testing of MPI libraries and to the platform or environment dependency of some scaling problems. Hence it is important for library users to test on their own platform to expose potential scaling problems. Fixing a scaling problem is challenging, so users often wait a long time for an official fix, which is sometimes not possible at all because of the difficulty of bug reproduction, root-cause identification, and fix development. To improve users' productivity, we establish the necessity of user-side testing and provide a protection layer that avoids scaling problems non-intrusively, i.e., without requiring any changes to the MPI library or user programs. This provides an immediate remedy when an official fix is not readily available.
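The displacement overflow mentioned in the abstract can be illustrated with a small sketch. It is not the paper's protection layer. It only shows, under the assumption of a 32-bit C int and non-negative counts, how the prefix sum that fills a displacement array for an irregular collective can exceed INT_MAX as the problem size grows. The helper name build_displs is hypothetical and does not belong to any MPI library.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper, not part of MPI or of the paper.
 * Fills displs[] with the running sum of counts[], accumulating in 64 bits
 * so the sum itself cannot wrap. Assumes every counts[i] is non-negative.
 * Returns -1 if every displacement fits in a C int, otherwise the index of
 * the first rank whose displacement would exceed INT_MAX. */
static int build_displs(const int *counts, int nranks, int *displs)
{
    int64_t offset = 0;
    for (int i = 0; i < nranks; i++) {
        if (offset > INT_MAX)
            return i;            /* displs[i] cannot be represented as int */
        displs[i] = (int)offset; /* safe: offset <= INT_MAX here */
        offset += counts[i];
    }
    return -1;
}
```

With three ranks that each contribute 1 << 30 elements, the third displacement would be 2^31, one past INT_MAX on a platform with 32-bit int, so build_displs returns 2. A user-side check of this kind could run before the displacement array is handed to an irregular collective such as MPI_Gatherv.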
Year
2018
DOI
10.1109/IPDPSW.2018.00076
Venue
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Keywords
MPI, Scaling problem, Workaround
Field
Integer overflow, Degree of parallelism, Computer science, Parallel processing, Software bug, Parallel computing, Scaling, Root cause, Distributed computing, Systematic testing
DocType
Conference
ISSN
2164-7062
ISBN
978-1-5386-5556-6
Citations
0
PageRank
0.34
References
15
Authors
4
Name
Order
Citations
PageRank
Hongbo Li120430.18
Zizhong Chen292469.93
Rajiv Gupta34301364.53
Min Xie4207.20