Title
Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling
Abstract
Achieving scalable performance for parallel QR factorization on massively parallel manycore systems requires a comprehensive design that combines algorithm redesign, an efficient runtime system, reduced synchronization and communication, and analytical performance modeling. This paper presents tiled communication-avoiding QR factorization software that scales efficiently for matrices of general dimensions. We design a tiled communication-avoiding QR factorization algorithm and implement it on a fully distributed dynamic scheduling runtime system to minimize both synchronization and communication. All communication-avoiding QR factorization algorithms depend on an important parameter D (the number of domains), whose optimal value is not known in advance and has so far been found only through manual tuning and empirical search. To address this, we introduce a simplified analytical performance model that determines an optimal number of domains D*. Experimental results show that our new parallel implementation is up to 30% faster than a state-of-the-art multicore-based numerical library and up to 30 times faster than ScaLAPACK on thousands of CPU cores. Furthermore, using the analytical model to predict the number of domains is competitive with exhaustive search, with an average performance difference of 1%.
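To illustrate the role of the domain count D, the following is a minimal sketch of a one-level communication-avoiding (TSQR-style) reduction in NumPy. It is not the paper's tiled algorithm or its runtime system: the function name `tsqr_r`, the contiguous row split, and the single flat merge step are assumptions made for illustration only.

```python
import numpy as np


def tsqr_r(A, num_domains):
    """Return the R factor of A via a flat, one-level TSQR reduction.

    `num_domains` plays the role of D: the rows of A are split into that
    many contiguous blocks (hypothetical layout, chosen for simplicity).
    """
    # Split the rows into `num_domains` blocks, one per domain.
    blocks = np.array_split(A, num_domains, axis=0)

    # Step 1: factor each domain independently. In a parallel setting these
    # local factorizations need no communication between domains.
    local_rs = [np.linalg.qr(block, mode="r") for block in blocks]

    # Step 2: stack the small local R factors and factor them once more.
    # Only these small factors would have to be exchanged between domains.
    stacked = np.vstack(local_rs)
    return np.linalg.qr(stacked, mode="r")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 4))
    R = tsqr_r(A, num_domains=8)
    # R should satisfy R^T R = A^T A, as any valid R factor of A does.
    print(np.allclose(R.T @ R, A.T @ A))
```

In this sketch, a larger D gives more independent local factorizations but a taller stacked matrix in the merge step, which is the kind of trade-off that makes the choice of D matter.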
Year
2018
DOI
10.1142/S0129626418500044
Venue
PARALLEL PROCESSING LETTERS
Keywords
High performance computing, numerical libraries, analytical performance modeling
DocType
Journal
Volume
28
Issue
1
ISSN
0129-6264
Citations
0
PageRank
0.34
References
3
Authors
4
Name
Order
Citations
PageRank
Weijian Zheng101.69
Fengguang Song222.42
Lan Lin348.21
Zizhong Chen492469.93