Title
Portable Parallel Implementation of BLAS-3
Abstract
The use of a massively parallel machine is aimed at the development of application programs that solve the most significant scientific, engineering, industrial and commercial problems. High-performance computing technology has emerged as a powerful and indispensable aid to scientific and engineering research, product and process development, and all aspects of manufacturing. Such computational power can be achieved only by massively parallel computers. It also requires a new and more effective mode of interaction between the computational sciences and applications and those parts of computer science concerned with the development of algorithms and software. We are interested in using parallel processing to handle large numerical tasks such as linear algebra problems. Yet, programming such systems has proven to be complicated, error-prone and architecture-specific. One successful method for alleviating this problem, a method that worked well in the case of the massively pipelined supercomputers, is to use subprogram libraries. These libraries are built to perform some basic operations efficiently, while hiding low-level system specifics from the programmer. Efficiently porting a library to new hardware, be it a vector machine or a shared memory or message passing based multiprocessor, is a major undertaking. It is a slow process that requires an intimate knowledge of the hardware features and optimization issues. We propose a scheme for the creation of portable implementations of such libraries. We present an implementation of BLAS (basic linear algebra subprograms), which is used as a standard linear algebra library. Our parallel implementation uses the virtual machine for multiprocessors (VMMP) (1990), a software package that provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing.
VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. Software created using this scheme is automatically efficient on both categories of MIMD machines, and on any hardware VMMP has been ported to. An additional level of abstraction is achieved using the programming language C++, an object-oriented language (Eckel, 1989; Stroustrup, 1986). From the programmer using BLAS-3, this layer hides both the data structures used to define linear algebra objects and the parallel nature of the operations performed on these objects. We coded BLAS on top of VMMP. This code was run without any modifications on two shared memory machines: the commercial Sequent Symmetry and the experimental Taunop. (The code should run on any machine VMMP was ported onto, given the availability of a C++ compiler.) Performance results for this implementation are given. The speed-up of the BLAS-3 routines, tested on 22 processors of the Sequent, was in the range of 8.68 to 15.89. Application programs (e.g. Cholesky factorization) using the library routines achieved similar efficiency.
Year
1994
DOI
10.1002/cpe.4330060503
Venue
CONCURRENCY-PRACTICE AND EXPERIENCE
DocType
Journal
Volume
6
Issue
5
ISSN
1040-3108
Citations
0
PageRank
0.34
References
13
Authors
4
Name | Order | Citations | PageRank
Amir Averbuch | 1 | 5697 | 3.41
Dganit Amitai | 2 | 13 | 2.50
R. Friedman | 3 | 0 | 0.34
Eran Gabber | 4 | 5491 | 12.28