Abstract | | |
---|---|---|
In this work, we study ways of exploiting the parallelism offered by vectorization on accelerators with very wide vector units. To this end, we implement two kernels derived from the Wilson Dslash operator and investigate several data layout techniques for increasing the scalability of lattice QCD scientific kernels suitable for the Intel Xeon Phi. In parts of the application where computation uses real numbers, we see a 6.6x increase in bandwidth compared to scalar code, thanks to compiler auto-vectorization. In other kernels, where arithmetic on complex numbers dominates, our hand-vectorized code outperforms the compiler's auto-vectorization. We find that our proposed Hopping Vector-friendly Ordering allows for more efficient vectorization of complex floating-point arithmetic. Using this data layout, we increase the sustained bandwidth by approximately 1.8x. |
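
The abstract's point about data layout can be illustrated with a minimal sketch. It contrasts two generic ways of storing an array of complex numbers: interleaved (real and imaginary parts alternating) and split (separate real and imaginary arrays). This is not the paper's Hopping Vector-friendly Ordering, whose details are not given here. The function names and sample values are hypothetical, and plain Python only shows the access pattern, not actual SIMD execution.

```python
# Illustrative only: two generic layouts for an array of complex numbers.
# This is NOT the paper's Hopping Vector-friendly Ordering; all names are hypothetical.

def multiply_interleaved(a, b):
    """Element-wise complex product; a and b are flat lists [re0, im0, re1, im1, ...]."""
    out = [0.0] * len(a)
    for i in range(0, len(a), 2):
        ar, ai = a[i], a[i + 1]
        br, bi = b[i], b[i + 1]
        # Real and imaginary results land in adjacent slots, so a wide vector
        # unit would need shuffles to line up the operands.
        out[i] = ar * br - ai * bi
        out[i + 1] = ar * bi + ai * br
    return out

def multiply_split(a_re, a_im, b_re, b_im):
    """Same product with real and imaginary parts kept in separate lists."""
    # Each output list is built from uniform, stride-1 operations over whole
    # arrays, which is the pattern wide vector units handle most directly.
    out_re = [ar * br - ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    out_im = [ar * bi + ai * br for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    return out_re, out_im

def to_split(flat):
    """Convert an interleaved list into (real_parts, imaginary_parts)."""
    return flat[0::2], flat[1::2]

a = [1.0, 2.0, 3.0, -1.0]  # (1+2i), (3-1i)
b = [0.5, 1.0, 2.0, 4.0]   # (0.5+1i), (2+4i)
inter = multiply_interleaved(a, b)
split_re, split_im = multiply_split(*to_split(a), *to_split(b))
```

Both layouts produce the same products. The difference is only in how operands sit in memory, which is the kind of choice the abstract reports as affecting vectorization efficiency.
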
Year | DOI | Venue |
---|---|---|
2016 | 10.1109/PDP.2016.116 | 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP) |
Keywords | Field | DocType |
---|---|---|
Lattice QCD, Xeon Phi, many-cores, accelerators | Kernel (linear algebra), Xeon Phi, Floating point, Computer science, Parallel computing, Vectorization (mathematics), Compiler, Lattice QCD, Coprocessor, Scalability | Conference |
ISSN | Citations | PageRank |
---|---|---|
1066-6192 | 1 | 0.35 |
References | Authors | |
---|---|---|
7 | 3 | |
Name | Order | Citations | PageRank |
---|---|---|---|
Andreas Diavastos | 1 | 9 | 4.48 |
Giannos Stylianou | 2 | 1 | 1.03 |
Giannis Koutsou | 3 | 1 | 0.69 |