Abstract | ||
---|---|---|
Current processors incorporate wide and powerful vector units whose optimal exploitation is crucial to reach peak performance. However, present autovectorizing compilers fall short of that goal. Exploiting some vector instructions requires aggressive approaches that are not affordable in production compilers. Thus, advanced programmers pursuing the best performance from their applications are compelled to manually vectorize them using low-level SIMD intrinsics. We propose a user-directed code optimization that targets overlapped vector loads, i.e., vector loads that read scalar elements redundantly from memory. Instead, our optimization loads these elements once and combines them using advanced register-to-register vector instructions. This code is potentially more efficient and it uses advanced vector instructions that compilers do not widely exploit automatically. We also extend the OpenMP* SIMD directives with a new clause called overlap that allows users to easily enable and tune this optimization on demand. We implement our proposal for the Intel® Xeon Phi™ coprocessor. Our evaluation shows up to 29% speed-up over five highly optimized stencil kernels and workloads from real-world applications. Results also demonstrate how important user hints are to maximize performance. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1145/2751205.2751224 | International Conference on Supercomputing |
Keywords | Field | DocType |
SIMD, Vectorization, Compiler Optimization, OpenMP, Stencil, Intel Many Integrated Core Architecture | Program optimization, Computer science, Xeon Phi, Parallel computing, SIMD, Vectorization (mathematics), Optimizing compiler, Compiler, Coprocessor, Intrinsics | Conference |
Citations | PageRank | References |
4 | 0.39 | 17 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Diego Caballero | 1 | 20 | 2.51 |
Sara Royuela | 2 | 24 | 5.23 |
Roger Ferrer | 3 | 39 | 4.04 |
Alejandro Duran | 4 | 943 | 61.43 |
Xavier Martorell | 5 | 1470 | 125.40 |