Title
On How to Accelerate Iterative Stencil Loops: A Scalable Streaming-Based Approach
Abstract
In high-performance systems, stencil computations play a crucial role as they appear in a variety of different fields of application, ranging from partial differential equation solving, to computer simulation of particles’ interaction, to image processing and computer vision. The computationally intensive nature of those algorithms created the need for solutions to efficiently implement them in order to save both execution time and energy. This, in combination with their regular structure, has justified their widespread study and the proposal of largely different approaches to their optimization. However, most of these works are focused on aggressive compile time optimization, cache locality optimization, and parallelism extraction for the multicore/multiprocessor domain, while fewer works are focused on the exploitation of custom architectures to further exploit the regular structure of Iterative Stencil Loops (ISLs), specifically with the goal of improving power efficiency. This work introduces a methodology to systematically design power-efficient hardware accelerators for the optimal execution of ISL algorithms on Field-programmable Gate Arrays (FPGAs). As part of the methodology, we introduce the notion of Streaming Stencil Time-step (SST), a streaming-based architecture capable of achieving both low resource usage and efficient data reuse thanks to an optimal data buffering strategy, and we introduce a technique called SSTs queuing that is capable of delivering a pseudolinear execution time speedup with constant bandwidth. The methodology has been validated on significant benchmarks on a Virtex-7 FPGA using the Xilinx Vivado suite. Results demonstrate how the efficient usage of the on-chip memory resources realized by an SST allows one to treat problem sizes whose implementation would otherwise not be possible via direct synthesis of the original, unmanipulated code via High-Level Synthesis (HLS). We also show how the SSTs queuing effectively ensures a pseudolinear throughput speedup while consuming constant off-chip bandwidth.
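As a rough software analogy of the ideas named in the abstract, the sketch below models one time-step as a streaming stage that keeps only a small sliding window, and models queuing by chaining several such stages so the input is traversed once. It is not the paper's hardware design: the 1D three-point averaging stencil, the pass-through boundary handling, and the names `reference_isl`, `streaming_time_step`, and `queued_time_steps` are all assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Iterator, List


def reference_isl(grid: List[float], steps: int) -> List[float]:
    """Plain iterative stencil loop: sweep the whole grid once per time-step."""
    cur = list(grid)
    for _ in range(steps):
        nxt = cur[:]  # boundary cells are copied unchanged
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def streaming_time_step(stream: Iterable[float]) -> Iterator[float]:
    """One time-step as a streaming stage (hypothetical stand-in for an SST).

    Only the last three input values are buffered, so storage does not
    grow with the problem size in this 1D example.
    """
    window = deque(maxlen=3)
    for value in stream:
        window.append(value)
        if len(window) == 1:
            yield value  # left boundary passes through unchanged
        elif len(window) == 3:
            yield (window[0] + window[1] + window[2]) / 3.0
    if len(window) >= 2:
        yield window[-1]  # right boundary passes through unchanged


def queued_time_steps(grid: Iterable[float], steps: int) -> List[float]:
    """Chain `steps` streaming stages; the input is read exactly once."""
    stream: Iterable[float] = iter(grid)
    for _ in range(steps):
        stream = streaming_time_step(stream)
    return list(stream)


if __name__ == "__main__":
    data = [float(i * i) for i in range(8)]
    print(queued_time_steps(data, 4) == reference_isl(data, 4))
```

In this model each chained stage consumes the previous stage's output as it is produced, so adding stages adds time-steps per pass without adding reads of the original input, which loosely mirrors the constant-bandwidth property the abstract attributes to SSTs queuing.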
Year: 2016
DOI: 10.1145/2842615
Venue: ACM Transactions on Architecture and Code Optimization (TACO)
Keywords: FPGAs, power efficiency
Field: Computer science, Compile time, Parallel computing, Stencil, Field-programmable gate array, Multiprocessing, Real-time computing, Throughput, Multi-core processor, Speedup, Scalability
DocType: Journal
Volume: 12
Issue: 4
ISSN: 1544-3566
Citations: 12
PageRank: 0.80
References: 39
Authors: 5
Name                    Order  Citations  PageRank
Riccardo Cattaneo       1      57         9.14
Giuseppe Natale         2      12         0.80
Carlo Sicignano         3      12         0.80
D. Sciuto               4      1720       176.61
Marco D. Santambrogio   5      771        91.15