On How to Accelerate Iterative Stencil Loops: A Scalable Streaming-Based Approach - Citegraph

Paper Info

Title
On How to Accelerate Iterative Stencil Loops: A Scalable Streaming-Based Approach

Abstract
In high-performance systems, stencil computations play a crucial role as they appear in a variety of different fields of application, ranging from partial differential equation solving, to computer simulation of particles’ interaction, to image processing and computer vision. The computationally intensive nature of those algorithms created the need for solutions to efficiently implement them in order to save both execution time and energy. This, in combination with their regular structure, has justified their widespread study and the proposal of largely different approaches to their optimization. However, most of these works are focused on aggressive compile time optimization, cache locality optimization, and parallelism extraction for the multicore/multiprocessor domain, while fewer works are focused on the exploitation of custom architectures to further exploit the regular structure of Iterative Stencil Loops (ISLs), specifically with the goal of improving power efficiency. This work introduces a methodology to systematically design power-efficient hardware accelerators for the optimal execution of ISL algorithms on Field-programmable Gate Arrays (FPGAs). As part of the methodology, we introduce the notion of Streaming Stencil Time-step (SST), a streaming-based architecture capable of achieving both low resource usage and efficient data reuse thanks to an optimal data buffering strategy, and we introduce a technique called SSTs queuing that is capable of delivering a pseudolinear execution time speedup with constant bandwidth. The methodology has been validated on significant benchmarks on a Virtex-7 FPGA using the Xilinx Vivado suite. Results demonstrate how the efficient usage of the on-chip memory resources realized by an SST allows one to treat problem sizes whose implementation would otherwise not be possible via direct synthesis of the original, unmanipulated code via High-Level Synthesis (HLS). 
We also show how the SSTs queuing effectively ensures a pseudolinear throughput speedup while consuming constant off-chip bandwidth.
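
As a rough software analogy only (not the hardware design from the paper), the sketch below shows what an iterative stencil loop looks like and how a single time-step can be rewritten as a stream stage that buffers just a small window of its input. Chaining several such stages loosely mirrors the idea of queuing SSTs: the input is read once, and each stage feeds the next. The 3-point averaging stencil, the fixed boundary cells, and the names `reference_isl`, `streaming_step`, and `queued_steps` are all assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Iterator, List


def reference_isl(grid: List[float], steps: int) -> List[float]:
    """Plain iterative stencil loop: 3-point average, boundary cells kept fixed."""
    cur = list(grid)
    for _ in range(steps):
        nxt = list(cur)
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def streaming_step(stream: Iterable[float]) -> Iterator[float]:
    """One time-step as a stream stage that holds only a 3-element window."""
    window = deque(maxlen=3)  # hypothetical stand-in for an on-chip buffer
    for value in stream:
        window.append(value)
        if len(window) == 1:
            yield value  # left boundary passes through unchanged
        elif len(window) == 3:
            # The window now covers cells i-1, i, i+1 of the previous time-step.
            yield (window[0] + window[1] + window[2]) / 3.0
    if len(window) >= 2:
        yield window[-1]  # right boundary passes through unchanged


def queued_steps(stream: Iterable[float], steps: int) -> Iterator[float]:
    """Chain `steps` stream stages; the original input is consumed only once."""
    for _ in range(steps):
        stream = streaming_step(stream)
    return iter(stream)


grid = [0.0, 0.0, 9.0, 0.0, 0.0]
# The chained streaming stages reproduce the plain loop's result.
assert list(queued_steps(grid, 3)) == reference_isl(grid, 3)
```

In this toy version each stage stores three values regardless of the grid length, which hints at why a streaming formulation can keep buffer usage low; the buffering strategy and resource figures of the actual architecture are described in the paper itself and are not modeled here.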

Year: 2016
DOI: 10.1145/2842615
Venue: ACM Transactions on Architecture and Code Optimization (TACO)
Keywords: FPGAs, power efficiency
Field: Computer science, Compile time, Parallel computing, Stencil, Field-programmable gate array, Multiprocessing, Real-time computing, Throughput, Multi-core processor, Speedup, Scalability
DocType: Journal
Volume: 12
Issue: 4
ISSN: 1544-3566
Citations: 12
PageRank: 0.80
References: 39
Authors
5

Name	Order	Citations	PageRank
Riccardo Cattaneo	1	57	9.14
Giuseppe Natale	2	12	0.80
Carlo Sicignano	3	12	0.80
D. Sciuto	4	1720	176.61
Marco D. Santambrogio	5	771	91.15
