Abstract
---
Training convolutional neural networks (CNNs) on embedded platforms to support on-device learning has become increasingly important. Designing flexible training hardware is considerably more challenging than inference hardware, due to higher design complexity and larger computation/memory requirements. In this work, we present an automatic compiler-based FPGA accelerator with 16-bit fixed-point precision for complete CNN training, including the Forward Pass (FP), Backward Pass (BP), and Weight Update (WU). We implemented an optimized RTL library to perform training-specific tasks and developed an RTL compiler to automatically generate FPGA-synthesizable RTL based on user-defined constraints. We present a new cyclic weight storage/access scheme for on-chip BRAM and off-chip DRAM to efficiently implement non-transpose and transpose operations during the FP and BP phases, respectively. Representative CNNs for the CIFAR-10 dataset are implemented and trained on an Intel Stratix 10 GX FPGA using the proposed hardware architecture, demonstrating up to 479 GOPS of performance.
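The non-transpose vs. transpose weight access mentioned in the abstract can be illustrated with a minimal NumPy sketch. This shows the general backpropagation relationship that motivates such a scheme, not the paper's actual RTL storage layout: the backward pass reads the same weights with input/output channel axes swapped and kernels rotated 180 degrees.

```python
import numpy as np

# Hedged illustration (not the paper's cyclic BRAM/DRAM scheme):
# FP convolves activations with W[out_ch, in_ch, kh, kw]; BP of the
# input gradient uses the same weights with channel axes swapped and
# each kernel spatially flipped, hence the need for "transpose" access.
W = np.arange(2 * 3 * 3 * 3).reshape(2, 3, 3, 3).astype(np.float32)

# Non-transpose view used in the forward pass
W_fp = W

# Transpose view used in the backward pass:
# swap (out_ch, in_ch) axes, then flip the kernel 180 degrees
W_bp = np.flip(W.transpose(1, 0, 2, 3), axis=(2, 3))

assert W_bp.shape == (3, 2, 3, 3)
# Element-wise: W_bp[ic, oc, y, x] == W[oc, ic, 2 - y, 2 - x]
assert W_bp[1, 0, 0, 0] == W[0, 1, 2, 2]
```

A hardware accelerator that stores weights only in forward-pass order must either physically rearrange them between phases or, as the paper proposes, use a storage/access scheme that serves both views without data movement.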
Year | DOI | Venue
---|---|---
2019 | 10.1109/FPL.2019.00034 | 2019 29th International Conference on Field Programmable Logic and Applications (FPL)

Keywords | Field | DocType
---|---|---
Convolution neural networks, neural network training, back-propagation, hardware accelerator, FPGA | Stratix, Computer architecture, Computer science, Convolutional neural network, Parallel computing, Field-programmable gate array, Code generation, Compiler, Artificial intelligence, Hardware acceleration, Deep learning, Hardware architecture | Conference

ISSN | ISBN | Citations
---|---|---
1946-147X | 978-1-7281-4885-4 | 2

PageRank | References | Authors
---|---|---
0.46 | 0 | 7
Name | Order | Citations | PageRank
---|---|---|---
Shreyas Kolala Venkataramanaiah | 1 | 2 | 1.13 |
Yu-Fei Ma | 2 | 1166 | 63.05 |
Shihui Yin | 3 | 71 | 10.03 |
Eriko Nurvitadhi | 4 | 399 | 33.08 |
Aravind Dasu | 5 | 10 | 4.47 |
Yu Cao | 6 | 2765 | 245.91 |
Jae-sun Seo | 7 | 536 | 56.32 |