Title
Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on Processing-in-Memory Architectures
Abstract
Recent state-of-the-art deep convolutional neural networks (CNNs) have shown remarkable success in intelligent systems for various tasks, such as image/speech recognition and classification. A number of recent efforts have attempted to design custom inference engines based on the processing-in-memory (PIM) architecture, where the memory array performs the weighted-sum computation, thereby avoiding frequent data transfers between buffers and computation units. Prior PIM designs typically unroll each 3D kernel of a convolutional layer into a vertical column of a large weight matrix, so the input data must be accessed multiple times. In this paper, to maximize both weight and input data reuse in the PIM architecture, we propose a novel weight mapping method and a corresponding data flow that divide the kernels and assign the input data to different processing elements (PEs) according to their spatial locations. As a case study, a resistive random access memory (RRAM) based 8-bit PIM design at 32 nm is benchmarked. The proposed mapping method and data flow yield a $\sim 2.03\times$ speedup in throughput and a $\sim 1.4\times$ improvement in energy efficiency for ResNet-34, compared with a prior design based on the conventional mapping method.
To further optimize hardware performance and throughput, we propose an optimal pipeline architecture which, at the cost of ~50% area overhead, achieves overall $913\times$ and $1.96\times$ improvements in throughput and energy efficiency, reaching 132476 FPS and 20.1 TOPS/W, respectively.
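The conventional mapping the abstract contrasts against can be illustrated with a minimal NumPy sketch: each 3D kernel of a convolutional layer is flattened into one vertical column of a weight matrix, and every input patch is read out once per output position, which is the repeated input access the proposed mapping avoids. This is an illustrative sketch under assumed array shapes, not the paper's implementation.

```python
import numpy as np

def unroll_kernels(kernels):
    # kernels: (K, C, R, S) -> weight matrix of shape (C*R*S, K).
    # Each 3D kernel becomes one vertical column, as in the
    # conventional PIM weight mapping described in the abstract.
    K = kernels.shape[0]
    return kernels.reshape(K, -1).T

def im2col(x, R, S):
    # x: (C, H, W) input feature map; stride 1, no padding (assumed).
    # Each row is one flattened input patch, matching the C,R,S
    # flattening order used for the kernel columns above.
    C, H, W = x.shape
    cols = []
    for i in range(H - R + 1):
        for j in range(W - S + 1):
            cols.append(x[:, i:i+R, j:j+S].ravel())
    return np.stack(cols)  # (num_output_positions, C*R*S)

# The weighted sum the memory array would compute: one row of input
# patches times the column-unrolled kernels. The same patch data is
# re-read for every kernel column, which is the input-reuse cost the
# proposed spatial mapping targets.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 6))          # C=3, H=W=6 (illustrative)
w = rng.standard_normal((4, 3, 3, 3))       # K=4 kernels, R=S=3
out = im2col(x, 3, 3) @ unroll_kernels(w)   # (16, 4): 4x4 positions, 4 kernels
```

Reshaping `out` to (4, 4, 4) and permuting axes recovers the usual output feature map layout; the matrix-vector product per row is what a single crossbar read performs in analog PIM.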
Year
2020
DOI
10.1109/TCSI.2019.2958568
Venue
IEEE Transactions on Circuits and Systems I: Regular Papers
Keywords
Kernel, Random access memory, Three-dimensional displays, Arrays, Throughput, System-on-chip
DocType
Journal
Volume
67
Issue
7
ISSN
1549-8328
Citations
4
PageRank
0.50
References
0
Authors
3
Name           Order  Citations  PageRank
Xiaochen Peng  1      61         12.17
Rui Liu        2      47         5.32
Shimeng Yu     3      4905       6.22