Title
Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks
Abstract
Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industries and markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enabler of this rather quick and invasive shift in the industry. To that end, mostly accelerator-based INFaaS (Google's TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications. However, as the demand for such services grows, merely scaling out the number of accelerators is not cost-effective. Although multi-tenancy has propelled datacenter scalability, it has not been a primary factor in designing DNN accelerators due to the arms race for higher speed and efficiency. This paper sets out to explore this timely requirement of multi-tenancy through a new dimension: dynamic architecture fission. To that end, we define Planaria, which can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime. This microarchitectural capability enables spatially co-locating multiple DNN inference services on the same hardware, offering simultaneous multi-tenant DNN acceleration. To realize this dynamic reconfigurability, we first devise breakable omni-directional systolic arrays for DNN acceleration that allow omni-directional flow of data. Second, we use this capability and a unique organization of on-chip memory, interconnection, and compute resources to enable fission in systolic-array-based DNN accelerators. Architecture fission and its associated flexibility enable an extra degree of freedom for task scheduling that even allows breaking the accelerator with regard to the server load, DNN topology, and task priority. As such, Planaria can simultaneously co-locate DNNs to enhance utilization, throughput, QoS, and fairness.
We compare the proposed design to PREMA [4], a recent effort that offers multi-tenancy by time-multiplexing the DNN accelerator across multiple tasks. We use the same frequency and the same amount of compute and memory resources for both accelerators. The results show significant benefits with (soft, medium, hard) QoS requirements, in throughput (7.4×, 7.2×, 12.2×), SLA satisfaction rate (45%, 15%, 16%), and fairness (2.1×, 2.3×, 1.9×).
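The abstract does not spell out how the scheduler decides where to break the accelerator, so the following is only a toy sketch of the general idea of spatial co-location: a fixed pool of subarrays is divided among concurrently running tasks instead of being handed whole to one task at a time. The `Task` class, the `fission_allocate` function, the priority scale, and the pool size of 16 subarrays are all hypothetical names and values chosen for illustration. The proportional largest-remainder split is a stand-in technique, not the paper's scheduling algorithm.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str      # hypothetical DNN inference request
    priority: int  # larger means more important (assumed scale, must be > 0)


def fission_allocate(tasks: list[Task], total_subarrays: int) -> dict[str, int]:
    """Split a fixed pool of subarrays among co-located tasks.

    Every task gets at least one subarray; the remaining subarrays are
    handed out in proportion to priority (largest-remainder method).
    This is an illustrative policy, not the scheduler from the paper.
    """
    if not tasks:
        return {}
    if any(t.priority <= 0 for t in tasks):
        raise ValueError("priorities must be positive")
    if len(tasks) > total_subarrays:
        raise ValueError("more tasks than subarrays; some must wait")

    alloc = {t.name: 1 for t in tasks}
    spare = total_subarrays - len(tasks)
    weight = sum(t.priority for t in tasks)

    # Whole share of the spare subarrays, plus the remainder of the division.
    shares = {t.name: divmod(spare * t.priority, weight) for t in tasks}
    for name, (whole, _) in shares.items():
        alloc[name] += whole

    # Hand any leftover subarrays to the tasks with the largest remainders.
    leftover = total_subarrays - sum(alloc.values())
    by_remainder = sorted(tasks, key=lambda t: shares[t.name][1], reverse=True)
    for t in by_remainder[:leftover]:
        alloc[t.name] += 1
    return alloc


if __name__ == "__main__":
    tenants = [Task("A", 1), Task("B", 2), Task("C", 5)]
    print(fission_allocate(tenants, total_subarrays=16))
```

Under a time-multiplexed scheme such as the one the abstract attributes to PREMA, the equivalent allocation would give all subarrays to a single task per time slice. The sketch only illustrates the extra scheduling freedom that comes from being able to partition the hardware among tenants.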
Year: 2020
DOI: 10.1109/MICRO50266.2020.00062
Venue: 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
Keywords: Accelerators, Deep Neural Networks, DNN, DNN Acceleration, Multi-Tenancy, Spatial DNN Task Co-Location, Multi-Tenant DNN Acceleration, Dynamic Architecture Fission, Omni-Directional Systolic Arrays
DocType: Conference
ISBN: 978-1-7281-7384-9
Citations: 5
PageRank: 0.44
References: 51
Authors: 12
Name                      Order  Citations  PageRank
Soroush Ghodrati          1      13         1.94
Byung Hoon Ahn            2      5          0.44
Joon Kyung Kim            3      60         2.38
Sean Kinzer               4      7          2.14
Brahmendra Reddy Yatham   5      5          0.44
Navateja Alla             6      5          0.44
Hardik Sharma             7      5          0.44
Alian, M.                 8      38         6.22
Eiman Ebrahimi            9      479        17.23
Nam Sung Kim              10     3268       225.99
Cliff Young               11     411        46.31
H. Esmaeilzadeh           12     1443       69.71