Title
Scale-out acceleration for machine learning.
Abstract
The growing scale and complexity of Machine Learning (ML) algorithms have resulted in the prevalent use of distributed general-purpose systems. In a rather disjoint effort, the community has focused mostly on high-performance single-node accelerators for learning. This work bridges these two paradigms with CoSMIC, a full computing stack comprising a language, compiler, system software, template architecture, and circuit generators, which together enable programmable acceleration of learning at scale. CoSMIC enables programmers to exploit scale-out acceleration using FPGAs and Programmable ASICs (P-ASICs) from a high-level, mathematical Domain-Specific Language (DSL), without delving into the onerous tasks of system software development or hardware design. CoSMIC achieves the three conflicting objectives of efficiency, automation, and programmability by integrating a novel multi-threaded template accelerator architecture with a cohesive stack that generates the hardware and software code from its high-level DSL. CoSMIC can accelerate a wide range of learning algorithms that are most commonly trained using parallel variants of gradient descent. The key is to distribute the partial gradient calculations of the learning algorithms across the accelerator-augmented nodes of the scale-out system. Additionally, CoSMIC leverages the parallelizability of the algorithms to offer multi-threaded acceleration within each node. Multi-threading allows CoSMIC to efficiently exploit the abundant resources of modern FPGAs/P-ASICs by striking a balance between multi-threaded parallelism and single-threaded performance. CoSMIC takes advantage of algorithmic properties of ML to offer specialized system software that optimizes task allocation, role assignment, thread management, and internode communication. We evaluate the versatility and efficiency of CoSMIC on 10 machine learning applications from various domains. On average, a 16-node CoSMIC system with UltraScale+ FPGAs offers an 18.8× speedup over a 16-node Spark system with Xeon processors, while the programmer writes only 22–55 lines of code. CoSMIC also offers higher scalability than the state-of-the-art Spark: scaling from 4 to 16 nodes yields a 2.7× improvement with CoSMIC, whereas Spark offers 1.8×. These results confirm that CoSMIC's full-stack approach is an effective and vital step toward enabling scale-out acceleration for machine learning.
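The core mechanism the abstract describes lends itself to a short illustration. Below is a minimal Python/NumPy sketch of the data-parallel gradient-descent pattern: each node computes a partial gradient on its shard of the training data, the partial gradients are aggregated, and the shared model is updated. This is an assumption-laden sketch, not CoSMIC's DSL, runtime, or hardware: the function names (partial_gradient, distributed_gradient_descent) are invented here, linear regression stands in for the accelerated learning algorithms, and the "nodes" run sequentially in one process.

# Minimal sketch (not CoSMIC's DSL or runtime) of the parallel
# gradient-descent pattern from the abstract: each node computes a
# partial gradient on its data shard, the partial gradients are
# aggregated, and a single update is applied to the shared model.
import numpy as np

def partial_gradient(w, X_shard, y_shard):
    """Gradient of 0.5 * ||X w - y||^2 over one node's data shard."""
    return X_shard.T @ (X_shard @ w - y_shard)

def distributed_gradient_descent(X, y, num_nodes=16, lr=0.5, epochs=100):
    # Partition the training data across the scale-out nodes.
    X_shards = np.array_split(X, num_nodes)
    y_shards = np.array_split(y, num_nodes)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Each accelerator-augmented node would compute its partial
        # gradient locally; here the "nodes" run sequentially.
        grads = [partial_gradient(w, Xs, ys)
                 for Xs, ys in zip(X_shards, y_shards)]
        # Aggregate the partial gradients (the internode communication
        # step) and apply one update to the shared model.
        w -= lr * np.sum(grads, axis=0) / len(X)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1024, 8))
    w_true = rng.normal(size=8)
    y = X @ w_true + 0.01 * rng.normal(size=1024)
    w = distributed_gradient_descent(X, y)
    print("recovery error:", np.linalg.norm(w - w_true))

Because the summed partial gradients equal the full-batch gradient, the distributed update is mathematically identical to single-node gradient descent; the scale-out benefit comes from computing the shards in parallel.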
Year
2017
DOI
10.1145/3123939.3123979
Venue
MICRO-50: The 50th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, Massachusetts, October 2017
Keywords
Accelerator, scale-out, distributed, cloud, machine learning
Field
Spark, Computer science, Real-time computing, Software, Artificial intelligence, Source lines of code, Speedup, System software, Programmer, Parallel computing, Compiler, Machine learning, Scalability
DocType
Conference
ISSN
1072-4451
ISBN
978-1-4503-4952-9
Citations
17
PageRank
0.62
References
50
Authors
6
Name             Order  Citations  PageRank
Jongse Park      1      303        12.47
Hardik Sharma    2      86         3.00
Divya Mahajan    3      122        6.40
Joon Kyung Kim   4      60         2.38
Preston Olds     5      17         0.62
H. Esmaeilzadeh  6      1443       69.71