Managed communication and consistency for fast data-parallel iterative analytics - Citegraph

Paper Info

Title
Managed communication and consistency for fast data-parallel iterative analytics

Abstract
At the core of Machine Learning (ML) analytics is often an expert-suggested model, whose parameters are refined by iteratively processing a training dataset until convergence. The completion time (i.e., convergence time) and the quality of the learned model depend not only on the rate at which refinements are generated but also on the quality of each refinement. While data-parallel ML applications often employ a loose consistency model when updating shared model parameters to maximize parallelism, the accumulated error may seriously impact the quality of refinements and thus delay completion, a problem that usually gets worse with scale. Although more immediate propagation of updates reduces the accumulated error, this strategy is limited by physical network bandwidth. Additionally, the performance of the widely used stochastic gradient descent (SGD) algorithm is sensitive to step size. Simply increasing communication often fails to bring improvement unless the step size is tuned accordingly, and tedious hand tuning is usually needed to achieve optimal performance. This paper presents Bösen, a system that maximizes network communication efficiency under a given inter-machine network bandwidth budget to minimize parallel error, while ensuring theoretical convergence guarantees for large-scale data-parallel ML applications. Furthermore, Bösen prioritizes the messages most significant to algorithm convergence, which speeds convergence further. Finally, Bösen is the first distributed implementation of the recently presented adaptive revision algorithm, which provides orders-of-magnitude improvement over a carefully tuned fixed schedule of step size refinements for some SGD algorithms. Experiments on two clusters with up to 1024 cores show that our mechanism significantly improves upon static communication schedules.

Year	DOI	Venue
2015	10.1145/2806777.2806778	ACM Symposium on Cloud Computing (SoCC)
Field	DocType	Citations
Convergence (routing),Physical network,Stochastic gradient descent,Computer science,Real-time computing,Provisioning,Schedule,Bandwidth (signal processing),Consistency model,Analytics	Conference	35
PageRank	References	Authors
1.13	30	9

Authors (9 rows)

Name	Order	Citations	PageRank
Jinliang Wei	1	304	10.86
Wei Dai	2	333	12.77
Aurick Qiao	3	45	2.68
Qirong Ho	4	636	30.75
Henggang Cui	5	36	2.89
Gregory R. Ganger	6	4560	383.16
Phillip B. Gibbons	7	6863	624.14
Garth A. Gibson	8	2517	250.27
Eric P. Xing	9	7332	471.43
