Abstract |
---|
In machine learning (ML), a model's parameters are adjusted by iteratively processing a training dataset until convergence. Although data-parallel ML systems often exploit the error tolerance of ML algorithms to maximize parallelism, the synchronization of model parameters can still be delayed by slow workers, a problem that generally worsens at large scale. This paper presents a Bounded Asynchronous Parallel (BAP) model of computation that allows computations to use stale model parameters in order to reduce synchronization overheads, while still providing theoretical convergence guarantees for large-scale data-parallel ML applications. The model permits distributed workers to use stale parameters stored in a local cache instead of waiting for the Parameter Server (PS) to produce a new version, which significantly reduces the time workers spend waiting. Furthermore, the BAP model guarantees convergence of the ML algorithm by bounding the maximum staleness of the parameters. Experiments conducted on a 4-node cluster with up to 32 GPUs show that our model significantly improves the proportion of computing time relative to waiting time and yields a 1.2-2x speedup. We also discuss how to choose the staleness threshold given the tradeoff between efficiency and speed. |
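The core rule the abstract describes, letting a worker run ahead on cached stale parameters only while it stays within a bounded distance of the slowest worker, can be sketched as a small clock structure. This is an illustrative reconstruction, not code from the paper; the class and method names (`BoundedStalenessClock`, `tick`) are assumptions.

```python
class BoundedStalenessClock:
    """Minimal sketch of a bounded-staleness rule: a worker may advance
    its iteration clock only if it stays within `staleness` iterations
    of the slowest worker; otherwise it must block and wait for fresher
    parameters from the parameter server."""

    def __init__(self, num_workers: int, staleness: int):
        self.staleness = staleness
        self.clocks = [0] * num_workers  # per-worker iteration counters

    def tick(self, worker: int) -> bool:
        # Advancing is allowed only if the new clock would remain within
        # `staleness` iterations of the slowest worker's clock.
        if self.clocks[worker] + 1 - min(self.clocks) > self.staleness:
            return False  # blocked: would exceed the staleness bound
        self.clocks[worker] += 1
        return True
```

With `staleness=2` and two workers, the fast worker can complete two extra iterations on cached parameters before it blocks; once the slow worker catches up by one step, the fast worker is unblocked again. A threshold of 0 degenerates to fully synchronous (BSP-like) execution, while a very large threshold approaches fully asynchronous execution, which matches the efficiency/speed tradeoff the paper discusses.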
Year | DOI | Venue |
---|---|---|
2016 | 10.1109/HPCC-SmartCity-DSS.2016.131 | PROCEEDINGS OF 2016 IEEE 18TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS; IEEE 14TH INTERNATIONAL CONFERENCE ON SMART CITY; IEEE 2ND INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS) |
Keywords | Field | DocType
---|---|---|
distributed systems, Bounded Asynchronous Parallel, Bulk Synchronous Parallel, Total Asynchronous Parallel, stale parameters, tradeoff | Convergence (routing), Asynchronous communication, Data modeling, Synchronization, Computer science, Cache, Server, Model of computation, Distributed computing, Bounding overwatch | Conference
Citations | PageRank | References
---|---|---|
0 | 0.34 | 0
Authors |
---|
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yabin Li | 1 | 0 | 0.34 |
Han Wan | 2 | 28 | 10.98 |
Bo Jiang | 3 | 271 | 19.91 |
Xiang Long | 4 | 7 | 2.17 |