Addressing the straggler problem for iterative convergent parallel ML. - Citegraph

Paper Info

Title
Addressing the straggler problem for iterative convergent parallel ML.

Abstract
FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine learning (ML). The frequent (e.g., per iteration) barriers used in traditional BSP-based distributed ML implementations cause every transient slowdown of any worker thread to delay all others. FlexRR combines a more flexible synchronization model with dynamic peer-to-peer re-assignment of work among workers to address straggler threads. Experiments with real straggler behavior observed on Amazon EC2 and Microsoft Azure, as well as injected straggler behavior stress tests, confirm the significance of the problem and the effectiveness of FlexRR's solution. Using FlexRR, we consistently observe near-ideal run-times (relative to no performance jitter) across all real and injected straggler behaviors tested.
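
The barrier effect described in the abstract can be illustrated with a minimal Python sketch. This is not FlexRR's implementation; the function names, worker count, and slowdown values below are hypothetical and chosen only for illustration. The sketch shows that with per-iteration barriers, each iteration lasts as long as its slowest worker, so a single transient slowdown delays all workers.

```python
def bsp_runtime(iteration_times):
    """Total run time under per-iteration barriers.

    iteration_times[i][w] is the time worker w needs in iteration i.
    A barrier makes every iteration last as long as its slowest worker.
    """
    return sum(max(per_worker) for per_worker in iteration_times)


def no_jitter_runtime(iteration_times, base_time):
    """Run time if every worker always took base_time (no performance jitter)."""
    return base_time * len(iteration_times)


# Hypothetical workload: 4 workers, 3 iterations, 1.0 time unit per iteration.
# Worker 2 is transiently slow (4x) in iteration 1 only.
times = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 4.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

bsp_total = bsp_runtime(times)
ideal_total = no_jitter_runtime(times, base_time=1.0)
print(bsp_total, ideal_total)  # prints "6.0 3.0"
```

In this made-up workload, one 4x slowdown in a single iteration doubles the total run time (6.0 versus 3.0 time units), even though three of the four workers were never slow.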

Year	2016
DOI	10.1145/2987550.2987554
Venue	SoCC
Field	Synchronization, Computer science, Parallel computing, Thread (computing), Real-time computing, Artificial intelligence, Jitter, Deep learning, Scalability
DocType	Conference
Citations	24
PageRank	0.82
References	40
Authors	8

Authors (8 rows)

Name	Order	Citations	PageRank
Aaron Harlap	1	24	0.82
Henggang Cui	2	307	11.66
Wei Dai	3	333	12.77
Jinliang Wei	4	304	10.86
Gregory R. Ganger	5	4560	383.16
Phillip B. Gibbons	6	6863	624.14
Garth A. Gibson	7	2517	250.27
Eric P. Xing	8	7332	471.43
