Title
Addressing the straggler problem for iterative convergent parallel ML.
Abstract
FlexRR provides a scalable, efficient solution to the straggler problem for iterative machine learning (ML). The frequent (e.g., per iteration) barriers used in traditional BSP-based distributed ML implementations cause every transient slowdown of any worker thread to delay all others. FlexRR combines a more flexible synchronization model with dynamic peer-to-peer re-assignment of work among workers to address straggler threads. Experiments with real straggler behavior observed on Amazon EC2 and Microsoft Azure, as well as injected straggler behavior stress tests, confirm the significance of the problem and the effectiveness of FlexRR's solution. Using FlexRR, we consistently observe near-ideal run-times (relative to no performance jitter) across all real and injected straggler behaviors tested.
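The barrier effect described in the abstract can be illustrated with a small run-time model. This is a hedged sketch, not FlexRR's implementation: the function names, the toy timing matrix, and the assumption that work is perfectly divisible and can be re-assigned at no cost are all illustrative and do not come from the paper.

```python
def bsp_runtime(iteration_times):
    """Total run-time with a barrier at the end of every iteration.

    iteration_times[i][w] is the time worker w needs for iteration i.
    Because all workers wait at the barrier, each iteration lasts as
    long as its slowest worker.
    """
    return sum(max(per_worker) for per_worker in iteration_times)


def rebalanced_runtime(iteration_times):
    """Idealized run-time if work could be re-assigned freely (assumption).

    Each worker starts with one unit of work and processes it at rate
    1 / t_w. If the n units are split in proportion to those rates,
    every worker finishes at n / sum(1 / t_w).
    """
    total = 0.0
    for per_worker in iteration_times:
        combined_rate = sum(1.0 / t for t in per_worker)
        total += len(per_worker) / combined_rate
    return total


# Hypothetical workload: 4 workers, 3 iterations, 1.0 time unit each,
# except that worker 0 is transiently 4x slower in the second iteration.
times = [
    [1.0, 1.0, 1.0, 1.0],
    [4.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

print("BSP with barriers:", bsp_runtime(times))
print("Idealized re-assignment:", rebalanced_runtime(times))
```

In this toy model a single slow worker in one iteration stretches that iteration for everyone under barriers, while the idealized re-assignment bound stays close to the no-jitter run-time of 3.0 time units.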
Year
2016
DOI
10.1145/2987550.2987554
Venue
SoCC
Field
Synchronization, Computer science, Parallel computing, Thread (computing), Real-time computing, Artificial intelligence, Jitter, Deep learning, Scalability
DocType
Conference
Citations
24
PageRank
0.82
References
40
Authors
8
Name, Order, Citations, PageRank
Aaron Harlap, 1, 24, 0.82
Henggang Cui, 2, 307, 11.66
Wei Dai, 3, 333, 12.77
Jinliang Wei, 4, 304, 10.86
Gregory R. Ganger, 5, 4560, 383.16
Phillip B. Gibbons, 6, 6863, 624.14
Garth A. Gibson, 7, 2517, 250.27
Bo Xing, 8, 7332, 471.43