Title
SP-Partitioner: A novel partition method to handle intermediate data skew in Spark Streaming
Abstract
Spark Streaming, a popular tool for processing live data streams, offers a good divide-and-conquer solution in which a data stream is split into batches that are processed in parallel by mappers, and the intermediate data from the mappers are then reduced by reducers. However, a key issue with this approach for live data processing is partitioning skew, in which the data distributed over the processing units are unbalanced because the characteristics of the incoming data streams are uncertain. This imbalance ripples through the mappers and becomes pronounced at the reducers, making the reduce stage a performance bottleneck for the overall system. To address this issue, we present a partitioner, SP-Partitioner, that sits between the map and reduce stages to rebalance the workload of the reducers. In our design, we treat the batches of data that have already arrived as candidate samples and select samples by systematic sampling to predict the characteristics of the intermediate data. Based on this prediction, our method generates a reference table that guides the even allocation of the next batches of data. We implement SP-Partitioner in Spark 1.6.1 and evaluate its performance with widely used applications. Experimental results on a real cluster of VMs show that our algorithms not only achieve better load balance on data with varying degrees of skew, but also reduce the average processing time per batch.
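The abstract gives only the outline of the approach: sample, predict, build a reference table, then allocate. The Python sketch below illustrates that outline under stated assumptions and is not the authors' implementation. It assumes that sampling takes every `step`-th record of an already-arrived batch, that the prediction is a plain key-frequency count, that the reference table is built by a greedy heaviest-key-first heuristic, and that keys missing from the table fall back to hashing. All function names are hypothetical.

```python
import heapq
import zlib
from collections import Counter


def systematic_sample(records, step, start=0):
    """Systematic sampling: take every `step`-th record from index `start`."""
    return records[start::step]


def build_reference_table(sampled_keys, num_reducers):
    """Map each sampled key to a reducer, balancing the estimated load.

    Assumed heuristic (not from the paper): place the most frequent keys
    first, each on the reducer with the smallest estimated load so far.
    """
    counts = Counter(sampled_keys)
    heap = [(0, r) for r in range(num_reducers)]  # (estimated load, reducer id)
    heapq.heapify(heap)
    table = {}
    # Sort by descending frequency, then by key so ties are deterministic.
    for key, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        table[key] = reducer
        heapq.heappush(heap, (load + freq, reducer))
    return table


def assign(key, table, num_reducers):
    """Look the key up in the reference table; hash keys that were never sampled."""
    if key in table:
        return table[key]
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def reducer_loads(keys, table, num_reducers):
    """Count how many records of a batch each reducer would receive."""
    loads = [0] * num_reducers
    for key in keys:
        loads[assign(key, table, num_reducers)] += 1
    return loads
```

For example, a table built from `systematic_sample(previous_batch_keys, 10)` can be passed to `reducer_loads` together with the next batch's keys to inspect how evenly the records would be spread. A single key that dominates the stream still lands on one reducer in this sketch, since the table maps whole keys and does not split them.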
Year
2018
DOI
10.1016/j.future.2017.07.014
Venue
Future Generation Computer Systems
Keywords
Spark streaming, Partitioning skew, Key skew, Prediction
Field
Bottleneck, Data stream mining, Data processing, Spark (mathematics), Workload, Data stream, Computer science, Parallel computing, Real-time computing, Skew, Reference table, Distributed computing
DocType
Journal
Volume
86
ISSN
0167-739X
Citations
4
PageRank
0.38
References
16
Authors
6
Name
Order
Citations
PageRank
Guipeng Liu1121.52
Xiaomin Zhu2921100.31
Ji Wang314012.56
Deke Guo47525.36
Weidong Bao5368.51
Hui Guo613417.47