Title
A Model Driven Approach Towards Improving the Performance of Apache Spark Applications
Abstract
Apache Spark applications often execute in multiple stages, where each stage consists of multiple tasks running in parallel. However, prior efforts noted that the execution time of different tasks within a stage can vary significantly for various reasons (e.g., inefficient partitioning of input data), and that tasks can be distributed unevenly across worker nodes for other reasons (e.g., data co-locality). While these problems are well known, it is nontrivial to predict and address them effectively. In this paper, we present an analytical model-driven approach that predicts the likelihood of such problems by executing an application with a limited amount of input data, and recommends ways to address the identified problems by repartitioning the input data (for the task straggler problem) and/or changing the locality configuration setting (for the skewed task distribution problem). The novelty of our approach lies in automatically predicting the potential problems a priori from limited execution data and recommending the locality setting and the number of partitions. Our experimental results using 9 Apache Spark applications on two different clusters show that our model-driven approach can predict these problems with high accuracy and improve performance by up to 71%.
Year: 2019
DOI: 10.1109/ISPASS.2019.00036
Venue: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Keywords: Apache Spark, Task Imbalance, Straggler, Performance Modeling, Performance Optimization, Task Distribution, Configuration Tuning
Field: Locality, Spark (mathematics), Computer science, Parallel computing, A priori and a posteriori, Execution time, Novelty, Partition (number theory), Distributed computing
DocType: Conference
ISBN: 978-1-7281-0746-2
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name | Order | Citations | PageRank
王克文 | 1 | 591 | 54.88
Mohammad Maifi Hasan Khan | 2 | 233 | 22.04
Nhan Nguyen | 3 | 46 | 6.33
Swapna S. Gokhale | 4 | 860 | 77.93