Title | ||
---|---|---|
A Model Driven Approach Towards Improving the Performance of Apache Spark Applications |
Abstract | ||
---|---|---|
Apache Spark applications often execute in multiple stages, where each stage consists of multiple tasks running in parallel. However, prior efforts noted that the execution time of different tasks within a stage can vary significantly for various reasons (e.g., inefficient partitioning of input data), and that tasks can be distributed unevenly across worker nodes for different reasons (e.g., data co-locality). While these problems are well known, it is nontrivial to predict and address them effectively. In this paper, we present an analytical model-driven approach that can predict the possibility of such problems by executing an application with a limited amount of input data, and that recommends ways to address the identified problems by repartitioning input data (in the case of the task straggler problem) and/or changing the locality configuration setting (in the case of the skewed task distribution problem). The novelty of our approach lies in automatically predicting the potential problems a priori based on limited execution data and recommending the locality setting and partition number. Our experimental results using 9 Apache Spark applications on two different clusters show that our model-driven approach can predict these problems with high accuracy and improve performance by up to 71%. |
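The two remedies named in the abstract (repartitioning for stragglers, changing the locality setting for skewed task distribution) can be illustrated with a minimal sketch. This is not the paper's analytical model: the ratio-based checks, the thresholds, the doubling of the partition count, and the function names are all assumptions made for illustration. Only the configuration keys `spark.default.parallelism` and `spark.locality.wait` are real Spark settings.

```python
from collections import Counter
from statistics import median


def straggler_ratio(task_durations):
    """Slowest task duration divided by the median duration in one stage."""
    return max(task_durations) / median(task_durations)


def distribution_skew(task_hosts, num_workers):
    """Busiest node's task count divided by the mean tasks per worker."""
    counts = Counter(task_hosts)
    mean_per_worker = len(task_hosts) / num_workers
    return max(counts.values()) / mean_per_worker


def recommend(task_durations, task_hosts, current_partitions, num_workers,
              straggler_threshold=2.0, skew_threshold=1.5):
    """Return hypothetical Spark config overrides for one profiled stage.

    Thresholds and the doubling rule are illustrative assumptions,
    not values taken from the paper.
    """
    conf = {}
    if straggler_ratio(task_durations) > straggler_threshold:
        # Straggler suspected: suggest more, smaller partitions.
        conf["spark.default.parallelism"] = str(current_partitions * 2)
    if distribution_skew(task_hosts, num_workers) > skew_threshold:
        # Tasks piled on few nodes: stop waiting for data-local slots.
        conf["spark.locality.wait"] = "0s"
    return conf
```

For example, `recommend([1.0, 1.1, 0.9, 5.0], ["a", "a", "a", "a"], 8, 2)` flags both problems, because one task runs far longer than the median and all four tasks ran on one of the two workers.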
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/ISPASS.2019.00036 | 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) |
Keywords | Field | DocType |
---|---|---|
Apache Spark, Task Imbalance, Straggler, Performance Modeling, Performance Optimization, Task Distribution, Configuration Tuning | Locality, Spark (mathematics), Computer science, Parallel computing, A priori and a posteriori, Execution time, Novelty, Partition (number theory), Distributed computing | Conference |
ISBN | Citations | PageRank |
---|---|---|
978-1-7281-0746-2 | 0 | 0.34 |
References | Authors |
---|---|
0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
王克文 | 1 | 591 | 54.88 |
Mohammad Maifi Hasan Khan | 2 | 233 | 22.04 |
Nhan Nguyen | 3 | 46 | 6.33 |
Swapna S. Gokhale | 4 | 860 | 77.93 |