CHOPPER: Optimizing Data Partitioning for In-memory Data Analytics Frameworks

Paper Info

Title
CHOPPER: Optimizing Data Partitioning for In-memory Data Analytics Frameworks

Abstract
The performance of in-memory data analytics frameworks such as Spark is significantly affected by how data is partitioned, because the partitioning effectively determines task granularity and parallelism. Moreover, different phases of a workload's execution can have different optimal partitionings. However, in current implementations, the tuning knobs that control partitioning are either configured statically or require a cumbersome programmatic process to effect changes at runtime. In this paper, we propose CHOPPER, a system for automatically determining the optimal number of partitions for each phase of a workload and dynamically changing the partition scheme during workload execution. CHOPPER monitors task execution and DAG scheduling information to determine the optimal level of parallelism. CHOPPER repartitions data as needed to ensure efficient task granularity, avoid data skew, and reduce shuffle traffic. Thus, CHOPPER allows users to write applications without having to hand-tune for optimal parallelism. Experimental results show that CHOPPER improves workload performance by up to 35.2% compared to a standard Spark setup.
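
As a rough illustration of why the partition count matters for task granularity and skew, the following toy sketch assigns integer keys to partitions by modulo and reports how unbalanced the result is. It is not CHOPPER's algorithm and does not use Spark; the function names `partition_sizes` and `skew`, the modulo assignment, and the sample key set are all assumptions made for this example.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count keys per partition, using key % num_partitions as a
    stand-in for a hash partitioner (an assumption of this sketch)."""
    counts = Counter(k % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew(sizes):
    """Largest partition divided by the mean partition size.
    1.0 means perfectly balanced; larger values mean more skew."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical workload: 500 even-valued keys.
even_keys = list(range(0, 1000, 2))

for n in (4, 5):
    sizes = partition_sizes(even_keys, n)
    # Each partition corresponds to one task, so max(sizes) bounds
    # the work done by the slowest task in this toy model.
    print(n, sizes, skew(sizes))
```

In this toy model, the same data is badly balanced under one partition count and evenly balanced under another, which is the kind of effect that makes a single static setting a poor fit for every phase of a workload.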

Year	2016
DOI	10.1109/CLUSTER.2016.41
Venue	2016 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords	DAG scheduling, data analytics, data partitioning, in-memory, shuffle stage, Spark, task parallelism
Field	Job shop scheduling, Spark (mathematics), Data analysis, Computer science, Workload, Scheduling (computing), Parallel computing, Real-time computing, Skew, Granularity, Chopper, Distributed computing
DocType	Conference
ISSN	1552-5244
ISBN	978-1-5090-3654-7
Citations	1
PageRank	0.37
References	15
Authors	6

Authors

Name	Order	Citations	PageRank
Arnab Kumar Paul	1	12	2.73
Wenjie Zhuang	2	40	1.78
Luna Xu	3	25	2.99
Min Li	4	115	5.28
M. Mustafa Rafique	5	157	15.49
Ali R. Butt	6	651	47.51
