Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics

Paper Info

Title
Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics

Abstract
Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on the one hand, it is a key part of the computation with a major impact on overall performance and scalability, so its efficiency is paramount; on the other hand, it needs to operate with scarce memory in order to leave as much memory as possible available for data caching. In this context, scheduling data transfers efficiently so that both dimensions of the problem are addressed simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield suboptimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly outline as a series of design principles.

Year	2016
DOI	10.1109/CCGrid.2016.85
Venue	2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Keywords	big data analytics, data shuffling, memory-efficient I/O, elastic buffering
Field	Spark (mathematics), Data transmission, Scheduling (computing), Computer science, Shuffling, Memory management, Big data, Distributed computing, Computation, Scalability
DocType	Conference
ISSN	2376-4414
ISBN	978-1-5090-2454-4
Citations	2
PageRank	0.38
References	13
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Bogdan Nicolae	1	392	29.51
Carlos H. A. Costa	2	20	3.26
Claudia Misale	3	23	5.44
Kostas Katrinis	4	102	19.41
Yoonho Park	5	350	35.57
