Title
From performance profiling to predictive analytics while evaluating Hadoop cost-efficiency in ALOJA
Abstract
In recent years, the exponential growth of data, its generation speed, and its expected consumption rate have presented one of the most important challenges in IT, for both industry and research. For these reasons, the ALOJA research project was created by BSC and Microsoft as an open initiative to increase the cost-efficiency and general understanding of Big Data systems via automation and learning. Over its first year, the project has produced an open-source benchmarking platform used to build the largest public repository of Big Data results, featuring over 42,000 job execution details. ALOJA also includes web-based analytic tools to evaluate and gather insights about the cost-performance of benchmarked systems. The tools offer means to extract knowledge that can help optimize configuration and deployment options in the Cloud, e.g., selecting the most cost-effective VMs and cluster sizes. This article describes the evolution of the project's focus and research lines over a period of more than a year of continuously benchmarking Big Data systems. It also discusses the motivation, both technical and market-based, for these changes, and presents the main results from the evaluation of different OS and Hadoop configurations, covering over 100 hardware deployments. During this time, ALOJA's target has shifted from low-level profiling of the Hadoop runtime with HPC tools, through extensive benchmarking and evaluation of a large body of results via aggregation, to its current focus on Predictive Analytics (PA) techniques. The ongoing PA efforts show promising results in automatically modeling the behavior of systems, e.g., predicting job execution times with high accuracy or reducing the number of benchmark runs needed, as well as in Knowledge Discovery (KD) to find relations among software and hardware components. Together, these techniques support forecasting the cost-effectiveness of newly defined systems, reducing benchmarking time and costs.
Year: 2015
DOI: 10.1109/BigData.2015.7363876
Venue: Big Data
Field: Data science, Data mining, Computer science, Predictive analytics, Profiling (computer programming), Software, Knowledge extraction, Big data, Benchmarking, Benchmark (computing), Cloud computing
DocType: Conference
Citations: 3
PageRank: 0.54
References: 11
Authors: 9
Name                  Order   Citations   PageRank
Nicolas Poggi         1       109         9.44
Josep Lluis Berral    2       132         11.86
David Carrera         3       221         16.12
Aaron Call            4       14          1.99
Fabrizio Gagliardi    5       3           0.54
Rob Reinauer          6       13          1.62
Nikola Vujic          7       3           0.87
Daron Green           8       16          2.30
José A. Blakeley      9       642         207.43