Title
Analysis of SQL Workloads on an Enterprise Datalake
Abstract
Over the last three years we have been running a large-scale data processing platform for applying analytics to corporate data on a private cloud instance. We control every level in the stack from the processing engines down to the hardware. One very common pattern of usage is for data scientists to use SQL/Hadoop to explore and analyze data sets. Data scientists are free to run whatever queries they want on this shared environment. Here we report on the patterns of usage of data scientists and the measured performance of the queries they create. We motivate why it is difficult to estimate the resource usage of a SQL query on such a system ahead of time and explain the consequences for the design of enterprise datalakes.
Year
2020
DOI
10.1109/CLOUD49709.2020.00016
Venue
2020 IEEE 13th International Conference on Cloud Computing (CLOUD)
Keywords
monitoring, private cloud, SQL/Hadoop, Datalake
DocType
Conference
ISSN
2159-6182
ISBN
978-1-7281-8781-5
Citations
0
PageRank
0.34
References
5
Authors
3
Name | Order | Citations | PageRank
Luis Garces-Erice | 1 | 17 | 3.68
Sean Rooney | 2 | 81 | 12.50
Zoltán A. Nagy | 3 | 0 | 0.34