Abstract |
---|
Distributed in-memory data processing engines accelerate iterative applications by caching substantial datasets in memory rather than recomputing them in each iteration. Selecting a suitable cluster size for caching these datasets plays an essential role in achieving optimal performance. In practice, this is a tedious and hard task for end users, who are typically not aware of cluster specifications, workload semantics, and the sizes of intermediate data. We present Blink, an autonomous sampling-based framework that predicts the sizes of cached datasets and selects the optimal cluster size without relying on historical runs. We evaluate Blink on a variety of iterative, real-world machine learning applications. With sample runs costing on average 4.6% of the cost of the optimal runs, Blink selects the optimal cluster size in 15 out of 16 cases, saving up to 47.4% of execution cost compared to average costs. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1007/978-3-031-15743-1_14 | Symposium on Advances in Databases and Information Systems (ADBIS) |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Hani Al-Sayeh | 1 | 0 | 0.68 |
Muhammad Attahir Jibril | 2 | 0 | 1.01 |
Bunjamin Memishi | 3 | 0 | 0.68 |
Kai-Uwe Sattler | 4 | 1144 | 126.81 |