Title
Column Cache: Buffer Cache for Columnar Storage on HDFS
Abstract
Columnar storage is a data source for data analytics in distributed computing frameworks. For portability and scalability, columnar storage is built on top of existing distributed file systems with columnar data representations such as Parquet, RCFile, and ORC. However, these representations fail to utilize high-level information (e.g., columnar formats) for low-level disk buffer management in operating systems. As a result, data analytics workloads suffer from redundant memory buffers with expensive garbage collections, unnecessary disk readahead, and cache pollution in the operating system buffer cache. We propose column cache, which unifies and re-structures the buffers and caches of multiple software layers from columnar storage to operating systems. Column cache leverages high-level information such as file formats and query plans for enabling adaptive disk reads and cache eviction policies. We have developed a column cache prototype for Apache Parquet and observed that our prototype reduced redundant resource utilization in Apache Spark. Specifically, with our prototype, Spark showed a maximum speedup of 1.28x in TPC-DS workloads while increasing Linux page cache size by 18%, reducing total disk reads by 43%, and reducing garbage collection time in a Java virtual machine by 76%.
Year
2018
DOI
10.1109/BigData.2018.8622527
Venue
2018 IEEE International Conference on Big Data (Big Data)
Keywords
low-level disk buffer management, data analytics workloads, redundant memory buffers, cache pollution, operating system buffer cache, cache eviction policies, column cache prototype, Linux page cache size, distributed computing frameworks, columnar data representations, columnar formats, high-level information, distributed file systems, reduced redundant resource utilization, columnar storage buffer cache, Java virtual machine
Field
File format, Data mining, Disk buffer, Cache pollution, Computer science, Cache, Page cache, Garbage collection, Operating system, Speedup, Scalability
DocType
Conference
ISSN
2639-1589
ISBN
978-1-5386-5036-3
Citations
0
PageRank
0.34
References
0
Authors
3
Name               Order   Citations/PageRank
T. Yoshimura       1       14827.19
Tatsuhiro Chiba    2       21.76
Hiroshi Horii      3       15315.77