Title
Compact Filters for Fast Online Data Partitioning
Abstract
We are approaching a point in time when it will be infeasible to catalog and query data after it has been generated. This trend has fueled research on in-situ data processing (i.e. operating on data as it is streamed to storage). One important example of this approach is in-situ data indexing. Prior work has shown the feasibility of indexing at scale as a two-step process. First, one partitions data by key across the CPU cores of a parallel job. Then each core indexes its subset as data is persisted. Online partitioning requires transferring data over the network so that it can be indexed and stored by the core responsible for the data. This approach is becoming increasingly costly as new computing platforms emphasize parallelism instead of individual core performance that is crucial for communication libraries and systems software in general. In addition to indexing, scalable online data partitioning is also useful in other contexts such as load balancing and efficient compression. We present FilterKV, an efficient data management scheme for fast online data partitioning of key-value (KV) pairs. FilterKV reduces the total amount of data sent over the network and to storage. We achieve this by: (a) partitioning pointers to KV pairs instead of the KV pairs themselves and (b) using a compact format to represent and store KV pointers. Results from LANL show that FilterKV can reduce total write slowdown (including partitioning overhead) by up to 3x across 4096 CPU cores.
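The difference between shipping whole KV pairs and shipping only pointers to them can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not FilterKV's actual implementation: the helper names (`owner_of`, `partition_full`, `partition_pointers`, `lookup`), the CRC32-based partitioner, and the one-byte rank identifier are all hypothetical, and the compact pointer format mentioned in the abstract is not modeled.

```python
import zlib

def owner_of(key: bytes, num_ranks: int) -> int:
    """Hypothetical partitioner: map a key to the rank that indexes it."""
    return zlib.crc32(key) % num_ranks

def partition_full(records, num_ranks):
    """Baseline: send every KV pair to the rank that owns its key."""
    shards = [dict() for _ in range(num_ranks)]
    bytes_sent = 0
    for src, key, value in records:
        dst = owner_of(key, num_ranks)
        shards[dst][key] = value
        if dst != src:  # only cross-rank transfers use the network
            bytes_sent += len(key) + len(value)
    return shards, bytes_sent

def partition_pointers(records, num_ranks):
    """Keep each value on its producer; partition only (key -> producer) pointers."""
    local_logs = [dict() for _ in range(num_ranks)]
    index = [dict() for _ in range(num_ranks)]
    bytes_sent = 0
    for src, key, value in records:
        local_logs[src][key] = value      # value stays where it was produced
        dst = owner_of(key, num_ranks)
        index[dst][key] = src             # pointer is just the producer's rank
        if dst != src:
            bytes_sent += len(key) + 1    # assumption: rank id fits in one byte
    return local_logs, index, bytes_sent

def lookup(key, local_logs, index):
    """Resolve a key in two hops: owning rank's index, then producer's log."""
    src = index[owner_of(key, len(index))][key]
    return local_logs[src][key]

if __name__ == "__main__":
    ranks = 4
    # (producer rank, key, 64-byte value); synthetic data for illustration only
    records = [(i % ranks, f"key-{i}".encode(), bytes(64)) for i in range(100)]
    _, full_bytes = partition_full(records, ranks)
    logs, idx, ptr_bytes = partition_pointers(records, ranks)
    print("bytes shipped (full KV pairs):", full_bytes)
    print("bytes shipped (pointers only):", ptr_bytes)
    assert lookup(b"key-7", logs, idx) == bytes(64)
```

In this sketch the saving on each cross-rank record is the value size minus the pointer size, so larger values favor the pointer scheme. The price is a second hop at read time, from the owning rank's index to the producer's local log.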
Year
2019
DOI
10.1109/CLUSTER.2019.8890992
Venue
2019 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords
data processing, in-situ data indexing, data partitioning, core indexes, online partitioning, scalable online data partitioning, data management scheme, fast online data partitioning, partitioning pointers, CPU cores, catalog, query data, partitioning overhead
Field
Pointer (computer programming), Central processing unit, Load balancing (computing), Computer science, Parallel computing, Search engine indexing, Multi-core processor, Data management, Benchmark (computing), Distributed computing, Scalability
DocType
Conference
ISSN
1552-5244
ISBN
978-1-7281-4735-2
Citations
0
PageRank
0.34
References
29
Authors
8
Name
Order
Citations
PageRank
Qing Zheng1915.40
Charles D. Cranor258252.19
Ankush Jain300.34
Gregory R. Ganger44560383.16
Garth A. Gibson52517250.27
George Amvrosiadis611110.40
Bradley W. Settlemyer712013.00
Gary Grider825316.11