Title
PLI+: efficient clustering of cloud databases
Abstract
Commercial cloud database services increase availability of data and provide reliable access to data. Routine database maintenance tasks such as clustering, however, increase the costs of hosting data on commercial cloud instances. Clustering causes an I/O burst; clustering in one shot depletes the I/O credits accumulated by an instance and increases the cost of hosting data. An unclustered database, in turn, decreases query performance because queries scan large amounts of data, gradually depleting I/O credits. In this paper, we introduce Physical Location Index Plus (PLI+), an indexing method for databases hosted on commercial cloud. PLI+ relies on internal knowledge of data layout to build a physical location index, which maps a range of physical co-locations to a range of attribute values to create approximately sorted buckets. As new data is inserted, writes are partitioned in memory based on the incoming data distribution. The data is written to physical locations on disk in block-based partitions to favor large-granularity I/O. Incoming SQL queries on indexed attribute values are rewritten in terms of the physical location ranges. As a result, PLI+ does not decrease query performance on an unclustered cloud database instance, and DBAs may choose to cluster the instance when they have sufficiently large I/O credit available for clustering, thus delaying the need for clustering. We evaluate query performance over PLI+ by comparing it with clustered indexes, unclustered (secondary) indexes, and log-structured merge trees on real datasets. Experiments show that PLI+ significantly delays clustering and yet does not degrade query performance, thus achieving a higher level of sortedness than unclustered indexes and log-structured merge trees. We also evaluate the quality of clustering, by introducing a measure of interval sortedness, and the size of the index.
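The mapping described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Bucket`, `build_index`, and `rewrite_range_query` are hypothetical, blocks are modeled as in-memory lists of integer attribute values, and a bucket is assumed to cover a fixed number of consecutive blocks. The sketch only shows how a range predicate on the indexed attribute could be turned into physical block ranges; it does not model in-memory write partitioning, SQL rewriting, or I/O credits.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bucket:
    first_block: int  # first physical block covered by the bucket (inclusive)
    last_block: int   # last physical block covered by the bucket (inclusive)
    min_value: int    # smallest attribute value stored in the bucket
    max_value: int    # largest attribute value stored in the bucket


def build_index(blocks: List[List[int]], blocks_per_bucket: int) -> List[Bucket]:
    """Summarize consecutive blocks as (location range, value range) buckets."""
    index: List[Bucket] = []
    for start in range(0, len(blocks), blocks_per_bucket):
        group = blocks[start:start + blocks_per_bucket]
        values = [v for block in group for v in block]
        if not values:
            continue  # nothing stored in this group of blocks
        index.append(
            Bucket(start, start + len(group) - 1, min(values), max(values))
        )
    return index


def rewrite_range_query(
    index: List[Bucket], low: int, high: int
) -> List[Tuple[int, int]]:
    """Return merged block ranges that may hold values in [low, high]."""
    ranges: List[Tuple[int, int]] = []
    for b in index:
        if b.max_value < low or b.min_value > high:
            continue  # value ranges do not overlap, so skip the bucket
        if ranges and ranges[-1][1] + 1 == b.first_block:
            # Physically adjacent to the previous match: extend that range.
            ranges[-1] = (ranges[-1][0], b.last_block)
        else:
            ranges.append((b.first_block, b.last_block))
    return ranges


# Hypothetical layout: six blocks that are only approximately sorted.
blocks = [[1, 3], [2, 5], [6, 9], [7, 8], [20, 25], [21, 30]]
index = build_index(blocks, blocks_per_bucket=2)
print(rewrite_range_query(index, 4, 7))
```

Because buckets are only approximately sorted, the returned ranges are a superset of the blocks that actually contain matches; in this sketch the original predicate would still have to be applied to the rows read from those blocks.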
Year: 2019
DOI: 10.1007/s10619-018-7252-2
Venue: Distributed and Parallel Databases
Keywords: Clustered indexes, Relational databases, Scientific data and computing
Field: SQL, Relational database, Computer science, Search engine indexing, Granularity, Cluster analysis, Data access, Database, Cloud database, Cloud computing
DocType: Journal
Volume: 37
Issue: SP1
ISSN: 1573-7578
Citations: 1
PageRank: 0.35
References: 17
Authors: 4
Name                Order  Citations  PageRank
Dai Hai Ton That    1      8          2.24
James Wagner        2      16         5.56
Alexander Rasin     3      2950       209.48
Tanu Malik          4      304        35.97