Title
A Method for Optimizing Opaque Filter Queries
Abstract
An important class of database queries in machine learning and data science workloads is the opaque filter query: a query with a selection predicate that is implemented with a UDF, with semantics that are unknown to the query optimizer. Some typical examples would include a CNN-style trained image classifier, or a textual sentiment classifier. Because the optimizer does not know the predicate's semantics, it cannot employ standard optimizations, yielding long query times. We propose voodoo indexing, a two-phase method for optimizing opaque filter queries. Before any query arrives, the method builds a hierarchical "query-independent" index of the database contents, which groups together similar objects. At query-time, the method builds a map of how much each group satisfies the predicate, while also exploiting the map to accelerate execution. Unlike past methods, voodoo indexing does not require insight into predicate semantics, works on any data type, and does not require in-query model training. We describe both standalone and SparkSQL-specific implementations, plus experiments on both image and text data, on more than 100 distinct opaque predicates. We show voodoo indexing can yield up to an 88% improvement over standard scan behavior, and a 79% improvement over the previous best method adapted from research literature.
Year
DOI
Venue
2020
10.1145/3318464.3389766
SIGMOD/PODS '20: International Conference on Management of Data Portland OR USA June, 2020
Keywords
DocType
ISBN
User-defined function (UDF) optimization, opaque filter query, hierarchical multi-armed bandit algorithm
Conference
978-1-4503-6735-6
Citations 
PageRank 
References 
0
0.34
0
Authors
4
Name
Order
Citations
PageRank
Wenjia He100.34
Michael Anderson212519.21
Maxwell Strome300.34
Michael J. Cafarella42246144.15