Abstract | ||
---|---|---|
The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column Q and a join column K_Q from a query table T_Q, retrieve tables T_X in a dataset collection such that T_X is joinable with T_Q on K_Q and there is a column C ∈ T_X such that Q is correlated with C. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between Q and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we 1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and 2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings. |
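
The naïve evaluation strategy that the abstract calls prohibitively expensive can be illustrated with a minimal sketch. This is not the paper's sketching method. It assumes join keys are unique within each table, uses Pearson correlation as the correlation measure (the abstract does not name one), and the names `pearson`, `naive_join_correlation`, and the `candidates` layout are hypothetical, chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def naive_join_correlation(query_keys, query_values, candidates):
    """Join the query column with every candidate column and rank by |r|.

    query_keys / query_values play the role of K_Q and Q.
    candidates is a hypothetical layout:
        {table_name: (key_list, {column_name: value_list})}
    Keys are assumed unique per table (an assumption of this sketch).
    """
    q = dict(zip(query_keys, query_values))
    results = []
    for table_name, (keys, columns) in candidates.items():
        for column_name, values in columns.items():
            c = dict(zip(keys, values))
            # Explicit join: keep only keys present in both tables.
            shared = [k for k in q if k in c]
            r = pearson([q[k] for k in shared], [c[k] for k in shared])
            if r is not None:
                results.append((table_name, column_name, r))
    # Strongest correlations first, regardless of sign.
    results.sort(key=lambda item: abs(item[2]), reverse=True)
    return results


candidates = {
    "t1": (["b", "c", "d", "e"], {"x": [4, 6, 8, 10], "y": [3, 2, 1, 0]}),
    "t2": (["z"], {"w": [5]}),  # not joinable with the query keys
}
ranking = naive_join_correlation(["a", "b", "c", "d"], [1, 2, 3, 4], candidates)
```

Every query pays for one full join and one correlation per candidate column, so the cost grows with both the number of tables and their sizes. That is the cost the abstract says the proposed sketches and index are meant to avoid.
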
Year | DOI | Venue |
---|---|---|
2021 | 10.1145/3448016.3458456 | International Conference on Management of Data |
DocType | ISSN | Citations |
Conference | 0730-8078 | 0 |
PageRank | References | Authors |
0.34 | 0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Aécio S. R. Santos | 1 | 22 | 4.84 |
Aline Bessa | 2 | 5 | 2.80 |
Fernando Seabra Chirigati | 3 | 205 | 16.38 |
Christopher Musco | 4 | 1 | 1.38 |
Juliana Freire | 5 | 3956 | 270.89 |