Title
Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms
Abstract
Functional dependencies (FDs) play an important role in many data management tasks such as schema normalization, data cleaning, and query optimization. Meanwhile, application demand for efficient FD discovery on large-scale datasets keeps growing. Unfortunately, because of their high runtime and memory overhead, existing single-machine FD discovery algorithms are inefficient on large-scale datasets. Distributed data-parallel computing has recently become the de facto standard for large-scale data processing, but designing an efficient distributed FD discovery algorithm remains challenging. In this paper, we present SmartFD, an efficient and scalable algorithm for distributed FD discovery. First, we propose a novel attribute sorting-based algorithm framework. Next, to discover all the FDs grouped by a given attribute, we propose an efficient distributed algorithm, Attribute-centric Functional Dependency Discovery (AFDD). In AFDD, we design a Fast Sampling and Early Aggregation (FSEA) mechanism to improve the efficiency of distributed sampling, and we propose a memory-efficient index-based method for distributed FD validation. Moreover, AFDD employs an attribute-parallel method to accelerate the pruning and generation of candidate FDs. Furthermore, we propose an adaptive strategy for switching between distributed sampling and distributed validation, based on a unified time-based efficiency metric, and we employ a distributed probing-based method to make the switching strategy more accurate. Experimental results on Apache Spark show that SmartFD outperforms the state-of-the-art single-machine algorithm HyFD and the existing distributed algorithm HFDD with speedups of 3.2×–44.9× and 2.5×–455.7×, respectively. Moreover, SmartFD achieves good row scalability and column scalability, and it has sub-linear node scalability.
Year
2019
DOI
10.1109/TPDS.2019.2925014
Venue
IEEE Transactions on Parallel and Distributed Systems
Keywords
Distributed databases, Scalability, Remuneration, Lattices, Distributed algorithms, Switches, Query processing
Field
Query optimization, De facto standard, Spark (mathematics), Computer science, Sorting, Distributed algorithm, Distributed database, Distributed computing, Scalability, Speedup
DocType
Journal
Volume
30
Issue
12
ISSN
1045-9219
Citations
2
PageRank
0.43
References
0
Authors
6
Name | Order | Citations | PageRank
Guanghui Zhu | 1 | 3 | 2.15
Qian Wang | 2 | 2 | 0.76
Qiwei Tang | 3 | 2 | 0.43
Rong Gu | 4 | 110 | 17.77
Chunfeng Yuan | 5 | 5 | 6.90
Yihua Huang | 6 | 8 | 6.61