Title
Scalable and Flexible Two-Phase Ensemble Algorithms for Causality Discovery
Abstract
Causality study investigates cause-effect relationships among different variables of a system and has been widely used in many disciplines, including climatology and neuroscience. To discover causal relationships, many data-driven causality discovery methods, e.g., Granger causality, PCMCI and Dynamic Bayesian Network, have been proposed. Many of these causality discovery approaches mine time-series data and generate a directed causality graph where each graph edge denotes a cause-effect relationship between the two connected graph nodes. Our benchmarking of different causality discovery approaches with real-world climate data shows that these approaches often generate quite different causality results from the same input dataset because their internal learning mechanisms differ. Meanwhile, the amount of available data keeps growing in virtually every discipline, which makes it increasingly difficult for existing causality discovery algorithms to produce causality results within reasonable time. To address these two challenges, this paper utilizes data partitioning and ensemble techniques, and proposes a flexible two-phase causality ensemble framework. The framework first conducts a phase 1 ensemble over partitioned data and then conducts a phase 2 ensemble over the phase 1 ensemble results. Based on the framework, we develop two ensemble approaches: i) data ensemble at phase 1 and algorithm ensemble at phase 2, and ii) algorithm ensemble at phase 1 and data ensemble at phase 2. To achieve scalability, we further parallelize the ensemble approaches via the Spark big data analytics engine. The proposed ensemble approaches are evaluated on synthetic and real-world datasets. Our experiments show that the proposed approaches achieve good accuracy through ensemble and high scalability through data-parallelization in distributed computing environments. (C) 2021 The Author(s). Published by Elsevier Inc.
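The two orderings described in the abstract can be illustrated with a minimal sketch. It assumes that each causality discovery algorithm returns a set of directed (cause, effect) edges and that both phases combine graphs by simple majority voting. The abstract does not state the combination rule, so the voting rule, the threshold, and the names `vote`, `data_then_algorithm`, and `algorithm_then_data` are hypothetical and are not the authors' implementation.

```python
from collections import Counter
from typing import Any, Callable, Iterable, Sequence, Set, Tuple

Edge = Tuple[str, str]                    # (cause, effect)
Algorithm = Callable[[Any], Set[Edge]]    # hypothetical: data partition -> edge set


def vote(graphs: Iterable[Set[Edge]], threshold: float = 0.5) -> Set[Edge]:
    """Keep edges found in more than `threshold` of the graphs (assumed rule)."""
    graphs = list(graphs)
    counts = Counter(edge for graph in graphs for edge in graph)
    return {edge for edge, n in counts.items() if n / len(graphs) > threshold}


def data_then_algorithm(partitions: Sequence[Any],
                        algorithms: Sequence[Algorithm]) -> Set[Edge]:
    """Approach i): data ensemble at phase 1, algorithm ensemble at phase 2."""
    # Phase 1: for each algorithm, combine its results across all partitions.
    phase1 = [vote(alg(part) for part in partitions) for alg in algorithms]
    # Phase 2: combine the per-algorithm graphs.
    return vote(phase1)


def algorithm_then_data(partitions: Sequence[Any],
                        algorithms: Sequence[Algorithm]) -> Set[Edge]:
    """Approach ii): algorithm ensemble at phase 1, data ensemble at phase 2."""
    # Phase 1: for each partition, combine the results of all algorithms.
    phase1 = [vote(alg(part) for alg in algorithms) for part in partitions]
    # Phase 2: combine the per-partition graphs.
    return vote(phase1)
```

In this sketch the phase 1 work on each partition is independent of the other partitions, which is the property that would allow it to be distributed, for example across Spark workers.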
Year
2021
DOI
10.1016/j.bdr.2021.100252
Venue
BIG DATA RESEARCH
Keywords
Causality discovery, Ensemble learning, Data parallelism, Granger causality, Dynamic Bayesian network
DocType
Journal
Volume
26
ISSN
2214-5796
Citations
0
PageRank
0.34
References
0
Authors
3
Name, Order, Citations, PageRank
Pei Guo, 1, 0, 1.01
Yiyi Huang, 2, 0, 0.34
Jianwu Wang, 3, 215, 26.72