Title
Computing Mutual Information Of Big Categorical Data And Its Application To Feature Grouping
Abstract
This paper develops MiCS, a parallel computing system for the mutual information of big categorical data on the Spark computing platform. By applying a column-wise transformation scheme, the MiCS algorithm is conducive to processing the large number of highly repetitive mutual-information calculations among feature pairs. To improve the efficiency of MiCS and the utilization of Spark cluster resources, we adopt a virtual partitioning scheme that balances load while mitigating the data skew problem in the Spark shuffle process.
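For orientation, the quantity computed for each feature pair can be sketched in plain Python. This is a minimal single-machine illustration of the standard definition of mutual information for two categorical columns, not the MiCS algorithm itself: it does not model the column-wise transformation, the virtual partitioning scheme, or Spark. The function names `mutual_information` and `pairwise_mi` are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information, in bits, of two equal-length categorical columns.

    Uses I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) * p(y))),
    with probabilities estimated from value counts.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0

    joint = Counter(zip(xs, ys))  # counts of each (x, y) value pair
    px = Counter(xs)              # marginal counts of x values
    py = Counter(ys)              # marginal counts of y values

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) written in counts: (c/n) / ((px/n) * (py/n))
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def pairwise_mi(columns):
    """Mutual information for every unordered pair of named columns.

    `columns` maps a feature name to its list of categorical values.
    """
    names = sorted(columns)
    return {
        (a, b): mutual_information(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

The number of feature pairs grows quadratically with the number of features, and each pair repeats the same counting pattern, which is the repetitive workload the abstract refers to.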
Year: 2020
DOI: 10.1109/ICDE48307.2020.00210
Venue: 2020 IEEE 36th International Conference on Data Engineering (ICDE 2020)
Keywords: Parallel Mutual-information Computation, Feature Grouping, Data Skewness, Big Categorical Data, Spark
DocType: Conference
ISSN: 1084-4627
Citations: 0
PageRank: 0.34
References: 0
Authors (4)
Name            Order   Citations   PageRank
Junli Li        1       2           1.70
Chaowei Zhang   2       0           1.69
Jifu Zhang      3       95          19.42
Xiao Qin        4       1836        125.69