Title
Subspace selection in high-dimensional big data using genetic algorithm in Apache Spark.
Abstract
In high-dimensional space with large amounts of data, distances between data points tend to become relatively uniform. The notion of the nearest neighbours of a data point thus becomes meaningless, a phenomenon known as the curse of dimensionality. Identifying outliers (data points whose statistical characteristics differ significantly from those of the majority of the data) in such a high-dimensional space can be a significant challenge. Mining for outliers in subspaces of relevant attributes is one approach to this problem, and identifying these attributes is the main objective of this work. In this paper, we scale a grid-based solution to search for subspaces that are candidates for outlier detection with regard to the subset of features in the subspace. We specify a population and a fitness function for a distributed genetic algorithm that heuristically searches the subspaces within the high-dimensional data and finds the subspace with maximal sparsity. We designed and implemented our proposed subspace selection algorithm in Apache Spark, a fast in-memory engine for large-scale data processing. The initial experimental results on a large dataset (77,000 records and 1,379 attributes) confirm that our proposed method can identify the most relevant subspaces for outlier detection.
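The search described in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: the abstract does not give the fitness function, so the code assumes the sparsity coefficient commonly used in grid-based outlier detection, (count - N·f^k) / sqrt(N·f^k·(1 - f^k)) with f = 1/phi, and it assumes equi-depth binning. The names `discretize`, `sparsity`, `search`, `phi`, and `k` are hypothetical, and plain Python stands in for Spark; in a distributed setting the per-cell counting would presumably be the step to parallelise.

```python
import math
import random
from collections import Counter

def discretize(column, phi):
    """Map each value to an equi-depth bin index in [0, phi)."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    bins = [0] * len(column)
    for rank, i in enumerate(order):
        bins[i] = rank * phi // len(column)
    return bins

def sparsity(grid, subspace, phi):
    """Most negative sparsity coefficient over the occupied grid cells.

    grid is a list of rows of bin indices; subspace is a tuple of
    attribute indices. Lower values mean a sparser cell than expected.
    """
    n, k = len(grid), len(subspace)
    p = (1.0 / phi) ** k                      # chance of landing in one cell
    expected = n * p
    std = math.sqrt(n * p * (1 - p))
    counts = Counter(tuple(row[a] for a in subspace) for row in grid)
    return min((c - expected) / std for c in counts.values())

def crossover(a, b, k, rng):
    """Child draws k attributes from the union of both parents."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, k)))

def mutate(ind, n_attrs, rate, rng):
    """With probability rate, swap one attribute for one not in the set."""
    genes = set(ind)
    if rng.random() < rate:
        genes.remove(rng.choice(sorted(genes)))
        genes.add(rng.choice([a for a in range(n_attrs) if a not in genes]))
    return tuple(sorted(genes))

def search(grid, k, phi, pop_size=20, generations=30, rate=0.3, seed=0):
    """Genetic search for a k-attribute subspace with minimal coefficient."""
    rng = random.Random(seed)
    n_attrs = len(grid[0])
    pop = [tuple(sorted(rng.sample(range(n_attrs), k))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: sparsity(grid, s, phi))
        parents = pop[: pop_size // 2]        # keep the sparser half (elitism)
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), k, rng),
                   n_attrs, rate, rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return min(pop, key=lambda s: sparsity(grid, s, phi))

if __name__ == "__main__":
    data_rng = random.Random(1)
    raw = [[data_rng.random() for _ in range(8)] for _ in range(200)]
    columns = [discretize([row[j] for row in raw], 4) for j in range(8)]
    grid = [list(row) for row in zip(*columns)]
    best = search(grid, k=3, phi=4)
    print(best, sparsity(grid, best, 4))
```

Because the better half of each generation is carried over unchanged, the best fitness never gets worse from one generation to the next. The population size, mutation rate, and number of generations are arbitrary illustrative values, not settings reported by the paper.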
Year: 2017
Venue: ICC
Field: Data point, Anomaly detection, Data mining, Clustering high-dimensional data, Subspace topology, Computer science, Outlier, Computer network, Curse of dimensionality, Linear subspace, Fitness function
DocType: Conference
Citations: 0
PageRank: 0.34
References: 13
Authors: 3
Name | Order | Citations | PageRank
Fatemeh Cheraghchi | 1 | 5 | 2.44
Arash Iranzad | 2 | 0 | 0.34
Bijan Raahemi | 3 | 155 | 22.29