Subspace selection in high-dimensional big data using genetic algorithm in Apache Spark.

Paper Info

Title
Subspace selection in high-dimensional big data using genetic algorithm in Apache Spark.

Abstract
In high-dimensional spaces with large amounts of data, distances between data points tend to become relatively uniform. The notion of a data point's nearest neighbours thus becomes meaningless, a phenomenon known as the curse of dimensionality. Identifying outliers (data points whose statistical characteristics differ significantly from those of the majority of the data) in such a high-dimensional space can be a significant challenge. Mining for outliers in subspaces of relevant attributes is one approach to this problem, and identifying these attributes is the main objective of this work. In this paper, we scale a grid-based solution to search for subspaces that are candidates for outlier detection with regard to the subset of features in the subspace. We specify a population and a fitness function for a distributed genetic algorithm that heuristically searches the subspaces within the high-dimensional data and finds the subspace with maximal sparsity. We designed and implemented our proposed subspace selection algorithm in Apache Spark, a fast in-memory engine for large-scale data processing. The initial experimental results on a large dataset (77,000 records and 1,379 attributes) confirm that our proposed method can identify the most relevant subspaces for outlier detection.
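The abstract does not spell out the grid, the population encoding, or the fitness function, so the following is only a minimal single-machine sketch of the general idea, not the authors' Spark implementation. It assumes an equi-depth grid with `phi` ranges per attribute and uses the sparsity coefficient from Aggarwal and Yu's grid-based outlier detection, S = (n - N f^k) / sqrt(N f^k (1 - f^k)) with f = 1/phi, as the fitness. Scoring a subspace by its sparsest occupied cell, and the names `discretize`, `sparsity`, and `genetic_search`, are illustrative assumptions.

```python
import math
import random
from collections import Counter


def discretize(data, phi):
    """Map every column to equi-depth bin indices 0..phi-1 using ranks."""
    n, d = len(data), len(data[0])
    bins = [[0] * d for _ in range(n)]
    for j in range(d):
        order = sorted(range(n), key=lambda i: data[i][j])
        for rank, i in enumerate(order):
            bins[i][j] = rank * phi // n
    return bins


def sparsity(bins, dims, phi):
    """Lowest sparsity coefficient over the occupied grid cells of `dims`.

    Assumed fitness: more negative means a sparser cell. Requires phi >= 2.
    """
    n, k = len(bins), len(dims)
    p = (1.0 / phi) ** k                  # expected fraction of points per cell
    expected = n * p
    std = math.sqrt(n * p * (1.0 - p))
    counts = Counter(tuple(row[j] for j in dims) for row in bins)
    return min((c - expected) / std for c in counts.values())


def genetic_search(bins, phi, k, pop_size=20, generations=30, seed=0):
    """Heuristically search for the k-attribute subspace with the lowest score."""
    rng = random.Random(seed)
    d = len(bins[0])

    def fitness(ind):
        return sparsity(bins, ind, phi)

    # Each individual is a sorted tuple of k distinct attribute indices.
    pop = [tuple(sorted(rng.sample(range(d), k))) for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        nxt = [best]                      # elitism: carry the best forward
        while len(nxt) < pop_size:
            # Binary tournament selection of two parents.
            a = min(rng.sample(pop, 2), key=fitness)
            b = min(rng.sample(pop, 2), key=fitness)
            # Crossover: draw k attributes from the parents' union.
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))
            # Mutation: occasionally swap one attribute for another.
            if rng.random() < 0.2:
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice([j for j in range(d) if j not in child]))
            nxt.append(tuple(sorted(child)))
        pop = nxt
        best = min(pop, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    gen = random.Random(42)
    data = [[gen.random() for _ in range(8)] for _ in range(300)]
    grid = discretize(data, phi=4)
    print(genetic_search(grid, phi=4, k=2))
```

In a distributed setting, the per-cell counting inside `sparsity` is the step that would naturally be spread across workers, since each individual's fitness needs one pass over the records; how the paper actually partitions this work in Spark is not stated in the abstract.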

Year	Venue	Field
2017	ICC	Data point, Anomaly detection, Data mining, Clustering high-dimensional data, Subspace topology, Computer science, Outlier, Computer network, Curse of dimensionality, Linear subspace, Fitness function
DocType	Citations	PageRank
Conference	0	0.34
References	Authors
13	3

Authors (3 rows)

Name	Order	Citations	PageRank
Fatemeh Cheraghchi	1	5	2.44
Arash Iranzad	2	0	0.34
Bijan Raahemi	3	155	22.29
