Feature Ranking Based on Information Gain for Large Classification Problems with MapReduce - Citegraph

Paper Info

Title
Feature Ranking Based on Information Gain for Large Classification Problems with MapReduce

Abstract
In classification problems, a large number of features can pose a significant challenge in many respects, particularly in the context of Big Data. To address this issue, we propose a distributed and parallel computation of information gain based on MapReduce. The proposed implementation on Hadoop can be used for ranking the features of large datasets and, furthermore, for feature selection. Data parallelism is achieved by distributing the data uniformly using HBase tables with appropriate row keys. Performance is evaluated by estimating the speed-up of multi-node clusters relative to a one-node cluster. The framework was deployed on an on-premises Hadoop cluster. The results show that parallelizing and distributing the computations on a cluster can achieve a significant speed-up. The main contribution of this paper is a demonstration of how the higher-level scripting language Pig Latin can be used to write MapReduce jobs instead of directly writing separate map and reduce functions. Additionally, we propose using manually pre-split HBase tables instead of HDFS files for data fragmentation, so that the degree of parallelism can be set at a higher level.
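
As a rough illustration of the idea in the abstract, the sketch below computes information gain per feature in a map/reduce style using plain Python. It is not the authors' Pig Latin or HBase implementation. The function names (`map_phase`, `reduce_phase`, `rank_by_information_gain`), the in-memory row format, and the assumption that all features are already discrete are hypothetical choices made only for this example.

```python
import math
from collections import Counter, defaultdict

def map_phase(rows):
    """Emit ((feature_index, feature_value, class_label), 1) for every cell."""
    for features, label in rows:
        for idx, value in enumerate(features):
            yield (idx, value, label), 1

def reduce_phase(pairs):
    """Sum the emitted counts per key, as a MapReduce reducer would."""
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return counts

def entropy(class_counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(class_counts)
    return -sum(c / total * math.log2(c / total) for c in class_counts if c)

def rank_by_information_gain(rows):
    """Return (feature_index, gain) pairs sorted by descending gain.

    Assumes `rows` is a list of (features, label) with discrete features.
    """
    counts = reduce_phase(map_phase(rows))
    class_totals = Counter(label for _, label in rows)
    n = sum(class_totals.values())
    base = entropy(class_totals.values())

    # Regroup the reduced counts: (feature, value) -> class label counts.
    per_value = defaultdict(Counter)
    for (idx, value, label), c in counts.items():
        per_value[(idx, value)][label] += c

    # Gain = H(class) - sum over values of P(value) * H(class | value).
    gain = defaultdict(lambda: base)
    for (idx, _), label_counts in per_value.items():
        subtotal = sum(label_counts.values())
        gain[idx] -= subtotal / n * entropy(label_counts.values())
    return sorted(gain.items(), key=lambda kv: kv[1], reverse=True)
```

On a cluster, the map and reduce steps would run in parallel over data partitions, and only the small table of counts would be needed to compute the final gains. Here everything runs in one process purely to show the data flow.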

Year	2015
DOI	10.1109/Trustcom-BigDataSe-ISPA.2015.580
Venue	TrustCom/BigDataSE/ISPA
Keywords	Hadoop, HBase, MapReduce, information gain, parallelization, feature ranking
Field	Data mining, Ranking, Feature selection, Data-intensive computing, Degree of parallelism, Computer science, Theoretical computer science, Big data, Scripting language, Computation, Speedup
DocType	Conference
Volume	2
ISSN	2324-9013
Citations	2
PageRank	0.38
References	16
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Eftim Zdravevski	1	57	16.51
Petre Lameski	2	61	13.84
Andrea Kulakov	3	98	14.79
Boro Jakimovski	4	75	10.05
Sonja Filiposka	5	59	13.13
Dimitar Trajanov	6	51	15.57
