Title
Tutorial on Practical Tips of the Most Influential Data Preprocessing Algorithms in Data Mining
Abstract
Data preprocessing is a major and essential stage whose main goal is to obtain final data sets that can be considered correct and useful for further data mining algorithms. This paper summarizes the most influential data preprocessing algorithms according to their usage, popularity and extensions proposed in the specialized literature. For each algorithm, we provide a description, a discussion of its impact, and a review of current and further research on it. These algorithms cover missing values imputation, noise filtering, dimensionality reduction (including feature selection and space transformations), instance reduction (including selection and generation), discretization, and the treatment of imbalanced data. All of them are among the most important topics in data preprocessing research and development. This paper emphasizes the most well-known preprocessing methods and their practical study, selected following a recent, general book on data preprocessing that does not cover them in depth. This manuscript also presents an illustrative study in two sections with different data sets that provide useful tips for the use of preprocessing algorithms. First, we graphically present the effects of the preprocessing methods on two benchmark data sets, from which the reader may gain insight into the different characteristics and outcomes they generate. Second, we use a real-world problem presented in the ECDBL'2014 Big Data competition to provide a thorough analysis of the application of some preprocessing techniques, their combination and their performance. As a result, five different cases are analyzed, providing tips that may be useful for readers.
Year
2016
DOI
10.1016/j.knosys.2015.12.006
Venue
Knowledge-Based Systems
Keywords
Data preprocessing, Data reduction, Missing values imputation, Noise filtering, Dimensionality reduction, Instance reduction, Discretization, Data mining
Field
Data mining, Data set, Dimensionality reduction, Feature selection, Computer science, Data pre-processing, Artificial intelligence, Missing data, Algorithm, Preprocessor, Imputation (statistics), Big data, Machine learning
DocType
Journal
Volume
98
Issue
C
ISSN
0950-7051
Citations
31
PageRank
0.81
References
125
Authors
3
Name
Order
Citations
PageRank
Salvador García 1 121934.57
Julian Luengo 2 241877.15
Francisco Herrera 3 273911168.49