Tutorial on Practical Tips of the Most Influential Data Preprocessing Algorithms in Data Mining - Citegraph

Paper Info

Title
Tutorial on Practical Tips of the Most Influential Data Preprocessing Algorithms in Data Mining

Abstract
Data preprocessing is a major and essential stage whose main goal is to obtain final data sets that can be considered correct and useful for further data mining algorithms. This paper summarizes the most influential data preprocessing algorithms according to their usage, popularity and extensions proposed in the specialized literature. For each algorithm, we provide a description, a discussion on its impact, and a review of current and further research on it. These most influential algorithms cover missing values imputation, noise filtering, dimensionality reduction (including feature selection and space transformations), instance reduction (including selection and generation), discretization and the preprocessing of imbalanced data. They are all among the most important topics in data preprocessing research and development. This paper emphasizes the most well-known preprocessing methods and their practical study, selected following a recent, generic book on data preprocessing that does not cover them in depth. This manuscript also presents an illustrative study in two sections with different data sets that provides useful tips for the use of preprocessing algorithms. First, we graphically present the effects of the preprocessing methods on two benchmark data sets. The reader may find useful insights on the different characteristics and outcomes generated by them. Second, we use a real-world problem presented in the ECDBL'2014 Big Data competition to provide a thorough analysis of the application of some preprocessing techniques, their combination and their performance. As a result, five different cases are analyzed, providing tips that may be useful for readers.

Year	DOI	Venue
2016	10.1016/j.knosys.2015.12.006	Knowledge-Based Systems
Keywords	Field	DocType
Data preprocessing,Data reduction,Missing values imputation,Noise filtering,Dimensionality reduction,Instance reduction,Discretization,Data mining	Data mining,Data set,Dimensionality reduction,Feature selection,Computer science,Data pre-processing,Artificial intelligence,Missing data,Algorithm,Preprocessor,Imputation (statistics),Big data,Machine learning	Journal
Volume	Issue	ISSN
98	C	0950-7051
Citations	PageRank	References
31	0.81	125
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Salvador García	1	1219	34.57
Julian Luengo	2	2418	77.15
Francisco Herrera	3	27391	1168.49
