Title
DPASF: a Flink library for streaming data preprocessing
Abstract
Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widely used data preprocessing techniques. Although there are many proposals for static Big Data preprocessing, little research has addressed the continuous Big Data problem. Apache Flink is a recent Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing. In this paper, we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. The library is composed of six of the most popular data preprocessing algorithms: three for discretization and three for feature selection. The algorithms have been tested on two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data, but also maintain or even improve the original accuracy in a short period of time. DPASF contains algorithms that are useful when dealing with Big Data streams. The preprocessing algorithms included in the library are able to handle Big Data datasets efficiently and to correct imperfections in the data.
Year: 2018
DOI: 10.1186/s41044-019-0041-8
Venue: Big Data Analytics
Keywords: Flink, Big data, Machine learning, Data preprocessing
Field: Data mining, Discretization, Data stream mining, Data processing, Feature selection, Computer science, Data stream, Data pre-processing, Preprocessor, Big data
DocType: Journal
Volume: 4
Issue: 1
ISSN: 2058-6345
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name                      Order  Citations  PageRank
Alejandro Alcalde-Barros  1      0          0.34
Diego García-Gil          2      19         2.69
Salvador García           3      4151       118.45
Francisco Herrera         4      27391      1168.49