DPASF: a flink library for streaming data preprocessing - Citegraph

Paper Info

Title
DPASF: a flink library for streaming data preprocessing

Abstract
Data preprocessing techniques are devoted to correcting or alleviating errors in data. Discretization and feature selection are two of the most widely used data preprocessing techniques. Although there are many proposals for static Big Data preprocessing, little research addresses the continuous (streaming) Big Data problem. Apache Flink is a recent Big Data framework, following the MapReduce paradigm, focused on distributed stream and batch data processing. In this paper, we propose a data stream library for Big Data preprocessing, named DPASF, under Apache Flink. The library is composed of six of the most popular and widely used data preprocessing algorithms: three for discretization and three for feature selection. The algorithms have been tested using two Big Data datasets. Experimental results show that preprocessing can not only reduce the size of the data, but also maintain or even improve the original accuracy in a short period of time. DPASF contains algorithms that are useful when dealing with Big Data streams. The preprocessing algorithms included in the library are able to tackle large datasets efficiently and to correct imperfections in the data.

Year	DOI	Venue
2018	10.1186/s41044-019-0041-8	Big Data Analytics
Keywords	Field	DocType
Flink, Big data, Machine learning, Data preprocessing	Data mining, Discretization, Data stream mining, Data processing, Feature selection, Computer science, Data stream, Data pre-processing, Preprocessor, Big data	Journal
Volume	Issue	ISSN
4	1	2058-6345
Citations	PageRank	References
0	0.34	0
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Alejandro Alcalde-Barros	1	0	0.34
Diego García-Gil	2	19	2.69
Salvador García	3	4151	118.45
Francisco Herrera	4	27391	1168.49
