Title
IC-Data: Improving Compressed Data Processing in Hadoop.
Abstract
As dataset sizes for data analytics and scientific applications running on Hadoop increase, data compression has become essential for storing this data at a reasonable cost. Although data is often stored compressed, Hadoop currently takes 49% longer to process compressed data than uncompressed data. Processing compressed data reduces task parallelism and creates uneven workload distribution, both of which are fundamental issues that the MapReduce parallel programming paradigm should alleviate. In this paper, we propose the design and implementation of a Network Overlapped Compression scheme, NOC, and a Compression Aware Storage scheme, CAS. NOC reduces data load time and hides compression overhead by interleaving network I/O with compression. CAS increases parallelism by dynamically changing a file's block size based on its compression ratio. Additionally, we develop a MapReduce Module that recognizes the characteristics of compressed data to improve resource allocation and load balance. Collectively, NOC, CAS, and the MapReduce Module decrease job execution time by 66% on average and data load time by 31%.
Year: 2015
DOI: 10.1109/HiPC.2015.28
Venue: HiPC
Keywords: Big Data Processing, Data Compression, Hadoop
Field: Computer science, Task parallelism, Load balancing (computing), Parallel computing, Compression ratio, Resource allocation, Distributed database, Data compression, Interleaving, Uncompressed video
DocType: Conference
Citations: 0
PageRank: 0.34
References: 13
Authors (5)
Name            Order   Citations   PageRank
Adnan Haider    1       19          2.39
Xi Yang         2       75          5.61
Ning Liu        3       24          0.98
Xian-he Sun     4       1987        182.64
Shuibing He     5       109         20.45