Title
Towards a Better Replica Management for Hadoop Distributed File System
Abstract
The Hadoop Distributed File System (HDFS) is the storage of choice when it comes to large-scale distributed systems. In addition to being efficient and scalable, HDFS provides high throughput and reliability through the replication of data. Recent work exploits this replication feature by dynamically varying the replication factor of in-demand data as a means of increasing data locality and achieving a performance improvement. However, to the best of our knowledge, no study has been performed on the consequences of varying the replication factor. In particular, our work is the first to show that although HDFS deals well with increasing the replication factor, it experiences problems with decreasing it. This leads to unbalanced data, hot spots, and performance degradation. In order to address this problem, we propose a new workload-aware balanced replica deletion algorithm. We also show that our algorithm successfully maintains the data balance and achieves up to 48% improvement in execution time when compared to HDFS, while only creating an overhead of 1.69% on average.
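The balanced replica deletion idea described in the abstract can be illustrated with a minimal sketch (hypothetical data structures and function names; the paper's actual algorithm is workload-aware and operates inside HDFS): when lowering a block's replication factor, delete replicas from the nodes that currently store the most blocks, so per-node block counts stay balanced instead of drifting into hot spots.

```python
from collections import Counter

def balanced_replica_deletion(block_locations, block_id, target_replicas):
    """Drop replicas of `block_id` until only `target_replicas` remain,
    always deleting from the node that currently stores the most blocks.
    `block_locations` maps block_id -> set of node names.
    Illustrative sketch only; HDFS's default deletion policy does not
    balance this way, which is the problem the paper addresses."""
    # Count how many blocks each node stores across the whole map.
    load = Counter()
    for nodes in block_locations.values():
        load.update(nodes)
    replicas = block_locations[block_id]
    while len(replicas) > target_replicas:
        # Greedily delete the replica held by the most-loaded node.
        busiest = max(replicas, key=lambda n: load[n])
        replicas.remove(busiest)
        load[busiest] -= 1
    return block_locations

# Example: decrease block "b1" from 3 replicas to 1.
locations = {
    "b1": {"node1", "node2", "node3"},
    "b2": {"node1", "node2"},
    "b3": {"node1"},
}
balanced_replica_deletion(locations, "b1", 1)
# "b1" keeps its replica on the least-loaded node, node3.
```

The greedy choice here is the simplest balance-preserving rule; the paper's algorithm additionally weighs workload, but the effect sketched above is the same: deletions drain the busiest nodes first.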
Year
2018
DOI
10.1109/BigDataCongress.2018.00021
Venue
2018 IEEE International Congress on Big Data (BigData Congress)
Keywords
Hadoop Distributed File System, Replication Factor, Software Performance
Field
Distributed File System, Data mining, Replica, Locality, Computer science, Throughput, Cluster analysis, Big data, Performance improvement, Scalability, Distributed computing
DocType
Conference
ISSN
2379-7703
ISBN
978-1-5386-7233-4
Citations
3
PageRank
0.47
References
0
Authors
5
Name | Order | Citations | PageRank
Hilmi Egemen Ciritoglu | 1 | 3 | 1.14
Takfarinas Saber | 2 | 25 | 4.90
Teodora Sandra Buda | 3 | 26 | 7.50
John Murphy | 4 | 597 | 52.43
Christina Thorpe | 5 | 53 | 9.00