Title
Towards Cluster-wide Deduplication Based on Ceph
Abstract
In this paper, we design an efficient deduplication algorithm based on the distributed storage architecture of Ceph. The algorithm uses on-line block-level data deduplication to slice the data, and it neither affects the data storage process in Ceph nor alters other Ceph interfaces and functions. Without relying on any central node, the algorithm preserves the characteristics of Ceph by designing a special hash object to store the data fingerprint, and it uses the CRUSH algorithm to detect duplicate data by computation rather than by global search. The algorithm replaces duplicate data with deduplicated objects, which store their fingerprints and take up less storage space. Through experimental studies, we compare the effects of different block sizes on performance and deduplication rate, and select the most appropriate block size for our prototype implementation. The experimental results show that the algorithm not only saves storage space effectively but also improves bandwidth utilization when reading and writing duplicate data.
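The idea of detecting duplicates by computation rather than by global search can be illustrated with a minimal sketch. It is not the paper's implementation: the fixed-size chunking, the SHA-256 fingerprint, the tiny `CHUNK_SIZE`, the `Cluster` class, and the `place` function (rendezvous hashing used here as a simple stand-in for CRUSH) are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; hypothetical and deliberately tiny for illustration


def fingerprint(chunk):
    """Content fingerprint of a chunk (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


def place(fp, node_names):
    """Stand-in for CRUSH: map a fingerprint to a node by rendezvous hashing.

    Any client can compute this locally, so no central index is consulted.
    """
    return max(
        node_names,
        key=lambda name: hashlib.sha256((name + fp).encode()).digest(),
    )


class Cluster:
    """Toy cluster: each node is a dict from fingerprint to chunk bytes."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}

    def write(self, data):
        """Slice data into chunks and return the list of fingerprints."""
        names = list(self.nodes)
        recipe = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            fp = fingerprint(chunk)
            store = self.nodes[place(fp, names)]
            # Duplicate check is a lookup on the one computed node only.
            if fp not in store:
                store[fp] = chunk
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        """Reassemble data by recomputing each chunk's location."""
        names = list(self.nodes)
        return b"".join(self.nodes[place(fp, names)][fp] for fp in recipe)

    def stored_chunks(self):
        """Number of unique chunks physically stored across all nodes."""
        return sum(len(store) for store in self.nodes.values())
```

In this sketch, writing data with repeated blocks stores each distinct block once, and the returned fingerprint list plays the role of the small object that stands in for the original data. A larger `CHUNK_SIZE` lowers the per-chunk overhead but tends to find fewer duplicates, which is the trade-off the abstract describes evaluating.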
Year: 2019
DOI: 10.1109/NAS.2019.8834729
Venue: 2019 IEEE International Conference on Networking, Architecture and Storage (NAS)
Keywords: deduplication, distributed storage system, Ceph
Field: Data deduplication, Block size, Computer data storage, Computer science, Slicing, Parallel computing, Distributed data store, Computer network, Fingerprint, Hash function, Central node
DocType: Conference
ISBN: 978-1-7281-4410-8
Citations: 1
PageRank: 0.35
References: 8
Authors: 7
Name            Order  Citations  PageRank
Jinpeng Wang    1      1          0.35
Yang Wang       2      188        45.73
Hekang Wang     3      1          0.35
Kejiang Ye      4      285        26.07
Z. Chen         5      3443       271.62
Shuibing He     6      109        20.45
Lingfang Zeng   7      365        33.99