Title
Space savings and design considerations in variable length deduplication
Abstract
The explosion of data growth and the duplication of data in enterprises have led to the deployment of a variety of deduplication technologies. However, not all deduplication technologies serve the needs of every workload. Most prior research in deduplication concentrates on fixed-block-size deduplication (or variable block size at a fixed block boundary), which provides sub-optimal space efficiency in workloads where the duplicate data is not block-aligned. Workloads also differ in the nature of their operations and their priorities, which affects the choice of the right flavor of deduplication. Object workloads, for instance, hold multiple versions of archived documents that have a high degree of duplicate data. They are also write-once, read-many in nature and follow a whole-object GET, PUT and DELETE model, and would be better served by a deduplication strategy that handles non-block-aligned changes to data. In this paper, we describe and evaluate a hybrid of variable-length and block-based deduplication that is hierarchical in nature. We are motivated by the following insights from real-world data: (a) object workload applications do not modify data in place, and hence new versions of objects are written again as a whole; (b) a significant amount of data among different versions of the same object is shareable, but the changes are usually not block-aligned. While the second point is the basis for the variable-length technique, both insights motivate our hierarchical deduplication strategy. We show through experiments with production datasets from enterprise environments that this provides up to twice the space savings of fixed-block deduplication.
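The abstract's point about non-block-aligned changes can be illustrated with a small sketch. It is not the paper's hierarchical scheme, whose details are not given here. It only contrasts generic fixed-block chunking with a simple content-defined (variable-length) chunker. All function names, the chunk sizes, and the use of `zlib.crc32` over a short window (recomputed at each position instead of a true rolling hash) are assumptions made for illustration.

```python
import hashlib
import random
import zlib


def fixed_chunks(data, size=64):
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0x3F, min_size=16, max_size=256):
    """Content-defined chunking: cut where a fingerprint of the last
    `window` bytes has its low bits equal to zero (illustrative parameters)."""
    chunks = []
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        # Fingerprint of the trailing window; a production chunker would
        # use a rolling hash instead of recomputing it at every position.
        fp = zlib.crc32(data[max(0, i - window + 1):i + 1])
        if (fp & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def new_bytes(old, new, chunker):
    """Bytes of `new` whose chunks are not already stored for `old`."""
    seen = {hashlib.sha1(c).digest() for c in chunker(old)}
    return sum(len(c) for c in chunker(new)
               if hashlib.sha1(c).digest() not in seen)


if __name__ == "__main__":
    rng = random.Random(0)
    v1 = bytes(rng.getrandbits(8) for _ in range(8192))
    # A new version of the object with a small, non-aligned insertion.
    v2 = v1[:100] + b"changed" + v1[100:]
    print("fixed-block new bytes:   ", new_bytes(v1, v2, fixed_chunks))
    print("variable-length new bytes:", new_bytes(v1, v2, variable_chunks))
```

In this toy setup, the insertion shifts every later fixed block, so almost none of them match the stored version, while the content-defined boundaries tend to realign shortly after the change. The exact numbers depend on the assumed parameters and say nothing about the savings reported in the paper.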
Year
2012
DOI
10.1145/2421648.2421657
Venue
Operating Systems Review
Keywords
design consideration, fixed block boundary, deduplication technology, hierarchical deduplication strategy, variable block size, deduplication strategy, data growth, real world data, fixed block deduplication, duplicate data, variable length deduplication, space saving, fixed block size, deduplication
Field
Block size, Data deduplication, Software deployment, Workload, Computer science, Real-time computing, Fixed Block, Distributed computing
DocType
Journal
Volume
46
Issue
3
Citations
3
PageRank
0.42
References
14
Authors
2
Name | Order | Citations | PageRank
Giridhar Appaji Nag Yasa | 1 | 6 | 1.16
P. C. Nagesh | 2 | 49 | 4.38