Title
From Hyper-dimensional Structures to Linear Structures: Maintaining Deduplicated Data’s Locality
Abstract
Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem. This results from the gap between the hyper-dimensional structure of deduplicated data and the sequential nature of many storage devices, and this leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g., rewriting) or caching data in memory or SSD, but fragmentation continues to lower restore and GC performance.Investigating the locality issue, we design a method to flatten the hyper-dimensional structured deduplicated data to a one-dimensional format, which is based on classification of each chunk’s lifecycle, and this creates our proposed data layout. Furthermore, we present a novel management-friendly deduplication framework, called MFDedup, that applies our data layout and maintains locality as much as possible. Specifically, we use two key techniques in MFDedup: Neighbor-duplicate-focus indexing (NDF) and Across-version-aware Reorganization scheme (AVAR). NDF performs duplicate detection against a previous backup, then AVAR rearranges chunks with an offline and iterative algorithm into a compact, sequential layout, which nearly eliminates random I/O during file restores after deduplication.Evaluation results with five backup datasets demonstrate that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12× to 2.19× higher and restore throughputs that are 1.92× to 10.02× faster due to the improved data layout. While the rearranging stage introduces overheads, it is more than offset by a nearly-zero overhead GC process. Moreover, the NDF index only requires indices for two backup versions, while the traditional index grows with the number of versions retained.
Year
DOI
Venue
2022
10.1145/3507921
ACM Transactions on Storage
Keywords
DocType
Volume
Fragmentation, restore, garbage collection
Journal
18
Issue
ISSN
Citations 
3
1553-3077
0
PageRank 
References 
Authors
0.34
0
6
Name
Order
Citations
PageRank
Xiangyu Zou145.52
Jingsong Yuan200.34
Philip Shilane300.34
Wen Xia429220.79
Haijun Zhang501.01
Xuan Wang629157.12