Title
Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties.
Abstract
We study a generalization of deduplication, which enables lossless deduplication of highly similar data and show that classic deduplication with fixed chunk length is a special case. We provide bounds on the expected length of coded sequences for generalized deduplication and show that the coding has asymptotic near-entropy cost under the proposed source model. More importantly, we show that generalized deduplication allows for multiple orders of magnitude faster convergence than classic deduplication. This means that generalized deduplication can provide compression benefits much earlier than classic deduplication, which is key in practical systems. Numerical examples demonstrate our results, showing that our lower bounds are achievable, and illustrating the potential gain of using the generalization over classic deduplication. In fact, we show that even for a simple case of generalized deduplication, the gain in convergence speed is linear with the size of the data chunks.
Year
DOI
Venue
2019
10.1109/globecom38437.2019.9014012
IEEE Global Communications Conference
Field
DocType
Volume
Convergence (routing),Data deduplication,Mathematical optimization,Algorithm,Coding (social sciences),Source model,Mathematics,Special case,Lossless compression
Journal
abs/1901.02720
ISSN
Citations 
PageRank 
2334-0983
0
0.34
References 
Authors
6
3
Name
Order
Citations
PageRank
Rasmus Vestergaard113.40
Qi Zhang2137.05
Daniel E. Lucani323642.29