Title
Large-Scale Analysis of the Docker Hub Dataset
Abstract
Docker containers have become a prominent solution for supporting modern enterprise applications due to the highly desirable features of isolation, low overhead, and efficient packaging of the execution environment. Containers are created from images which are shared between users via a Docker registry. The amount of data Docker registries store is massive; for example, Docker Hub, a popular public registry, stores at least half a million public images. In this paper, we analyze over 167 TB of uncompressed Docker Hub images, characterize them using multiple metrics and evaluate the potential of file-level deduplication in Docker Hub. Our analysis helps to make conscious decisions when designing storage for containers in general and Docker registries in particular. For example, only 3% of the files in images are unique, which means file-level deduplication has a great potential to save storage space for the registry. Our findings can motivate and help improve the design of data reduction, caching, and pulling optimizations for registries.
Year
2019
DOI
10.1109/CLUSTER.2019.8891000
Venue
2019 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords
Docker containers, Docker registry, uncompressed Docker Hub images, file-level deduplication, Docker Hub dataset analysis, data reduction, data caching, container-based virtualization, container image storage, container image sharing, image analysis, image representation
Field
Virtualization, Data deduplication, Metadata, Computer science, Image coding, Database, Distributed computing, Uncompressed video, Virtual machining
DocType
Conference
ISSN
1552-5244
ISBN
978-1-7281-4735-2
Citations
0
PageRank
0.34
References
8
Authors
9
Name                Order  Citations  PageRank
Nannan Zhao         1      5          2.22
Vasily Tarasov      2      199        18.98
Hadeel Albahar      3      0          1.35
Ali Anwar           4      113        14.83
Lukas Rupprecht     5      60         10.88
Dimitrios Skourtis  6      7          2.30
Amit S. Warke       7      0          0.34
Mohamed Mohamed     8      5          2.25
Ali R. Butt         9      651        47.51