Title
A Large-scale Data Set and an Empirical Study of Docker Images Hosted on Docker Hub
Abstract
Docker is currently one of the most popular containerization solutions. Previous work investigated various characteristics of the Docker ecosystem, but has mainly focused on Dockerfiles from GitHub, limiting the type of questions that can be asked, and did not investigate evolution aspects. In this paper, we create a recent and more comprehensive data set by collecting data from Docker Hub, GitHub, and Bitbucket. Our data set contains information about 3,364,529 Docker images and 378,615 git repositories behind them. Using this data set, we conduct a large-scale empirical study with four research questions where we reproduce previously explored characteristics (e.g., popular languages and base images), investigate new characteristics such as image tagging practices, and study evolution trends. Our results demonstrate the maturity of the Docker ecosystem: we find more reliance on ready-to-use language and application base images as opposed to yet-to-be-configured OS images, a downward trend of Docker image sizes demonstrating the adoption of best practices of keeping images small, and a declining trend in the number of smells in Dockerfiles suggesting a general improvement in quality. On the downside, we find an upward trend in using obsolete OS base images, posing security risks, and find problematic usages of the latest tag, including version lagging. Overall, our results bring good news such as more developers following best practices, but they also indicate the need to build tools and infrastructure embracing new trends and addressing potential issues.
Year
2020
DOI
10.1109/ICSME46990.2020.00043
Venue
2020 IEEE International Conference on Software Maintenance and Evolution (ICSME)
Keywords
Docker images, Docker Hub, popular containerization solutions, Docker ecosystem, GitHub, git repositories, large-scale empirical study, explored characteristics, popular languages, image tagging practices, evolution trends, application base images, obsolete OS base images, large-scale data set
DocType
Conference
ISSN
1063-6773
ISBN
978-1-7281-5620-0
Citations
0
PageRank
0.34
References
0
Authors
3
Name              Order  Citations  PageRank
Changyuan Lin     1      0          0.34
Sarah Nadi        2      375        24.37
Hamzeh Khazaei    3      223        17.82