Title
Analysis Of Data Persistence In Collaborative Content Creation Systems: The Wikipedia Case
Abstract
A very common problem in designing caching/prefetching systems, distribution networks, search engines, and web crawlers is determining how long a given piece of content lasts before being updated, i.e., its update frequency. Indeed, while some content is not frequently updated (e.g., videos), in other cases revisions periodically invalidate content. In this work, we present an analysis of Wikipedia, currently the 5th most visited website in the world, evaluating the statistics of updates of its pages and their relationship with page view statistics. We discovered that the number of updates of a page follows a lognormal distribution. We provide fitting parameters as well as a goodness-of-fit analysis, showing that the model describes the empirical data with statistical significance. We perform an analysis of the views-updates relationship, showing that over a period of one month there is no evident correlation between the most updated pages and the most viewed pages. However, observing specific pages, we show that there is a strong correlation between the peaks of views and updates, and we find that in more than 50% of cases, the time difference between the two peaks is less than a week. This reflects the underlying process whereby an event causes both an update peak and a visit peak, which occur with different time delays. This behavior can pave the way for predictive traffic analysis applications based on content update statistics. Finally, we show how the model can be used to evaluate the performance of an in-network caching scenario.
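As a rough illustration of the lognormal fitting and goodness-of-fit step mentioned in the abstract, the sketch below fits a lognormal by maximum likelihood and computes a Kolmogorov-Smirnov distance. It is not the authors' code: the record does not state which estimator or test the paper uses, the data here are synthetic (drawn with assumed parameters mu = 2.0 and sigma = 1.0), and the function names are hypothetical.

```python
import math
import random


def fit_lognormal(samples):
    """Maximum-likelihood estimates (mu, sigma) of a lognormal distribution."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal distribution with log-mean mu and log-std sigma."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def ks_statistic(samples, mu, sigma):
    """Largest gap between the empirical CDF and the fitted lognormal CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = lognormal_cdf(x, mu, sigma)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic stand-in for per-page update counts (assumption, not Wikipedia data).
random.seed(0)
updates = [random.lognormvariate(2.0, 1.0) for _ in range(5000)]

mu_hat, sigma_hat = fit_lognormal(updates)
d_stat = ks_statistic(updates, mu_hat, sigma_hat)
print(f"mu={mu_hat:.3f} sigma={sigma_hat:.3f} KS distance={d_stat:.4f}")
```

A small KS distance indicates that the fitted curve tracks the empirical distribution closely. Real update counts are integers and may be heavy-tailed, so a fit on actual data would need more care than this synthetic example.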
Year: 2019
DOI: 10.3390/info10110330
Venue: INFORMATION
Keywords: Wikipedia, real-data statistics, update statistics, popularity, caching, content revisions
Field: Data mining, Traffic analysis, Data analysis, Computer science, Popularity, Correlation, Content creation, Log-normal distribution, Page view, Goodness of fit
DocType: Journal
Volume: 10
Issue: 11
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name, Order, Citations, PageRank
Lorenzo Bracciale, 1, 68, 11.88
Pierpaolo Loreti, 2, 93, 18.75
A. Detti, 3, 547, 47.83
Nicola Blefari-melazzi, 4, 251, 38.89