SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering. - Citegraph

Paper Info

Title
SpreadCluster: Recovering Versioned Spreadsheets through Similarity-Based Clustering.

Abstract
Version information plays an important role in spreadsheet understanding, maintenance, and quality improvement. However, end users rarely use version control tools to document their spreadsheets' version information. As a result, this information is missing, and different versions of a spreadsheet coexist as individual, similar spreadsheets. Existing approaches try to recover spreadsheet version information by clustering these similar spreadsheets based on spreadsheet filenames or related email conversations. However, the applicability and accuracy of existing clustering approaches are limited because the necessary information (e.g., filenames and email conversations) is usually missing. We inspected the versioned spreadsheets in VEnron, which is extracted from the Enron Corporation. In VEnron, the different versions of a spreadsheet are clustered into an evolution group. We observed that the versioned spreadsheets in each evolution group exhibit certain common features (e.g., similar table headers and worksheet names). Based on this observation, we proposed an automatic clustering algorithm, SpreadCluster. SpreadCluster learns the criteria for these features from the versioned spreadsheets in VEnron, and then automatically clusters spreadsheets with similar features into the same evolution group. We applied SpreadCluster to all spreadsheets in the Enron corpus. The evaluation results show that SpreadCluster clusters spreadsheets with higher precision (78.5% vs. 59.8%) and recall (70.7% vs. 48.7%) than the filename-based approach used by VEnron. Based on the clustering result produced by SpreadCluster, we further created a new versioned spreadsheet corpus, VEnron2, which is much larger than VEnron (12,254 vs. 7,294 spreadsheets). We also applied SpreadCluster to two other spreadsheet corpora, FUSE and EUSES. The results show that SpreadCluster can cluster the versioned spreadsheets in these two corpora with high precision (91.0% and 79.8%).
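The idea of grouping spreadsheets by shared features can be illustrated with a minimal sketch. This is not the SpreadCluster algorithm itself: the abstract says the criteria are learned from VEnron, whereas here the Jaccard measure, the equal weights, the 0.6 threshold, and the single-linkage grouping are all assumptions chosen for illustration. The record layout (`sheets`, `headers`) and the sample filenames are hypothetical, and the filename is used only as a label, not as a clustering feature.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; an empty set is treated as dissimilar (0.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(s1, s2, w_names=0.5, w_headers=0.5):
    """Weighted mix of worksheet-name and table-header overlap (weights are assumed)."""
    return (w_names * jaccard(s1["sheets"], s2["sheets"])
            + w_headers * jaccard(s1["headers"], s2["headers"]))


def cluster(spreadsheets, threshold=0.6):
    """Group spreadsheets whose pairwise similarity reaches the (assumed) threshold."""
    parent = list(range(len(spreadsheets)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(spreadsheets)), 2):
        if similarity(spreadsheets[i], spreadsheets[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(spreadsheets):
        groups.setdefault(find(i), []).append(s["name"])
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sample data: two versions of one spreadsheet plus an unrelated one.
corpus = [
    {"name": "budget_v1.xls", "sheets": {"Summary", "Q1"},
     "headers": {"Item", "Cost", "Owner"}},
    {"name": "budget_final.xls", "sheets": {"Summary", "Q1", "Q2"},
     "headers": {"Item", "Cost", "Owner"}},
    {"name": "trades.xls", "sheets": {"Deals"},
     "headers": {"Counterparty", "Volume"}},
]
groups = cluster(corpus)
print(groups)
```

In this sketch the two budget files share all headers and most worksheet names, so they land in one group, while the unrelated file stays alone. A learned criterion, as the abstract describes, would replace the fixed weights and threshold.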

Year	2017
DOI	10.1109/MSR.2017.28
Venue	MSR
Keywords	spreadsheet, evolution, clustering, version
DocType	Conference
Volume	abs/1704.08476
ISSN	2160-1852
ISBN	978-1-5386-1545-4
Citations	3
PageRank	0.37
References	29
Authors	7

Authors (7 rows)

Name	Order	Citations	PageRank
Liang Xu	1	57	14.47
Wensheng Dou	2	100	15.17
Chushu Gao	3	72	9.84
Jie Wang	4	21	3.04
Jun Wei	5	582	88.35
Hua Zhong	6	81	14.80
Tao Huang	7	154	18.57
