Helping scientists reconnect their datasets - Citegraph

Paper Info

Title
Helping scientists reconnect their datasets

Abstract
It seems inevitable that the datasets associated with a research project proliferate over time: collaborators may extend datasets with new measurements and new attributes, new experimental runs result in new files with similar structures, and subsets of data are extracted for independent analysis. As these "residual" datasets begin to accrete over time, scientists can lose track of the derivation history that connects them, complicating data sharing, provenance tracking, and scientific reproducibility. In this paper, focusing on data in spreadsheets, we consider how observable relationships between two datasets can help scientists recall their original derivation connection. For instance, if dataset A is wholly contained in dataset B, B may be a more recent version of A and should be preferred when archiving or publishing. We articulate a space of relevant relationships, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery. Our evaluation shows that existing approaches that rely on flagging differences between two spreadsheets are impractical for many relationship-discovery tasks, and a user study shows that ReConnect can improve scientists' ability to detect useful relationships and subsequently identify the best dataset for a given task.
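
The containment example in the abstract (dataset A wholly contained in dataset B) can be illustrated with a minimal Python sketch. This is not ReConnect's algorithm, which the abstract does not describe. It is a naive check under stated assumptions: each spreadsheet is held in memory as a header list plus a list of row tuples, columns are matched by exact name, and cells are compared by exact equality. The function names `project` and `contained_in` and the sample data are hypothetical.

```python
from collections import Counter


def project(header, rows, columns):
    """Keep only the named columns of each row, in the order given."""
    idx = [header.index(c) for c in columns]
    return [tuple(row[i] for i in idx) for row in rows]


def contained_in(a_header, a_rows, b_header, b_rows):
    """Return True if A is wholly contained in B.

    Assumption: "contained" means every column of A exists in B and every
    row of A appears in B once B is projected onto A's columns. Duplicate
    rows are counted, so A may not hold more copies of a row than B does.
    """
    if not set(a_header) <= set(b_header):
        return False
    a_counts = Counter(tuple(row) for row in a_rows)
    b_counts = Counter(project(b_header, b_rows, a_header))
    return all(b_counts[row] >= n for row, n in a_counts.items())


# Hypothetical data: B extends A with one new attribute and one new row.
a_header = ["site", "temp"]
a_rows = [("S1", 10.5), ("S2", 11.0)]
b_header = ["site", "temp", "salinity"]
b_rows = [("S1", 10.5, 30.1), ("S2", 11.0, 29.8), ("S3", 9.7, 31.0)]

a_in_b = contained_in(a_header, a_rows, b_header, b_rows)
b_in_a = contained_in(b_header, b_rows, a_header, a_rows)
```

Here `a_in_b` is true and `b_in_a` is false, which under the abstract's reasoning would suggest that B may be the more recent version. Real spreadsheets would likely need tolerance for renamed columns, reformatted values, and reordered rows, none of which this sketch attempts.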

Year	DOI	Venue
2014	10.1145/2618243.2618263	SSDBM
Keywords	Field	DocType
general,relationship identification,scientific data management,spreadsheets	Data science,Data mining,Computer science,Flagging,Data sharing,Database	Conference
Citations	PageRank	References
4	0.45	12
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Abdussalam Alawini	1	22	4.45
David Maier	2	5639	1666.90
Kristin Tufte	3	1241	146.09
Bill Howe	4	1520	94.44
