Title
Towards automated prediction of relationships among scientific datasets
Abstract
Before scientists can analyze, publish, or share their data, they often need to determine how their datasets are related. Determining relationships helps scientists identify the most complete version of a dataset, detect versions of datasets that complement each other, and identify datasets that overlap. In previous work, we showed how observable relationships between two datasets help scientists recall their original derivation connection. While that work helped with identifying relationships between two datasets, it is infeasible for scientists to use it to find relationships between all possible pairs in a large collection of datasets. To deal with larger numbers of datasets, we are extending our methodology with a relationship-prediction system, ReDiscover, a tool that identifies the pairs in a collection of datasets that are most likely related, along with the relationship between them. We report on the initial design of ReDiscover, which applies machine-learning methods such as Conditional Random Fields and Support Vector Machines to the relationship-discovery problem. Our preliminary evaluation shows that ReDiscover predicted relationships with an average accuracy of 87%.
Year
2015
DOI
10.1145/2791347.2791358
Venue
International Conference on Scientific and Statistical DB Management
Keywords
Data Extraction, Data Profiling, Schema Matching, Conditional Random Fields (CRFs), Support Vector Machines (SVMs)
Field
Conditional random field, Publication, Data mining, Computer science, Support vector machine, Data profiling, Data extraction, Schema matching, Database, Support vector machines svms
DocType
Conference
Citations
0
PageRank
0.34
References
9
Authors
5
Name                 Order  Citations  PageRank
Abdussalam Alawini   1      22         4.45
David Maier          2      5639       1666.90
Kristin Tufte        3      1241       146.09
Bill Howe            4      1520       94.44
Rashmi Nandikur      5      0          0.34