Title
Efficient Metadata Generation to Enable Interactive Data Discovery over Large-Scale Scientific Data Collections
Abstract
Discovering the correct dataset efficiently is critical for computations and effective simulations in scientific experiments. In contrast to searching web documents over the Internet, massive binary datasets are difficult to browse or search. Users must select a reliable data publisher from the large collection of data services available over the Internet. Once a publisher is selected, the user must then discover the dataset that matches the computation's needs among tens of thousands of large data packages that are available. Some of the data hosting services provide advanced data search interfaces, but their search scope is often limited to local datasets. Because scientific datasets are often encoded in binary data formats, querying or validating missing data over hundreds of megabytes of a binary file involves a compute-intensive decoding process. We have developed a system, GLEAN, that provides an efficient data discovery environment for users in scientific computing. Fine-grained metadata is automatically extracted to provide a micro view and profile of the large dataset to the users. We have used the Granules cloud runtime to orchestrate the MapReduce computations that extract metadata from the datasets. Here we focus on the overall architecture of the system and how it enables efficient data discovery. We applied our framework to a data discovery application in the atmospheric science domain. This paper includes a performance evaluation with observational datasets.
Year: 2010
DOI: 10.1109/CloudCom.2010.99
Venue: CloudCom
Keywords: efficient data discovery environment, data service, large data package, large-scale scientific data collections, local datasets, data discovery application, enable interactive data discovery, reliable data publisher, efficient metadata generation, missing data, binary data format, efficient data discovery, advanced data search interface, metadata, computational modeling, scientific data, web services, service provider, data mining, atmospheric science, atmospheric sciences, internet, data analysis, decoding, data services, atmospheric modeling, scientific computing, feature extraction, cloud computing, data visualisation, databases, meta data
Field: Metadata, Data mining, Data discovery, Metadata repository, Data visualization, Information retrieval, Computer science, Data element, Missing data, Cloud computing, The Internet
DocType: Conference
Citations: 4
PageRank: 0.47
References: 9
Authors: 4
Name                   Order  Citations  PageRank
Sangmi Lee Pallickara  1      170        24.46
Shrideep Pallickara    2      837        92.72
Milija Zupanski        3      10         1.36
Stephen Sullivan       4      4          0.47