Title
Towards efficient data search and subsetting of large-scale atmospheric datasets
Abstract
Discovering the correct dataset efficiently is critical for effective simulations in the atmospheric sciences. Unlike text-based web documents, many large scientific datasets contain binary-encoded data that is hard to discover using popular search engines. In the atmospheric sciences, there has been significant growth in public data hosting services. However, the ability to index and search has been limited by the metadata provided by the data host. We have developed an infrastructure, the Atmospheric Data Discovery System (ADDS), that provides an efficient data discovery environment for observational datasets in the atmospheric sciences. To support complex querying capabilities, we automatically extract and index fine-grained metadata. Datasets are indexed based on periodic crawling of popular sites and of files requested by users. Users can access subsets of a large dataset through our data customization feature. Our focus is the overall architecture, the data subsetting scheme, and a performance evaluation of our system.
Year
2012
DOI
10.1016/j.future.2011.05.010
Venue
Future Generation Comp. Syst.
Keywords
efficient data discovery environment,large-scale datasets,atmospheric sciences,discovery,towards efficient data search,cloud computing,efficient fashion,correct dataset,data host,large dataset,large-scale atmospheric datasets,binary encoded data,data customization feature,atmospheric science,public data,index fine-grained metadata,metadata
Field
Metadata,Metadata repository,Data mining,Data discovery,Crawling,Search engine,Information retrieval,Computer science,Data element,Personalization,Cloud computing
DocType
Journal
Volume
28
Issue
1
ISSN
Future Generation Computer Systems
Citations
4
PageRank
0.40
References
8
Authors
3
Name, Order, Citations, PageRank
Sangmi Lee Pallickara, 1, 170, 24.46
Shrideep Pallickara, 2, 837, 92.72
Milija Zupanski, 3, 10, 1.36