Title
Shock: Active Storage for Multicloud Streaming Data Analysis
Abstract
Access to data plays a major role in designing and performing efficient data computation and analyses in a distributed environment. Existing approaches access data via a variety of methods and offer various benefits and drawbacks depending on the use case. Our original use case was the computational analysis of environmental sequence data, or metagenomics. Other workflows often reduce the dataset size dramatically within the first few processing steps, owing to biologically motivated data compression. Metagenomic data, by contrast, compresses poorly, so metagenomic workflows add to the size of the dataset along the processing pipeline. Thus, wide-area, high-throughput access to the data is essential. To address this problem, we developed Shock, a data store for files, their associated metadata, and indexes that allow Shock to provide different views into the data. Shock comprises three major components: a web service that provides a RESTful API, backend data storage for files, and storage for object metadata. Shock has proven to be a stable data store for MG-RAST, an application that served over 40,000 users in 2014 on a server that houses more than 3 million data objects. Moreover, Shock provides both subselection and high-performance file transfer capabilities that serve most use cases.
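The idea of an index that provides a view into a stored file can be illustrated with a minimal in-memory sketch. This is not Shock's implementation or API: the Node class, the build_record_index and subselect methods, and the FASTA-style ">" record marker are all hypothetical names chosen for illustration, assuming only what the abstract states (an object has file data, metadata, and indexes that support subselection).

```python
import io


class Node:
    """Hypothetical stand-in for a stored object: file bytes, metadata, indexes."""

    def __init__(self, data: bytes, metadata: dict):
        self.data = data          # file content (a real store keeps this on disk)
        self.metadata = metadata  # free-form object metadata
        self.indexes = {}         # index name -> list of (offset, length) pairs

    def build_record_index(self, name="record", marker=b">"):
        """Store the byte range of every record whose first line starts with marker."""
        starts = []
        offset = 0
        for line in io.BytesIO(self.data):
            if line.startswith(marker):
                starts.append(offset)
            offset += len(line)
        ends = starts[1:] + [len(self.data)]
        self.indexes[name] = [(s, e - s) for s, e in zip(starts, ends)]

    def subselect(self, name, first, last):
        """Return records first..last (1-based, inclusive) using only the index."""
        parts = self.indexes[name][first - 1:last]
        return b"".join(self.data[o:o + n] for o, n in parts)


# Illustrative data only: three tiny FASTA-style records.
fasta = b">seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAAAA\n"
node = Node(fasta, {"project": "example", "file_format": "fasta"})
node.build_record_index()
view = node.subselect("record", 2, 3)
```

Because the index stores byte ranges, a client that needs only records 2 and 3 receives those bytes without the rest of the file, which is the kind of subselection the abstract describes.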
Year: 2015
DOI: 10.1109/BDC.2015.40
Venue: 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC)
Keywords: bioinformatics, metagenomics, active object store, distributed wide-area computing
Field: Data warehouse, Metadata repository, Metadata, Data element, Computer science, Data mapping, Server, File transfer, Data access, Database
DocType: Conference
Citations: 1
PageRank: 0.39
References: 3
Authors (9)
Name                Order  Citations  PageRank
Andreas Wilke       1      314        23.84
Wolfgang Gerlach    2      81         7.03
Travis Harrison     3      63         5.58
T Paczian           4      221        21.05
Wei Tang            5      1          0.39
William L. Trimble  6      1          0.39
Jared Wilkening     7      48         3.77
Narayan Desai       8      319        29.73
Folker Meyer        9      484        51.83