Title
Large-scale seismic waveform quality metric calculation using Hadoop.
Abstract
In this work we investigated the suitability of Hadoop MapReduce and Apache Spark for large-scale computation of seismic waveform quality metrics by comparing their performance with that of a traditional distributed implementation. The Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) provided 43 terabytes of broadband waveform data, of which 5.1 TB were processed with the traditional architecture and the full 43 TB were processed using MapReduce and Spark. A maximum throughput of ~0.56 terabytes per hour was achieved using all five nodes of the traditional implementation. We noted that I/O dominated processing and that I/O performance deteriorated with the addition of the fifth node. Data collected from this experiment provided the baseline against which the Hadoop results were compared. Next, we processed the full 43 TB dataset using both MapReduce and Apache Spark on our 18-node Hadoop cluster. These experiments were conducted multiple times with various subsets of the data so that we could build models to predict performance as a function of dataset size. We found that both MapReduce and Spark significantly outperformed the traditional reference implementation. At a dataset size of 5.1 terabytes, both Spark and MapReduce were about 15 times faster than the reference implementation. Furthermore, our performance models predict that for a dataset of 350 terabytes, Spark running on a 100-node cluster would be about 265 times faster than the reference implementation. We do not expect that the reference implementation deployed on a 100-node cluster would perform significantly better than on the 5-node cluster, because its I/O performance cannot be made to scale. Finally, we note that although Big Data technologies clearly provide a way to process seismic waveform datasets in a high-performance and scalable manner, the technology is still changing rapidly, requires a high degree of investment in personnel, and will likely require significant changes in other parts of our infrastructure. Nevertheless, we anticipate that as the technology matures and third-party tool vendors make it easier to manage and operate clusters, Hadoop (or a successor) will play a large role in our seismic data processing.
Highlights
MapReduce and Spark are evaluated for calculating signal metrics at scale using a 43-terabyte dataset from the IRIS archive.
We implemented a reference architecture with which we could compare the MapReduce and Spark solutions.
Spark and MapReduce were 15 times faster than our reference architecture on a 5.1-terabyte dataset.
Our model forecasts Apache Spark to be 256 times faster than our reference implementation on a 100-node cluster.
Big Data technologies can be leveraged for I/O-intensive workloads like the generation of signal metrics.
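The abstract does not describe how the quality metrics themselves were computed on the cluster. The following is a minimal PySpark sketch of the kind of job the paper evaluates, assuming the archive is stored as miniSEED files on HDFS and that ObsPy is installed on every worker; the HDFS paths, the specific metrics, and the function names are illustrative, not taken from the paper.

```python
# Hypothetical sketch: per-file waveform quality metrics with Apache Spark.
# Assumptions (not from the paper): miniSEED files on HDFS, ObsPy on workers.
import io

import numpy as np
from obspy import read
from pyspark import SparkContext


def waveform_metrics(name_and_bytes):
    """Parse one miniSEED file and emit (trace id, metrics) pairs."""
    name, raw = name_and_bytes
    try:
        stream = read(io.BytesIO(raw))  # ObsPy autodetects the miniSEED format
    except Exception:
        return [(name, {"readable": False})]
    results = []
    for tr in stream:
        data = tr.data.astype(np.float64)
        results.append((tr.id, {
            "readable": True,
            "npts": int(tr.stats.npts),
            "sampling_rate": float(tr.stats.sampling_rate),
            "rms": float(np.sqrt(np.mean(data ** 2))) if data.size else 0.0,
            "max_abs": float(np.max(np.abs(data))) if data.size else 0.0,
        }))
    return results


if __name__ == "__main__":
    sc = SparkContext(appName="waveform-quality-metrics")
    # binaryFiles yields one (path, contents) pair per file, so each file is
    # parsed exactly once on whichever worker holds it.
    files = sc.binaryFiles("hdfs:///data/waveforms/")      # illustrative path
    metrics = files.flatMap(waveform_metrics)
    metrics.saveAsTextFile("hdfs:///data/waveform_metrics/")  # illustrative path
    sc.stop()
```

Because each input file becomes a single record, the parsing and metric computation move to the data rather than the data moving to a central file server, which is consistent with the abstract's observation that the traditional implementation was limited by I/O rather than by compute.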
Year
2016
DOI
10.1016/j.cageo.2016.05.012
Venue
Computers & Geosciences
Field
Data mining, Data processing, Spark (mathematics), Terabyte, Computer science, Real-time computing, Reference implementation, Reference architecture, Data management, Big data, Scalability
DocType
Journal
Volume
94
Issue
C
ISSN
0098-3004
Citations
2
PageRank
0.38
References
3
Authors
5
Name                  Order  Citations  PageRank
Steven Magaña-Zook    1      2          0.38
Jessie M. Gaylord     2      2          0.38
Douglas R. Knapp      3      2          0.38
Douglas A. Dodge      4      2          0.38
Stan D. Ruppert       5      2          0.38