Title
Evaluating different distributed-cyber-infrastructure for data and compute intensive scientific application
Abstract
Scientists are increasingly using state-of-the-art big data analytics software (e.g., Hadoop and Giraph) for their data-intensive applications in HPC environments. However, understanding and designing the hardware environment that these data- and compute-intensive applications require for good performance is challenging. With this motivation, we evaluated the performance of big data software on three different distributed-cyber-infrastructures: a traditional HPC-cluster called SuperMikeII, a regular datacenter called SwatIII, and a novel MicroBrick-based hyperscale system called CeresII, using our own benchmark, the Parallel Genome Assembler (PGA). PGA is built atop Hadoop and Giraph and serves as a good real-world example of a data- as well as compute-intensive workload. To evaluate the impact of both the individual hardware components and their overall organization, we changed the configuration of SwatIII in different ways. Comparing the individual impact of different hardware components (e.g., network, storage, and memory) across clusters, we observed a 70% improvement in the Hadoop-workload and an almost 35% improvement in the Giraph-workload on SwatIII over SuperMikeII by using SSDs (thus increasing the disk I/O rate) and scaling up memory (which increases caching). We then provide significant insight into the efficient and cost-effective organization of these hardware components. Here, the MicroBrick-based CeresII prototype shows performance similar to SuperMikeII while giving a more than 2-times improvement in performance/$ across the entire benchmark test.
Year
2015
DOI
10.1109/BigData.2015.7363750
Venue
Big Data
Keywords
giraph-workload, parallel genome assembler, CeresII, MicroBrick-based hyperscale system, SwatIII, SuperMikeII, HPC-cluster, data-intensive application, big data analytic software, distributed-cyber-infrastructure
Field
Data mining, Computer science, Cyber infrastructure, Workload, Software, Hyperscale, Distributed database, Big data, Benchmark (computing)
DocType
Conference
Citations
5
PageRank
0.46
References
17
Authors
4
Name | Order | Citations | PageRank
Arghya Kusum Das | 1 | 6 | 2.18
Seung-Jong Park | 2 | 319 | 31.12
Jae-Ki Hong | 3 | 9 | 1.53
Wooseok Chang | 4 | 7 | 1.85