Title
On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: A Case Study in Distributed Structural Biology Using MapReduce
Abstract
In this paper, we present two variations of a general analysis algorithm for large datasets residing in distributed memory systems. Both variations avoid the need to move data among nodes because they extract relevant data properties locally and concurrently and transform the analysis problem (e.g., clustering or classification) into a search for property aggregates. We test the two variations using SDSC's Gordon supercomputer, the MapReduce-MPI library, and a structural biology dataset of 100 million protein-ligand records. We evaluate both variations for their sensitivity to data distribution and load imbalance. Our observations indicate that the first variation is sensitive to data content and distribution, while the second variation is not. Moreover, the second variation can self-heal load imbalance, and it outperforms the first in all fifteen cases considered.
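The general idea in the abstract (extract a property from each record locally, then search the aggregated properties rather than the raw data) can be illustrated with a minimal single-process sketch. This is not the authors' implementation and does not use the MapReduce-MPI library. The property used here (a coarse grid cell per record), the function names, and the `cell_size` parameter are all hypothetical, chosen only to show the pattern.

```python
from collections import Counter
from functools import reduce


def extract_property(record, cell_size=1.0):
    """Map one record (a tuple of floats) to a coarse grid cell.

    The grid cell is a hypothetical stand-in for a 'relevant data property'.
    """
    return tuple(int(v // cell_size) for v in record)


def local_map(partition, cell_size=1.0):
    """Simulates the work of one node: count records per property cell.

    Only these compact counts would leave the node, never the raw records.
    """
    return Counter(extract_property(r, cell_size) for r in partition)


def merge_counts(a, b):
    """Reduce step: combine the property counts of two nodes."""
    return a + b


def densest_cell(partitions, cell_size=1.0):
    """Search the property aggregates for the most populated cell."""
    local_counts = [local_map(p, cell_size) for p in partitions]
    total = reduce(merge_counts, local_counts, Counter())
    return total.most_common(1)[0]


# Two lists stand in for the data held by two nodes (assumed toy data).
partitions = [
    [(0.1, 0.2), (0.4, 0.9), (5.0, 5.5)],
    [(0.7, 0.3), (5.2, 5.1)],
]
print(densest_cell(partitions))  # ((0, 0), 3)
```

In this sketch the clustering question becomes a lookup over small per-node summaries, which is why no record has to move between partitions. A real distributed run would replace the list comprehension with a parallel map phase.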
Year: 2013
DOI: 10.1109/CSE.2013.28
Venue: C3S2E
Keywords: memory system, analysis problem, data content, big data, large datasets, case study, relevant data property, data distribution, mapreduce-mpi library, general analysis algorithm, fifteen case, load imbalance, structural biology, efficiently capturing scientific properties, data analysis, distributed databases
Field: Data mining, First variation, Supercomputer, Computer science, Structural biology, Distributed database, Cluster analysis, Big data, Distributed memory systems, Data content
DocType: Conference
ISSN: 1949-0828
Citations: 5
PageRank: 0.49
References: 8
Authors: 4
Name             Order  Citations  PageRank
Boyu Zhang       1      71         17.54
Trilce Estrada   2      120        18.27
Pietro Cicotti   3      101        14.52
Michela Taufer   4      352        53.04