On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: A Case Study in Distributed Structural Biology Using MapReduce - Citegraph

Paper Info

Title
On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: A Case Study in Distributed Structural Biology Using MapReduce

Abstract
In this paper, we present two variations of a general analysis algorithm for large datasets residing in distributed memory systems. Both variations avoid moving data among nodes: they extract relevant data properties locally and concurrently, and transform the analysis problem (e.g., clustering or classification) into a search for property aggregates. We test the two variations using SDSC's Gordon supercomputer, the MapReduce-MPI library, and a structural biology dataset of 100 million protein-ligand records. We evaluate both variations for their sensitivity to data distribution and load imbalance. Our observations indicate that the first variation is sensitive to data content and distribution, while the second is not. Moreover, the second variation can self-heal load imbalance, and it outperforms the first in all fifteen cases considered.
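
The idea of extracting properties locally and then searching over their aggregates can be illustrated with a minimal single-process sketch. It is not the paper's implementation and does not use the MapReduce-MPI library; the function names (`extract_property`, `local_map`, `aggregate`, `densest_cell`), the binning property, and the toy numeric records are all assumptions made for illustration. Each inner list stands in for the data held by one node, and only small per-node count summaries are combined.

```python
from collections import Counter
from functools import reduce


def extract_property(record, bin_width=1.0):
    """Hypothetical property: the coarse bin a numeric record falls into."""
    return int(record // bin_width)


def local_map(partition, bin_width=1.0):
    """Runs where the partition lives; returns compact counts, not records."""
    return Counter(extract_property(r, bin_width) for r in partition)


def aggregate(local_counts):
    """Merge per-node summaries; only these small counters would cross nodes."""
    return reduce(lambda a, b: a + b, local_counts, Counter())


def densest_cell(partitions, bin_width=1.0):
    """Search the property aggregates for the most populated bin."""
    totals = aggregate(local_map(p, bin_width) for p in partitions)
    return totals.most_common(1)[0]


# Three toy "nodes", each holding its own records (assumed data).
partitions = [[0.1, 0.4, 2.3], [0.7, 2.9, 5.5], [0.2, 0.9]]
cell, count = densest_cell(partitions)
print(cell, count)
```

In this sketch the raw records never leave their partition; the analysis step operates only on the merged counters, which is the general pattern the abstract describes.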

Year	2013
DOI	10.1109/CSE.2013.28
Venue	C3S2E
Keywords	memory system, analysis problem, data content, big data, large datasets, case study, relevant data property, data distribution, mapreduce-mpi library, general analysis algorithm, fifteen case, load imbalance, structural biology, efficiently capturing scientific properties, data analysis, distributed databases
Field	Data mining, First variation, Supercomputer, Computer science, Structural biology, Distributed database, Cluster analysis, Big data, Distributed memory systems, Data content
DocType	Conference
ISSN	1949-0828
Citations	5
PageRank	0.49
References	8
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Boyu Zhang	1	71	17.54
Trilce Estrada	2	120	18.27
Pietro Cicotti	3	101	14.52
Michela Taufer	4	352	53.04