Title | ||
---|---|---|
On Efficiently Capturing Scientific Properties in Distributed Big Data without Moving the Data: A Case Study in Distributed Structural Biology Using MapReduce |
Abstract | ||
---|---|---|
In this paper, we present two variations of a general analysis algorithm for large datasets residing in distributed memory systems. Both variations avoid moving data among nodes: they extract relevant data properties locally and concurrently, and they transform the analysis problem (e.g., clustering or classification) into a search for property aggregates. We test the two variations using SDSC's Gordon supercomputer, the MapReduce-MPI library, and a structural biology dataset of 100 million protein-ligand records. We evaluate both variations for their sensitivity to data distribution and load imbalance. Our observations indicate that the first variation is sensitive to data content and distribution, while the second variation is not. Moreover, the second variation can self-heal load imbalance, and it outperforms the first in all fifteen cases considered. |
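
The idea of extracting properties locally and then searching the aggregates can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the MapReduce-MPI library; the grid-cell property, the function names (`extract_property`, `local_map`, `global_reduce`), and the toy two-dimensional records are all assumptions made for illustration. Each simulated node reduces its own records to a small property-to-count summary, and only those summaries are combined.

```python
from collections import Counter

def extract_property(record, cell=1.0):
    """Map one record (a tuple of floats) to a coarse grid cell.
    The grid cell is a hypothetical stand-in for a 'data property'."""
    return tuple(int(x // cell) for x in record)

def local_map(partition, cell=1.0):
    """Runs on each node: summarise local records as property -> count."""
    return Counter(extract_property(r, cell) for r in partition)

def global_reduce(local_counts):
    """Combine the small per-node summaries; raw records stay on their node."""
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return total

# Three simulated nodes, each holding its own (made-up) records.
partitions = [
    [(0.1, 0.2), (0.4, 0.9), (5.2, 5.1)],
    [(0.3, 0.3), (5.5, 5.9)],
    [(0.7, 0.1), (9.0, 9.0)],
]

aggregates = global_reduce(local_map(p) for p in partitions)

# "Search for property aggregates": here, simply the densest cell.
densest_cell, size = aggregates.most_common(1)[0]
print(densest_cell, size)
```

In this sketch, only the per-node `Counter` objects cross node boundaries, so the communicated volume depends on the number of distinct properties rather than on the number of records.
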
Year | DOI | Venue |
---|---|---|
2013 | 10.1109/CSE.2013.28 | C3S2E |
Keywords | Field | DocType |
memory system,analysis problem,data content,big data,large datasets,case study,relevant data property,data distribution,mapreduce-mpi library,general analysis algorithm,fifteen case,load imbalance,structural biology,efficiently capturing scientific properties,data analysis,distributed databases | Data mining,First variation,Supercomputer,Computer science,Structural biology,Distributed database,Cluster analysis,Big data,Distributed memory systems,Data content | Conference |
ISSN | Citations | PageRank |
1949-0828 | 5 | 0.49 |
References | Authors | |
8 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Boyu Zhang | 1 | 71 | 17.54 |
Trilce Estrada | 2 | 120 | 18.27 |
Pietro Cicotti | 3 | 101 | 14.52 |
Michela Taufer | 4 | 352 | 53.04 |