Title
Reengineering High-throughput Molecular Datasets for Scalable Clustering Using MapReduce
Abstract
We propose a linear clustering approach for large datasets of molecular geometries produced by high-throughput molecular dynamics simulations (e.g., protein folding and protein-ligand docking simulations). To this end, we transform each three-dimensional (3D) molecular conformation into a single point in 3D space, reducing the space complexity while still encoding the molecular similarities and geometries. We assign an identifier to each 3D point that maps a docked ligand, generate a tree from the whole space, and apply tree-based clustering on the reduced conformation space to identify the densest hyperspaces. We adapt our method for MapReduce and implement it in Hadoop. The load balancing, fault tolerance, and scalability of MapReduce allow screening of very large conformation datasets that are not approachable with traditional clustering methods. We analyze results for datasets with different concentrations of optimal solutions and draw conclusions about the limitations and usability of our method. The advantages of this approach make it attractive for complex applications in real-world high-throughput molecular simulations.
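The tree-based density search described in the abstract can be illustrated with a minimal sketch. It is not the authors' Hadoop implementation. It assumes the conformations have already been reduced to 3D points (the abstract does not specify that projection, so it is omitted), that the tree is an octree over a cube [lo, hi]^3, and that the search descends level by level into the most populated octant. All function names and parameters (octant_key, map_phase, reduce_phase, densest_octant, min_points, max_depth) are hypothetical; the map and reduce steps are plain Python functions standing in for MapReduce tasks.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, float, float]


def octant_key(point: Point, lo: float, hi: float, depth: int) -> str:
    """Locate a point in the cube [lo, hi]^3 as a string of octant digits (0-7)."""
    mins = [lo, lo, lo]
    maxs = [hi, hi, hi]
    digits = []
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (mins[axis] + maxs[axis]) / 2.0
            if point[axis] >= mid:
                digit |= 1 << axis   # upper half along this axis
                mins[axis] = mid
            else:
                maxs[axis] = mid
        digits.append(str(digit))
    return "".join(digits)


def map_phase(points: Iterable[Point], lo: float, hi: float,
              depth: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (octant key, 1) for every point."""
    for p in points:
        yield octant_key(p, lo, hi, depth), 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the counts emitted for each octant key."""
    counts: Counter = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def densest_octant(points: List[Point], lo: float, hi: float,
                   min_points: int, max_depth: int) -> Tuple[str, int]:
    """Descend into the most populated octant until it holds fewer than
    min_points points or max_depth is reached (assumed stopping rule)."""
    if not points:
        return "", 0
    best_key, best_count = "", len(points)
    for depth in range(1, max_depth + 1):
        counts = reduce_phase(map_phase(points, lo, hi, depth))
        # Only children of the current densest octant are candidates.
        children = {k: c for k, c in counts.items() if k.startswith(best_key)}
        key, count = max(children.items(), key=lambda kv: kv[1])
        if count < min_points:
            break
        best_key, best_count = key, count
    return best_key, best_count


if __name__ == "__main__":
    sample = [(0.90, 0.90, 0.90), (0.91, 0.92, 0.93), (0.88, 0.90, 0.95),
              (0.10, 0.10, 0.10), (0.60, 0.20, 0.30)]
    print(densest_octant(sample, 0.0, 1.0, min_points=2, max_depth=2))
```

Because each point is keyed independently in the map step and the counts are simply summed in the reduce step, one pass over the data per tree level is enough in this sketch, which is the property that makes such a search a natural fit for MapReduce.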
Year
2012
DOI
10.1109/HPCC.2012.54
Venue
HPCC-ICESS
Keywords
scalable clustering, molecular conformation, reengineering high-throughput molecular datasets, large conformation, high-throughput molecular dynamics simulation, whole space, real-world high-throughput molecular simulation, large datasets, reduced conformation space, molecular geometries, molecular similarity, space complexity, molecular docking, computational complexity, computational modeling, tree data structures, proteins, geometry, solid modeling, distributed programming, resource allocation, scalability, octree, software fault tolerance, fault tolerance, load balancing, clustering algorithms, computational geometry, public domain software
Field
Data mining, Identifier, Load balancing (computing), Computer science, Parallel computing, Tree (data structure), Theoretical computer science, Cluster analysis, Scalability, Encoding (memory), Octree, Computational complexity theory
DocType
Conference
ISSN
2576-3504
Citations
6
PageRank
0.51
References
9
Authors
5
Name              Order   Citations   PageRank
Trilce Estrada    1       120         18.27
Boyu Zhang        2       71          17.54
Michela Taufer    3       352         53.04
Pietro Cicotti    4       101         14.52
Roger Armen       5       26          2.87