Title
YinMem: A distributed parallel indexed in-memory computation system for large scale data analytics
Abstract
Machine learning and graph analytics typically process data iteratively, reading the same data multiple times and sharing intermediate results across the worker nodes in a cluster. Hadoop MapReduce and Spark are two popular open source cluster computing frameworks for large scale data analytics. Apache Spark is currently the state-of-the-art in-memory computation model, extending MapReduce by transforming data into RDDs stored in memory. One limitation of Spark, however, is that data transformation and distribution are implicitly managed by HDFS. Data locality is therefore not guaranteed for iterative machine learning algorithms that read the same data multiple times. For example, data needed for operations on one worker node might reside in RDDs stored on other worker nodes. The resulting data shuffling becomes a bottleneck when such RDDs are read iteratively. We propose YinMem, a parallel distributed indexed in-memory computation system that bridges the gap between the Hadoop ecosystem and HPC by replacing MapReduce with MPI while retaining the advantages of distributed data storage. YinMem achieves fair load balancing prior to computation for large sparse matrices by scheduling and distributing indexed data from a NoSQL database to the RAM of the worker nodes. YinMem explores Alluxio as the in-memory storage system and enables efficient sharing of intermediate results. Preliminary results show that YinMem achieves a 3× speedup over Spark when computing the eigenvalues and eigenvectors of a 16-million scale sparse matrix.
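The abstract does not describe how YinMem schedules indexed data, so the following is only a minimal sketch of one common way to balance a sparse matrix across workers before computation: a greedy, largest-first assignment of rows by nonzero count. The function name `balance_rows_by_nnz` and the sample nonzero counts are hypothetical and are not taken from the paper.

```python
import heapq


def balance_rows_by_nnz(row_nnz, num_workers):
    """Greedily assign matrix rows to workers so nonzero counts are roughly even.

    Illustrative only: this is a generic largest-first heuristic, not the
    scheduling method used by YinMem (which the abstract does not specify).
    """
    # Min-heap of (current load, worker id): the lightest worker is popped first.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]

    # Place the heaviest rows first, each on the currently lightest worker.
    rows_by_weight = sorted(range(len(row_nnz)),
                            key=lambda r: row_nnz[r], reverse=True)
    for row in rows_by_weight:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(row)
        heapq.heappush(heap, (load + row_nnz[row], worker))

    loads = [sum(row_nnz[r] for r in rows) for rows in assignment]
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical nonzero counts for an 8-row sparse matrix.
    nnz = [9, 1, 1, 1, 8, 2, 2, 4]
    parts, loads = balance_rows_by_nnz(nnz, num_workers=2)
    print(parts, loads)
```

Balancing by nonzero count rather than by row count matters for sparse matrices because the cost of a sparse matrix-vector product, the core step of iterative eigensolvers, scales with the number of nonzeros each worker holds.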
Year
2016
DOI
10.1109/BigData.2016.7840607
Venue
2016 IEEE International Conference on Big Data (Big Data)
Keywords
NoSQL database, large sparse matrix, distributed data storage, MPI, HPC, iterative machine learning algorithms, memory computation model, Spark, Hadoop MapReduce, graph analytics, machine learning, large scale data analytics, distributed parallel indexed in-memory computation system, YinMem
Field
Data mining, Spark (mathematics), Data analysis, Computer science, Computer data storage, NoSQL, Artificial intelligence, Sparse matrix, Speedup, Load balancing (computing), Distributed data store, Parallel computing, Machine learning
DocType
Conference
ISBN
978-1-4673-9006-4
Citations
0
PageRank
0.34
References
17
Authors
5
Name
Order
Citations
PageRank
Yin Huang100.34
Yelena Yesha21756253.96
Milton Halem38629.78
Yaacov Yesha440658.33
Shujia Zhou521617.50