Paper Info

Title
YinMem: A distributed parallel indexed in-memory computation system for large scale data analytics

Abstract
Machine learning and graph analytics typically process data iteratively, reading the same data multiple times and sharing intermediate results across the worker nodes in a cluster. Hadoop MapReduce and Spark are two popular open source cluster computing frameworks for large scale data analytics. Apache Spark is currently the state-of-the-art in-memory computation model, extending MapReduce by transforming data into RDDs stored in memory. One limitation of Spark, however, is that data transformation and distribution are implicitly managed by HDFS. Data locality is not guaranteed for iterative machine learning algorithms that read the same data multiple times. For example, data needed for operations on one worker node might reside in RDDs stored on other worker nodes. The resulting data shuffling becomes a bottleneck when such RDDs are read iteratively. We propose YinMem, a parallel distributed indexed in-memory computation system that bridges the gap between the Hadoop ecosystem and HPC by replacing MapReduce with MPI while retaining the advantages of distributed data storage. YinMem achieves fair load balancing prior to computation for large sparse matrices by scheduling and distributing indexed data from a NoSQL database to the RAM of the worker nodes. YinMem explores Alluxio as the in-memory storage system and enables efficient sharing of intermediate results. Preliminary results show that YinMem achieves a 3× speedup over Spark when computing the eigenvalues and eigenvectors of a 16-million scale sparse matrix.
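
The abstract does not say how YinMem schedules indexed data across workers. As an illustration only, the sketch below shows one common way to balance sparse matrix work before computation: assign rows to workers so that each worker holds a similar number of nonzeros. The function name, the greedy heaviest-row-first heuristic, and the input format (a list of per-row nonzero counts) are assumptions for this example and are not taken from the paper.

```python
import heapq


def balance_rows_by_nnz(row_nnz, num_workers):
    """Greedily assign rows to workers so nonzero counts are roughly even.

    Hypothetical helper for illustration; not YinMem's actual scheduler.
    row_nnz[i] is the number of nonzeros in row i of the sparse matrix.
    """
    # Min-heap of (current load, worker id): the lightest worker pops first.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]

    # Place the heaviest rows first, each on the currently lightest worker.
    order = sorted(range(len(row_nnz)), key=lambda r: row_nnz[r], reverse=True)
    for row in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(row)
        heapq.heappush(heap, (load + row_nnz[row], worker))

    loads = [sum(row_nnz[r] for r in rows) for rows in assignment]
    return assignment, loads


if __name__ == "__main__":
    rows_per_worker, loads = balance_rows_by_nnz([9, 1, 1, 1, 4, 4], 2)
    print(rows_per_worker, loads)
```

In an MPI setting, each rank would then load only the rows assigned to it, so per-iteration work in a sparse matrix-vector product is roughly proportional to that rank's nonzero count.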

Year	DOI	Venue
2016	10.1109/BigData.2016.7840607	2016 IEEE International Conference on Big Data (Big Data)
Keywords	Field	DocType
NoSQL database,large sparse matrix,distributed data storage,MPI,HPC,iterative machine learning algorithms,memory computation model,Spark,Hadoop MapReduce,graph analytics,machine learning,large scale data analytics,distributed parallel indexed in-memory computation system,YinMem	Data mining,Spark (mathematics),Data analysis,Computer science,Computer data storage,NoSQL,Artificial intelligence,Sparse matrix,Speedup,Load balancing (computing),Distributed data store,Parallel computing,Machine learning	Conference
ISBN	Citations	PageRank
978-1-4673-9006-4	0	0.34
References	Authors
17	5

Authors (5 rows)

Name	Order	Citations	PageRank
Yin Huang	1	0	0.34
Yelena Yesha	2	1756	253.96
Milton Halem	3	86	29.78
Yaacov Yesha	4	406	58.33
Shujia Zhou	5	216	17.50
