Title
Java thread and process performance for parallel machine learning on multicore HPC clusters
Abstract
The growing use of Big Data frameworks on large machines highlights the importance of performance issues and the value of High Performance Computing (HPC) technology. This paper looks carefully at three major frameworks, Spark, Flink, and Message Passing Interface (MPI), both in scaling across nodes and internally over the many cores inside modern nodes. We focus on the special challenges of the Java Virtual Machine (JVM) using an Intel Haswell HPC cluster with 24 cores per node. Two parallel machine learning algorithms, K-Means clustering and Multidimensional Scaling (MDS), are used in our performance studies. We identify three major issues (thread models, affinity patterns, and communication mechanisms) that affect performance by large factors and show how to optimize them so that Java can match the performance of traditional HPC languages like C. Further, we suggest approaches that preserve the user interface and elegant dataflow approach of Flink and Spark but modify the runtime so that these Big Data frameworks can achieve excellent performance and realize the goals of HPC-Big Data convergence.
Year: 2016
DOI: 10.1109/BigData.2016.7840622
Venue: 2016 IEEE International Conference on Big Data (Big Data)
Keywords: Big Data, Machine Learning, Java, Multicore, HPC
Field: Spark (mathematics), Supercomputer, Computer science, Parallel computing, Thread (computing), Message Passing Interface, Dataflow, Artificial intelligence, User interface, Multi-core processor, Java, Machine learning
DocType: Conference
ISBN: 978-1-4673-9006-4
Citations: 4
PageRank: 0.44
References: 19
Authors: 4
Name                     Order  Citations  PageRank
Saliya Ekanayake         1      90         9.34
Supun Kamburugamuve      2      75         9.21
Pulasthi Wickramasinghe  3      4          0.78
Geoffrey Fox             4      4070       575.38