Title
Performance of Windows Multicore Systems on Threading and MPI
Abstract
We present performance results on a Windows cluster with up to 768 cores using MPI and two variants of threading: CCR and TPL. CCR (Concurrency and Coordination Runtime) presents a message-based interface, while TPL (Task Parallel Library) allows loops to be automatically parallelized. MPI is used between the cluster nodes (up to 32), and either threading or MPI provides parallelism on the 24 cores of each node. We use a simple matrix multiplication kernel as well as a significant bioinformatics gene clustering application. We find that the two threading models offer similar performance, with MPI outperforming both at low levels of parallelism but threading doing much better when the grain size (problem size per process) is small. We find better performance on Intel than on AMD on comparable 24-core systems. We develop simple models for the performance of the clustering code.

Keywords: Multicore, Performance, Threading, MPI, Windows

I. INTRODUCTION

Multicore technology is still rapidly changing at both the hardware and software levels, so it is challenging to understand how to achieve good performance, especially with clusters, where one needs to consider both distributed-memory and shared-memory issues. In this paper we look at both MPI and threading approaches to parallelism for a significant production datamining code running on a 768-core Windows cluster. Efficient use of this code requires a hybrid programming paradigm mixing threading and MPI. Here we quantify this and compare the threading model CCR (Concurrency and Coordination Runtime), used for the last three years, with Microsoft's new TPL (Task Parallel Library). Section II briefly presents the clustering application used in this paper, while Section III summarizes the three approaches to parallelism used here: CCR, TPL, and MPI. Section IV is the heart of the paper and looks at the performance of the clustering application with the different software models and as a function of dataset size.
We identify the major sources of parallel overhead, of which the most important is the usual synchronization and communication overhead. We compare the measured performance with simple one- and two-factor models, which describe most of the performance data well. Both CCR and the newer TPL perform similarly. In Section V, we extend the study to a matrix multiplication kernel running on single-node Intel and AMD 24-core systems, where CCR outperforms TPL. Section VI has conclusions. In this paper we mainly use the Tempest cluster, which has 32 nodes, each made up of four Intel Xeon E7450 CPUs at 2.40 GHz with 6 cores. Each node has 48 GB of memory and is connected by 20 Gbps Infiniband. In Section V, we compare with a single AMD machine made up of four AMD Opteron 8356 2.3 GHz chips with 6 cores. This machine has 16 GB of memory. All machines run Microsoft Windows HPC Server 2008 (Service Pack 1), 64-bit. Note that all software was written in C# and runs in the .NET 3.5 or .NET 4.0 (beta 2) environments.
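To illustrate the "loops automatically parallelized" model that TPL provides, here is a minimal sketch (not taken from the paper; the matrix size and variable names are illustrative) of parallelizing the outer loop of a matrix multiplication kernel with TPL's Parallel.For:

```csharp
using System;
using System.Threading.Tasks;

class MatMulSketch
{
    static void Main()
    {
        const int n = 256;                // illustrative problem size
        var a = new double[n, n];
        var b = new double[n, n];
        var c = new double[n, n];
        var rand = new Random(42);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
            {
                a[i, j] = rand.NextDouble();
                b[i, j] = rand.NextDouble();
            }

        // TPL distributes the iterations of the outer row loop over the
        // available cores; each row of c is written by exactly one task,
        // so no locking is needed.
        Parallel.For(0, n, i =>
        {
            for (int j = 0; j < n; j++)
            {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i, k] * b[k, j];
                c[i, j] = sum;
            }
        });

        Console.WriteLine($"c[0,0] = {c[0, 0]}");
    }
}
```

The grain size here is one row of the result per task; as the paper's measurements suggest, runtime-scheduled threading of this kind tends to win over MPI within a node precisely when that grain size becomes small.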
Year
2012
DOI
10.1109/CCGRID.2010.105
Venue
Concurrency and Computation: Practice and Experience
Keywords
simple matrix multiplication kernel, threading model, similar performance, problem size, cluster node, clustering code, present performance result, grain size, windows cluster, windows multicore systems, better performance, concurrent computing, concurrency control, data models, grid computing, message passing interface, automatic parallelization, performance, multicore processing, software performance, matrix multiplication, task parallel library, bioinformatics, factor model, message passing, parallel processing, application software, multi threading, chip, programming paradigm, gene cluster, mpi, multicore, annealing, kernel, threading, shared memory
Field
Multithreading, Parallel Extensions, Computer science, Parallel computing, Message Passing Interface, Concurrent computing, Concurrency and Coordination Runtime, Cluster analysis, Multi-core processor, Message passing, Distributed computing
DocType
Journal
Volume
24
Issue
1
ISBN
978-1-4244-6987-1
Citations
3
PageRank
0.46
References
5
Authors
5

Name              Order  Citations  PageRank
Judy Qiu          1      743        43.25
Scott Beason      2      78         6.04
Seung-Hee Bae     3      571        31.67
Saliya Ekanayake  4      90         9.34
Geoffrey Fox      5      40705      75.38