Neither more nor less: optimizing thread-level parallelism for GPGPUs - Citegraph

Paper Info

Title
Neither more nor less: optimizing thread-level parallelism for GPGPUs

Abstract
General-purpose graphics processing units (GPGPUs) are at their best in accelerating computation by exploiting abundant thread-level parallelism (TLP) offered by many classes of HPC applications. To facilitate such high TLP, emerging programming models like CUDA and OpenCL allow programmers to create work abstractions in terms of smaller work units, called cooperative thread arrays (CTAs). CTAs are groups of threads and can be executed in any order, thereby providing ample opportunities for TLP. State-of-the-art GPGPU schedulers allocate the maximum possible number of CTAs per core (limited by available on-chip resources) to enhance performance by exploiting TLP. However, we demonstrate in this paper that executing the maximum possible number of CTAs on a core is not always the optimal choice from the performance perspective. A high number of concurrently executing threads might cause more memory requests to be issued and create contention in the caches, network, and memory, leading to long stalls at the cores. To reduce resource contention, we propose a dynamic CTA scheduling mechanism, called DYNCTA, which modulates the TLP by allocating an optimal number of CTAs based on application characteristics. To minimize resource contention, DYNCTA allocates fewer CTAs for applications suffering from high contention in the memory sub-system, compared to applications demonstrating high throughput. Simulation results on a 30-core GPGPU platform with 31 applications show that the proposed CTA scheduler provides a 28% average improvement in performance compared to the existing CTA scheduler.
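
The feedback idea in the abstract (fewer CTAs under heavy memory contention, more when throughput is high) can be illustrated with a minimal sketch. This is not the DYNCTA algorithm from the paper: the function name `adjust_cta_limit`, the `mem_stall_fraction` metric, the thresholds `low` and `high`, and the step size of one CTA per sampling window are all hypothetical choices made only for illustration.

```python
def adjust_cta_limit(current, mem_stall_fraction, max_ctas,
                     low=0.2, high=0.6, min_ctas=1):
    """Return the per-core CTA limit for the next sampling window.

    Hypothetical sketch: the metric and thresholds are assumptions,
    not values taken from the paper.
    """
    if mem_stall_fraction > high:
        # Heavy memory contention: run one fewer CTA on this core.
        return max(min_ctas, current - 1)
    if mem_stall_fraction < low:
        # Memory system has headroom: allow one more CTA.
        return min(max_ctas, current + 1)
    # In between the thresholds: keep the current limit.
    return current


# Example: a memory-bound phase followed by a compute-bound phase.
limit = 8
history = []
for frac in [0.8, 0.7, 0.65, 0.4, 0.1, 0.1]:
    limit = adjust_cta_limit(limit, frac, max_ctas=8)
    history.append(limit)
print(history)  # [7, 6, 5, 5, 6, 7]
```

The limit is clamped between `min_ctas` and `max_ctas`, where `max_ctas` stands in for the resource-limited maximum that a conventional scheduler would always use.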

Year	DOI	Venue
2013	10.1109/PACT.2013.6618813	PACT
Keywords	Field	DocType
graphics processing units,multi-threading,parallel architectures,parallel memories,processor scheduling,resource allocation,CTAs per-core,DYNCTA,GPGPU schedulers,HPC applications,TLP,cache contention,cooperative thread arrays,dynamic CTA scheduling mechanism,general-purpose graphics processing units,memory requests,memory subsystem,resource contention,thread-level parallelism,GPGPUs,scheduling	Multithreading,Programming paradigm,Task parallelism,Computer science,Scheduling (computing),CUDA,Parallel computing,Real-time computing,Thread (computing),Resource allocation,General-purpose computing on graphics processing units	Conference
ISSN	ISBN	Citations
1089-795X	978-1-4799-1021-2	90
PageRank	References	Authors
2.35	28	4

Authors (4 rows)

Name	Order	Citations	PageRank
Onur Kayıran	1	356	13.47
Adwait Jog	2	568	23.32
Mahmut T. Kandemir	3	7371	568.54
Chita R. Das	4	1046	45.21
