Title
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission
Abstract
Data warehousing applications represent an emergent application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high core count architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes a set of compiler optimizations to address these challenges. Inspired in part by loop fusion/fission optimizations in the scientific computing community, we propose kernel fusion and kernel fission. Kernel fusion fuses the code bodies of two GPU kernels to i) eliminate redundant operations across dependent kernels, ii) reduce data movement between GPU registers and GPU memory, iii) reduce data movement between GPU memory and CPU memory, and iv) improve spatial and temporal locality of memory references. Kernel fission partitions a kernel into segments such that segment computations and data transfers between the GPU and host CPU can be overlapped. Fusion and fission can also be applied concurrently to a set of kernels. We empirically evaluate the benefits of fusion/fission on relational algebra operators drawn from the TPC-H benchmark suite. All kernels are implemented in CUDA and the experiments are performed with NVIDIA Fermi GPUs. In general, we observed data throughput improvements ranging from 13.1% to 41.4% for the SELECT operator and queries Q1 and Q21 in the TPC-H benchmark suite. We present key insights, lessons learned, and opportunities for further improvements.
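To make the two transformations concrete, the following CUDA sketch illustrates kernel fusion on a hypothetical SELECT-plus-arithmetic pair loosely patterned on TPC-H Q1. The kernel names, the price/discount columns, and the threshold predicate are illustrative assumptions, not the paper's actual operator implementations.

    #include <cuda_runtime.h>

    // Unfused form: the first kernel materializes its result in GPU memory
    // and the second, dependent kernel reads it back.
    __global__ void select_kernel(const float* price, int* flag, int n, float threshold)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            flag[i] = (price[i] < threshold) ? 1 : 0;   // intermediate written to GPU memory
    }

    __global__ void compute_kernel(const float* price, const float* discount,
                                   const int* flag, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = flag[i] ? (price[i] * (1.0f - discount[i])) : 0.0f;  // intermediate re-read
    }

    // Fused form: the predicate result stays in a register, so the flag[] array,
    // one full pass over GPU memory, and the duplicated index computation are
    // all eliminated.
    __global__ void fused_select_compute(const float* price, const float* discount,
                                         float* out, int n, float threshold)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            bool selected = (price[i] < threshold);     // kept in a register
            out[i] = selected ? (price[i] * (1.0f - discount[i])) : 0.0f;
        }
    }

Kernel fission can be sketched in the same spirit: the host-side driver below partitions the input into chunks and issues the copy-in, kernel launch, and copy-out for each chunk on alternating CUDA streams, so that the PCIe transfer of one chunk overlaps computation on another. The chunk count, the two-stream round-robin, and the placeholder scale_kernel are assumptions for illustration only; in the paper the partitioning is chosen by the compiler.

    #include <cuda_runtime.h>

    __global__ void scale_kernel(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];                      // stand-in for the fissioned kernel body
    }

    // h_in and h_out must be pinned host buffers (cudaHostAlloc) for the
    // asynchronous copies to actually overlap with kernel execution.
    void run_fissioned(const float* h_in, float* h_out, int n, int num_chunks)
    {
        float *d_in = 0, *d_out = 0;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        cudaStream_t streams[2];
        cudaStreamCreate(&streams[0]);
        cudaStreamCreate(&streams[1]);

        int chunk = (n + num_chunks - 1) / num_chunks;
        for (int c = 0; c < num_chunks; ++c) {
            int off = c * chunk;
            if (off >= n) break;
            int len = (off + chunk <= n) ? chunk : (n - off);
            cudaStream_t st = streams[c % 2];

            // copy-in, compute, and copy-out for chunk c; chunk c+1 is issued on
            // the other stream, so its transfers overlap this chunk's kernel.
            cudaMemcpyAsync(d_in + off, h_in + off, len * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            scale_kernel<<<(len + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, len);
            cudaMemcpyAsync(h_out + off, d_out + off, len * sizeof(float),
                            cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();

        cudaStreamDestroy(streams[0]);
        cudaStreamDestroy(streams[1]);
        cudaFree(d_in);
        cudaFree(d_out);
    }

Two streams give simple double buffering, which is typically enough to hide much of the transfer time for a memory-bound operator; more streams mainly help when chunk kernels are short relative to their transfers.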
Year: 2012
DOI: 10.1109/IPDPSW.2012.300
Venue: IPDPS Workshops
Keywords: gpu memory, kernel fission, graphics processing unit, segment computations, gpu, data warehouses, compiler, loop fusion optimization, data throughput improvements, storage management, general purpose gpu, data throughput improvement, parallel architectures, relational algebra, graphics processing units, redundant operation elimination, data warehousing application, tpc-h benchmark suite, relational algebra operators, data transfers, nvidia fermi gpu, optimizing data warehousing applications, relational query processing, host cpu, cuda, gpu kernel, data movement, optimization, kernel fusion, scientific computing community, data movement reduction, compiler optimizations, memory reference temporal locality, data warehousing, optimising compilers, memory reference spatial locality, gpu registers, relational computation processing, data warehousing applications, loop fission optimization, cpu memory, query processing, data transfer, throughput, warehousing, bandwidth, memory management, kernel
Field: Data warehouse, Loop fusion, Locality of reference, Memory hierarchy, Computer science, CUDA, Parallel computing, Optimizing compiler, Memory management, Graphics processing unit, Distributed computing
DocType: Conference
ISSN: 2164-7062
ISBN: 978-1-4673-0974-5
Citations: 28
PageRank: 1.06
References: 19
Authors: 6
Name                      Order  Citations  PageRank
Haicheng Wu               1      204        8.42
Gregory Frederick Diamos  2      1117       51.07
Jin Wang                  3      117        5.80
Srihari Cadambi           4      527        37.06
Sudhakar Yalamanchili     5      1836       184.95
Srimat T. Chakradhar      6      2492       185.94