Paper Info

Title
A software scheduling solution to avoid corrupted units on GPUs.

Abstract
Massively parallel processors provide high computing performance by increasing the number of concurrent execution units. Meanwhile, transistor technology is evolving toward higher density, higher frequency, and lower voltage. Together, these factors significantly increase the probability of hardware failures. In this paper, we present a methodology to locate and mitigate hardware failures on Nvidia GPUs. Results show that intermittent errors can be precisely localized and that their impact is limited to a well-defined architecture tile. We therefore propose, and demonstrate with a software prototype, a rescheduling strategy that quarantines the defective hardware and ensures correct execution. Our approach significantly improves GPU fault tolerance and lifespan at a reasonable overhead.

Year	2016
DOI	10.1016/j.jpdc.2016.01.001
Venue	Journal of Parallel and Distributed Computing
Keywords	Reliability, GPGPU, Intermittent error, Scheduling, Fault tolerance
Field	Computer science, Massively parallel, Scheduling (computing), Parallel computing, Voltage, Software, Fault tolerance, General-purpose computing on graphics processing units, Transistor, Distributed computing, Embedded system
DocType	Journal
Volume	90
ISSN	0743-7315
Citations	1
PageRank	0.40
References	16
Authors	2

Authors (2 rows)

Name	Order	Citations	PageRank
David Defour	1	131	18.28
Eric Petit	2	58	12.73
