Title
ARC: An Automated Approach to Resiliency for Lossy Compressed Data via Error Correcting Codes
Abstract
Progress in high-performance computing (HPC) systems has led to complex applications that stress the I/O subsystem by creating vast amounts of data. Lossy compression reduces data size considerably, but a single error renders lossy compressed data unusable. This sensitivity stems from the high information content per bit in compressed data and is a critical issue, as soft errors that cause bit-flips have become increasingly commonplace in HPC systems. While many works have improved lossy compressor performance, few have sought to address this critical weakness. This paper presents ARC: Automated Resiliency for Compression. Given user-defined constraints on storage, throughput, and resiliency, ARC automatically determines the optimal error-correcting code (ECC) configuration before encoding data. We conduct an extensive fault injection study to fully understand the effects of soft errors on lossy compressed data and how best to protect it. We evaluate ARC's scalability, performance, resiliency, and ease of use. On a 40-core node, encoding and decoding reach throughputs of up to 3730 MB/s and 3602 MB/s, respectively. ARC also detects and corrects multi-bit errors with a tunable overhead in terms of storage and throughput. Finally, we demonstrate the ease of using ARC and how to account for a system's failure rate when determining the constraints.
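The constraint-driven selection described in the abstract can be pictured with a minimal Python sketch. It is not ARC's actual interface or algorithm: the `EccConfig` type, the `choose_config` function, the candidate names, and every number in `CANDIDATES` are hypothetical placeholders, and "optimal" is assumed here to mean the strongest correction capability among configurations that satisfy all three constraints.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EccConfig:
    """Hypothetical description of one ECC configuration."""
    name: str
    storage_overhead: float   # extra storage as a fraction of the input size
    throughput_mb_s: float    # encoding throughput in MB/s
    correctable_bits: int     # bit errors correctable per protected block


# Placeholder values for illustration only; these are NOT measurements
# from the paper and do not name real ARC configurations.
CANDIDATES = [
    EccConfig("detect-only", 0.02, 3000.0, 0),
    EccConfig("light", 0.15, 1500.0, 1),
    EccConfig("heavy", 0.50, 400.0, 8),
]


def choose_config(
    max_storage_overhead: float,
    min_throughput_mb_s: float,
    min_correctable_bits: int,
    candidates: Sequence[EccConfig] = CANDIDATES,
) -> Optional[EccConfig]:
    """Return the feasible configuration with the strongest correction.

    Ties are broken by lower storage overhead. Returns None when no
    candidate satisfies all three constraints.
    """
    feasible = [
        c for c in candidates
        if c.storage_overhead <= max_storage_overhead
        and c.throughput_mb_s >= min_throughput_mb_s
        and c.correctable_bits >= min_correctable_bits
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c.correctable_bits, -c.storage_overhead))
```

Under these assumed candidates, a call such as `choose_config(0.2, 1000.0, 1)` rules out the detection-only option (too weak) and the heavy option (too much storage), leaving the light configuration. A higher expected failure rate would translate into a larger `min_correctable_bits`, which is one way to read the abstract's remark about considering a system's failure rate when setting constraints.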
Year: 2021
DOI: 10.1145/3431379.3460638
Venue: HPDC
DocType: Conference
Citations: 1
PageRank: 0.35
References: 0
Authors: 4
Name                Order   Citations   PageRank
Dakota Fulp         1       1           0.35
Alexandra Poulos    2       1           0.35
Robert Underwood    3       1           0.68
Jon C. Calhoun      4       3           3.41