Online Detection and Classification of State Transitions of Multivariate Shock and Vibration Data | 0 | 0.34 | 2022 |
Resiliency in numerical algorithm design for extreme scale simulations | 0 | 0.34 | 2022 |
Understanding the Effects of DRAM Correctable Error Logging at Scale | 0 | 0.34 | 2021 |
Thermal neutrons: a possible threat for supercomputer reliability | 1 | 0.40 | 2021 |
Quantifying Server Memory Frequency Margin and Using It to Improve Performance in HPC Systems | 1 | 0.35 | 2021 |
Extreme Protection Against Data Loss with Single-Overlap Declustered Parity | 0 | 0.34 | 2020 |
An Overview of the Risk Posed by Thermal Neutrons to the Reliability of Computing Devices | 0 | 0.34 | 2020 |
Thermal Neutrons: a Possible Threat for Supercomputers and Safety Critical Applications | 1 | 0.41 | 2020 |
TensorFI: A Flexible Fault Injection Framework for TensorFlow Applications | 2 | 0.37 | 2020 |
Chaser: An Enhanced Fault Injection Tool for Tracing Soft Errors in MPI Applications | 0 | 0.34 | 2020 |
Quantifying Memory Underutilization in HPC Systems and Using it to Improve Performance via Architecture Support | 5 | 0.41 | 2019 |
BinFI: an efficient fault injector for safety-critical machine learning systems | 12 | 0.63 | 2019 |
TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs | 3 | 0.39 | 2019 |
Do Solar Proton Events Reduce the Number of Faults in Supercomputers?: A Comparative Analysis of Faults During and without Solar Proton Events | 0 | 0.34 | 2019 |
SaNSA - The Supercomputer and Node State Architecture | 0 | 0.34 | 2018 |
Characterization and Comparison of Application Resilience for Serial and Parallel Executions | 1 | 0.35 | 2018 |
Enhancing HPC System Log Analysis by Identifying Message Origin in Source Code | 0 | 0.34 | 2018 |
Modeling Application Resilience In Large-Scale Parallel Execution | 0 | 0.34 | 2018 |
Using virtualization to quantify power conservation via near-threshold voltage reduction for inherently resilient applications | 1 | 0.35 | 2018 |
Lessons learned from memory errors observed over the lifetime of Cielo | 5 | 0.43 | 2018 |
Improving Application Resilience by Extending Error Correction with Contextual Information | 0 | 0.34 | 2018 |
Physics-Informed Machine Learning for DRAM Error Modeling | 0 | 0.34 | 2018 |
The Atlas Cluster Trace Repository | 0 | 0.34 | 2018 |
RSVP: Soft Error Resilient Power Savings at Near-Threshold Voltage Using Register Vulnerability | 0 | 0.34 | 2017 |
Addressing statistical significance of fault injection: empirical studies of the soft error susceptibility | 1 | 0.35 | 2017 |
Experimental and Analytical Study of Xeon Phi Reliability | 2 | 0.36 | 2017 |
Resilience Analysis of Top K Selection Algorithms | 0 | 0.34 | 2017 |
LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures | 4 | 0.43 | 2017 |
Silent Data Corruption Resilient Two-sided Matrix Factorizations | 6 | 0.42 | 2017 |
Automating DRAM Fault Mitigation By Learning From Experience | 1 | 0.35 | 2017 |
Improving DRAM Fault Characterization through Machine Learning | 4 | 0.42 | 2016 |
SDC is in the Eye of the Beholder: A Survey and Preliminary Study | 2 | 0.36 | 2016 |
Design, Use and Evaluation of P-FSEFI: A Parallel Soft Error Fault Injection Framework for Emulating Soft Errors in Parallel Applications | 1 | 0.35 | 2016 |
On the Inherent Resilience of Integer Operations | 0 | 0.34 | 2016 |
Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra | 9 | 0.48 | 2016 |
Differentiated Failure Remediation with Action Selection for Resilient Computing | 1 | 0.36 | 2015 |
Empirical Studies of the Soft Error Susceptibility of Sorting Algorithms to Statistical Fault Injection | 3 | 0.42 | 2015 |
Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with the CLAMR Hydrodynamics Mini-App | 3 | 0.42 | 2015 |
Memory Errors in Modern Systems: The Good, The Bad, and The Ugly | 75 | 1.67 | 2015 |
On the Non-Suitability of Non-Volatility | 2 | 0.38 | 2015 |
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation | 49 | 1.67 | 2015 |
Harnessing Unreliable Cores in Heterogeneous Architecture: The PyDac Programming Model and Runtime | 0 | 0.34 | 2014 |
F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability | 23 | 1.12 | 2014 |
GPGPUs: how to combine high computational power with high reliability | 14 | 0.85 | 2014 |
Addressing failures in exascale computing | 123 | 3.22 | 2014 |
Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App | 2 | 0.40 | 2014 |
PyDac: A Resilient Run-Time Framework for Divide-and-Conquer Applications on a Heterogeneous Many-Core Architecture | 1 | 0.38 | 2013 |
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults | 69 | 1.91 | 2013 |
Analyzing Reliability of Memory Sub-systems with Double-Chipkill Detect/Correct | 4 | 0.46 | 2013 |
Exploring Time and Frequency Domains for Accurate and Automated Anomaly Detection in Cloud Computing Systems | 4 | 0.43 | 2013 |