AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing | 9 | 0.50 | 2020 |
Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point | 0 | 0.34 | 2020 |
Inside Project Brainwave's Cloud-Scale, Real-Time AI Processor | 0 | 0.34 | 2019 |
ComP-net: command processor networking for efficient intra-kernel communications on GPUs | 0 | 0.34 | 2018 |
A Configurable Cloud-Scale DNN Processor for Real-Time AI | 31 | 1.25 | 2018 |
Generic System Calls for GPUs | 3 | 0.37 | 2018 |
Design and Analysis of an APU for Exascale Computing | 11 | 0.56 | 2017 |
If You Build It, Will They Come? | 3 | 0.36 | 2017 |
GPU triggered networking for intra-kernel communications | 3 | 0.40 | 2017 |
Programming GPGPU Graph Applications with Linear Algebra Building Blocks | 5 | 0.53 | 2017 |
Gravel: fine-grain GPU-initiated network messages | 1 | 0.36 | 2017 |
Extended task queuing: active messages for heterogeneous systems | 3 | 0.39 | 2016 |
Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance | 1 | 0.40 | 2015 |
Achieving Exascale Capabilities through Heterogeneous Computing | 14 | 0.61 | 2015 |
BelRed: Constructing GPGPU graph applications with software building blocks | 2 | 0.43 | 2014 |
Heterogeneous-race-free memory models | 39 | 1.13 | 2014 |
QuickRelease: A throughput-oriented approach to release consistency on GPUs | 28 | 0.94 | 2014 |
Fine-grain task aggregation and coordination on GPUs | 14 | 0.63 | 2014 |
Pannotia: Understanding irregular GPGPU graph applications | 64 | 1.68 | 2013 |
Heterogeneous system coherence for integrated CPU-GPU systems | 47 | 1.40 | 2013 |
The gem5 simulator | 853 | 24.92 | 2011 |
Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication | 25 | 0.94 | 2011 |
Server Designs for Warehouse-Computing Environments | 4 | 0.82 | 2009 |
End-to-end performance forecasting: finding bottlenecks before they happen | 6 | 0.53 | 2009 |
Full-System Critical Path Analysis | 7 | 0.56 | 2008 |
Analysis of hardware prefetching across virtual page boundaries | 4 | 0.46 | 2007 |
The M5 Simulator: Modeling Networked Systems | 458 | 26.59 | 2006 |
Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource | 107 | 6.26 | 2006 |
A unified compressed memory hierarchy | 50 | 2.47 | 2005 |
Exploring the cache design space for large scale CMPs | 42 | 1.71 | 2005 |
Performance Analysis of System Overheads in TCP/IP Workloads | 19 | 1.19 | 2005 |
The soft error problem: an architectural perspective | 202 | 8.52 | 2005 |
How to Fake 1000 Registers | 17 | 0.71 | 2005 |
Reducing the soft-error rate of a high-performance microprocessor | 9 | 0.67 | 2004 |
Cache Scrubbing in Microprocessors: Myth or Necessity? | 67 | 4.92 | 2004 |
A compressed memory hierarchy using an indirect index cache | 17 | 0.87 | 2004 |
The Impact of Resource Partitioning on SMT Processors | 67 | 3.00 | 2003 |
Guided Region Prefetching: A Cooperative Hardware/Software Approach | 0 | 0.34 | 2003 |
Measuring Architectural Vulnerability Factors | 17 | 1.09 | 2003 |
A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor | 460 | 24.57 | 2003 |
A scalable instruction queue design using dependence chains | 64 | 2.44 | 2002 |
Detailed design and evaluation of redundant multi-threading alternatives | 230 | 17.21 | 2002 |
Designing a modern memory hierarchy with hardware prefetching | 22 | 1.42 | 2001 |
Reducing DRAM Latencies with an Integrated Memory Hierarchy Design | 103 | 8.24 | 2001 |
Integrating hardware and software concepts in a microprocessor-based system design lab | 1 | 0.73 | 2000 |
A fully associative software-managed cache design | 79 | 8.10 | 2000 |
Transient fault detection via simultaneous multithreading | 317 | 18.15 | 2000 |
Hardware Support for Flexible Distributed Shared Memory | 1 | 0.41 | 1998 |
Retrospective: Tempest and Typhoon: user-level shared memory | 1 | 0.35 | 1998 |
Decoupled hardware support for distributed shared memory | 41 | 2.53 | 1996 |