Title
Active memory operations
Abstract
The performance of memory-intensive applications is often limited by how fast the memory system can provide needed data. For local memory, the speed gap between the CPU and DRAMs leads to significant stalls when there is not enough locality in the applications' memory references for caches to be effective. For remote memory, the increasing network latency, measured in processor clock periods, makes internode communication inordinately expensive. Caching is the standard solution to this problem, but for applications with poor memory locality, caches do not improve performance. Moreover, in a cache-coherent, non-uniform memory access (cc-NUMA) system, the multiple nonoverlapping network latencies dictated by a write-invalidate coherence protocol often exacerbate the memory latency problem. Bisection bandwidth in large-scale distributed shared memory (DSM) systems is also a limiting factor for data-intensive parallel applications. As a result, reducing local memory latency, remote coherence traffic, and the number of internode data transfers is essential for multiprocessor systems to scale effectively. In general, moving data through the memory system and memory hierarchy into caches, only to evict it from the processor core later, is inefficient if the data are not reused sufficiently. To attack this problem, we propose Active Memory Operations (AMOs), in which select operations can be sent to and executed on the data's home memory controller. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. We present an implementation of AMOs that is cache-coherent and requires no changes to the processor core or DRAM chips. In this dissertation, we present architectural and programming models for AMOs, and compare their performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50X faster barriers, 12X faster spinlocks, 8.5X-15X faster stream/array operations, and 3X faster sequential-scan database queries. We further show that this impressive performance can be provided with little chip overhead. Based on a standard cell implementation, the circuitry required to support AMOs is predicted to occupy less than 1% of the typical chip area of a high-performance microprocessor. AMOs are more energy-efficient than current mainstream microprocessors for AMO-optimized applications, and offer significant power-saving opportunities.
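The sketch below, in C, illustrates the contrast the abstract describes: a conventional read-modify-write pulls the cache line into the requesting core (triggering invalidations and line transfers under a write-invalidate protocol), whereas an AMO conceptually ships a small command to the data's home memory controller and only a scalar result returns. The amo_fetch_add() helper is a hypothetical name, not an API from the paper; it is emulated in plain software here so the example compiles, under the assumption that real AMO hardware would perform the same read-modify-write at the home node.

```c
/* Sketch: conventional cache-based update vs. an AMO-style update
 * offloaded to the data's home memory controller.
 * amo_fetch_add() is a hypothetical intrinsic, emulated here. */
#include <stdint.h>
#include <stdio.h>

/* Conventional path: the line holding *addr migrates into this core's
 * cache (invalidating remote copies), is modified, and is later
 * written back or forwarded to the next writer. */
static uint64_t cached_fetch_add(uint64_t *addr, uint64_t delta) {
    uint64_t old = *addr;   /* line fetched into the local cache */
    *addr = old + delta;    /* modified and dirtied locally      */
    return old;
}

/* AMO-style path (software emulation): conceptually the add executes
 * at the memory controller that owns addr, so only the command and
 * the scalar result cross the interconnect -- no line movement. */
static uint64_t amo_fetch_add(uint64_t *addr, uint64_t delta) {
    uint64_t old = *addr;   /* on AMO hardware this read-modify-write */
    *addr = old + delta;    /* would happen at the home node          */
    return old;
}

int main(void) {
    uint64_t counter = 0;

    /* A shared barrier or reduction counter touched by many nodes:
     * with AMOs each arrival costs one round trip to the home node
     * instead of a chain of invalidations and cache-line transfers. */
    cached_fetch_add(&counter, 1);
    amo_fetch_add(&counter, 1);

    printf("counter = %llu\n", (unsigned long long)counter);
    return 0;
}
```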
Year
DOI
Venue
2007
10.1145/1274971.1275004
Proceedings of the 21st annual international conference on Supercomputing
Keywords
DocType
ISBN
memory system, distributed shared memory, stream processing, memory latency problem, home memory controller, dramatic performance improvement, memory architecture, dram, memory hierarchy, memory performance, local memory, active memory operation, memory reference, remote memory latency, faster barrier, local memory latency, faster database query, large-scale shared memory system, main memory latency, internode memory traffic, thread synchronization, faster spinlocks, non-uniform memory access, cache coherence
Conference
0-542-57792-5
Citations 
PageRank 
References 
14
0.61
86
Authors
2
Name | Order | Citations | PageRank
John B. Carter | 1 | 1785 | 162.82
Zhen Fang | 2 | 49 | 4.45