Title
Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching.
Abstract
Conventional translation look-aside buffers (TLBs) are required to complete address translation with short latencies, as the address translation is on the critical path of all memory accesses even for L1 cache hits. Such strict TLB latency restrictions limit the TLB capacity, as the latency increase with large TLBs may lower the overall performance even with potential TLB miss reductions. Furthermore, TLBs consume a significant amount of energy as they are accessed for every instruction fetch and data access. To avoid the latency restriction and reduce the energy consumption, virtual caching techniques have been proposed to defer translation to after L1 cache misses. However, an efficient solution for the synonym problem has been a critical issue hindering the wide adoption of virtual caching. Based on the virtual caching concept, this study proposes a hybrid virtual memory architecture extending virtual caching to the entire cache hierarchy, aiming to improve both performance and energy consumption. The hybrid virtual caching uses virtual addresses augmented with address space identifiers (ASID) in the cache hierarchy for common non-synonym addresses. For such non-synonyms, the address translation occurs only after last-level cache (LLC) misses. For uncommon synonym addresses, the addresses are translated to physical addresses with conventional TLBs before L1 cache accesses. To support such hybrid translation, we propose an efficient synonym detection mechanism based on Bloom filters which can identify synonym candidates with few false positives. For large memory applications, delayed translation alone cannot solve the address translation problem, as fixed-granularity delayed TLBs may not scale with the increasing memory requirements. To mitigate the translation scalability problem, this study proposes a delayed many segment translation designed for the hybrid virtual caching. The experimental results show that our approach effectively lowers accesses to the TLBs, leading to significant power savings. In addition, the approach provides performance improvement with scalable delayed translation with variable length segments.
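The abstract's synonym detection hinges on a Bloom filter that flags synonym candidates with no false negatives and few false positives, so non-synonym accesses can bypass the TLB and use the virtual cache hierarchy. The C sketch below is only a rough illustration of that idea under assumed parameters: the filter size, the two hash functions, and the bloom_init / synonym_insert / synonym_maybe helpers are hypothetical and are not taken from the paper's actual hardware design.

/* Minimal sketch of Bloom-filter-based synonym candidate detection.
 * All sizes, hashes, and helper names are illustrative assumptions. */
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define FILTER_BITS 4096                      /* assumed filter size in bits */

typedef struct {
    uint64_t bits[FILTER_BITS / 64];          /* bit array backing the filter */
} bloom_t;

/* Two simple multiplicative hashes over the virtual page number (VPN);
 * each yields an index in [0, FILTER_BITS). */
static uint32_t hash1(uint64_t vpn) { return (uint32_t)((vpn * 0x9E3779B97F4A7C15ULL) >> 52) % FILTER_BITS; }
static uint32_t hash2(uint64_t vpn) { return (uint32_t)((vpn * 0xC2B2AE3D27D4EB4FULL) >> 52) % FILTER_BITS; }

static void bloom_set(bloom_t *f, uint32_t idx)       { f->bits[idx / 64] |= 1ULL << (idx % 64); }
static bool bloom_test(const bloom_t *f, uint32_t idx) { return (f->bits[idx / 64] >> (idx % 64)) & 1; }

void bloom_init(bloom_t *f) { memset(f, 0, sizeof *f); }

/* Record a virtual page known to have a synonym mapping
 * (e.g., when the OS creates an additional mapping to the same frame). */
void synonym_insert(bloom_t *f, uint64_t vpn) {
    bloom_set(f, hash1(vpn));
    bloom_set(f, hash2(vpn));
}

/* Returns true if the page *may* be a synonym: false positives are possible,
 * false negatives are not.  A hit would steer the access through the
 * conventional TLB path; a miss lets it proceed with the ASID-tagged
 * virtual address through the cache hierarchy. */
bool synonym_maybe(const bloom_t *f, uint64_t vpn) {
    return bloom_test(f, hash1(vpn)) && bloom_test(f, hash2(vpn));
}

In this sketch a false positive only costs an unnecessary TLB lookup, never a correctness violation, which mirrors why a small filter with a low false-positive rate suffices for the uncommon synonym case described in the abstract.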
Year
2016
Venue
ISCA
Field
Address space, Bloom filter, Computer science, Cache, Virtual memory, CPU cache, Parallel computing, Cache algorithms, Real-time computing, Cache coloring, Translation lookaside buffer
DocType
Conference
Citations
2
PageRank
0.36
References
9
Authors
3
Name               Order    Citations    PageRank
Chang Hyun Park    1        26           4.15
Taekyung Heo       2        2            0.36
Jaehyuk Huh        3        1008         63.91