Abstract |
---|
As core counts increase and heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly efficient concurrent algorithms and data structures can become bottlenecks unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput, linearizable, blocking, concurrent FIFO queue for many-core architectures that avoids the bottlenecks and pitfalls common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to run up to three orders of magnitude (1000x) faster than lock-free and combining queues on GPU platforms and two times (2x) faster on CPU devices. These results deliver critical insights into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can increase throughput. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1145/2668930.2688048 | ICPE |
Keywords | Field | DocType
---|---|---|
heterogeneous, gpu, queue, parallel programming, many-core, accelerator, high performance computing, algorithms | Data structure, Supercomputer, Computer science, Parallel computing, Queue, Thread (computing), FIFO queue, Throughput, Scalability, Distributed computing | Conference
Citations | PageRank | References
---|---|---|
8 | 0.56 | 13
Authors |
---|
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Thomas R. W. Scogland | 1 | 87 | 7.56 |
Wu-chun Feng | 2 | 2812 | 232.50 |