Abstract | ||
---|---|---|
The OpenCL standard allows targeting a large variety of CPU, GPU, and accelerator architectures through a single unified programming interface and language. However, this portability relies heavily on platform-specific implementations. In this paper, we present an OpenCL implementation for an ARMv8 multi-core CPU that efficiently maps the generic OpenCL platform model to the ARMv8 multi-core architecture. With this implementation, we first characterize the maximum achieved arithmetic throughput and memory access bandwidth on the architecture, and measure the OpenCL-related overheads. Our results demonstrate that there is room for improving OpenCL kernel performance. We then compare the performance of OpenCL against serial and OpenMP codes using 11 benchmarks. The experimental results show that (1) the OpenCL implementation can achieve an average speedup of 6x over its OpenMP counterpart, and (2) GPU-specific OpenCL codes are often unsuitable for this ARMv8 multi-core CPU. |
Year | DOI | Venue |
---|---|---|
2017 | 10.1109/ISPA/IUCC.2017.00131 | 2017 15TH IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS AND 2017 16TH IEEE INTERNATIONAL CONFERENCE ON UBIQUITOUS COMPUTING AND COMMUNICATIONS (ISPA/IUCC 2017) |
Keywords | Field | DocType |
OpenCL, FT-1500A, performance, programming | Kernel (linear algebra), Computer science, Parallel computing, Implementation, Bandwidth (signal processing), Human–computer interaction, Software portability, Throughput, Multi-core processor, Speedup | Conference |
ISSN | Citations | PageRank |
2158-9178 | 1 | 0.35 |
References | Authors | |
0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jianbin Fang | 1 | 265 | 25.31 |
Peng Zhang | 2 | 48 | 5.09 |
Tao Tang | 3 | 42 | 7.44 |
Chun Huang | 4 | 13 | 8.00 |
Canqun Yang | 5 | 188 | 29.39 |