A data-centric optimization framework for machine learning | 0 | 0.34 | 2022 |
Lifting C semantics for dataflow optimization | 0 | 0.34 | 2022 |
Clairvoyant prefetching for distributed machine learning I/O | 2 | 0.37 | 2021 |
On the parallel I/O optimality of linear algebra kernels: near-optimal matrix factorizations | 0 | 0.34 | 2021 |
Pebbles, Graphs, and a Pinch of Combinatorics: Towards Tight I/O Lower Bounds for Statically Analyzable Programs | 0 | 0.34 | 2021 |
Breaking (Global) Barriers in Parallel Stochastic Optimization With Wait-Avoiding Group Averaging | 1 | 0.36 | 2021 |
StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems | 0 | 0.34 | 2021 |
On the parallel I/O optimality of linear algebra kernels: near-optimal LU factorization | 0 | 0.34 | 2021 |
Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks | 0 | 0.34 | 2021 |
ProGraML: A Graph-Based Program Representation For Data Flow Analysis And Compiler Optimizations | 0 | 0.34 | 2021 |
Workflows are the New Applications: Challenges in Performance, Portability, and Productivity | 1 | 0.35 | 2020 |
Augment Your Batch: Improving Generalization Through Instance Repetition | 0 | 0.34 | 2020 |
Substream-Centric Maximum Matchings on FPGA | 0 | 0.34 | 2020 |
Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing | 0 | 0.34 | 2020 |
Taming unbalanced training workloads in deep learning with partial collective operations | 4 | 0.46 | 2020 |
Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures | 1 | 0.37 | 2019 |
A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning | 3 | 0.46 | 2019 |
Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs | 0 | 0.34 | 2019 |
Substream-Centric Maximum Matchings on FPGA | 1 | 0.40 | 2019 |
Augment your batch: better training with larger batches | 1 | 0.35 | 2019 |
Graph Processing on FPGAs: Taxonomy, Survey, Challenges | 0 | 0.34 | 2019 |
Optimizing the data movement in quantum transport simulations via data-centric parallel programming | 0 | 0.34 | 2019 |
Accelerating Deep Learning Frameworks with Micro-Batches | 3 | 0.40 | 2018 |
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis | 46 | 1.44 | 2018 |
Neural Code Comprehension: A Learnable Representation of Code Semantics | 1 | 0.35 | 2018 |
Optimizing Parallel Graph Connectivity Computation via Subgraph Sampling | 4 | 0.40 | 2018 |
Big data causing big (TLB) problems: taming random memory accesses on the GPU | 2 | 0.37 | 2017 |
Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations | 20 | 0.61 | 2017 |
Adaptive Work-Efficient Connected Components on the GPU | 0 | 0.34 | 2016 |
Spline-based parallel nonlinear optimization of function sequences | 0 | 0.34 | 2016 |
Reciprocal Grids: A Hierarchical Algorithm for Computing Solution X-ray Scattering Curves from Supramolecular Complexes at High Resolution | 0 | 0.34 | 2016 |
Memory access patterns: the missing piece of the multi-GPU puzzle | 14 | 0.61 | 2015 |
Design and implementation of a generic resource sharing virtual time dispatcher | 3 | 0.51 | 2010 |
A global scheduling framework for virtualization environments | 9 | 0.66 | 2009 |