Showing 1 - 10 of 17 for search: '"Christopher I. Rodrigues"'
Published in:
Languages and Compilers for Parallel Computing ISBN: 9783030727888
LCPC
In a convolutional neural network (CNN), the convolution layers typically dominate the execution time. Hardware accelerators have been designed to speed up convolution. One class of accelerators provides hardware support for matrix multiplication (mat…
External link:
https://explore.openaire.eu/search/publication?articleId=doi_________::19c05ed8390f238b8335e5c74a6c31fa
https://doi.org/10.1007/978-3-030-72789-5_11
Published in:
WPMVP@PPoPP
Developers often rely on automatic vectorization to speed up fine-grained data-parallel code. However, for loop nests where the loops are shorter than the processor's SIMD width, automatic vectorization performs poorly. Vectorizers attempt to vectori…
Authors:
Wen-mei W. Hwu, Nasser Anssari, Geng (Daniel) Liu, John A. Stratton, Nady Obeid, Christopher I. Rodrigues, Li-Wen Chang, I-Jui Sung
Published in:
Computer. 45:26-32
A study of the implementation patterns among massively threaded applications for many-core GPUs reveals that each of the seven most commonly used algorithm and data optimization techniques can enhance the performance of applicable kernels by 2 to 10…
Published in:
The Journal of Supercomputing. 64:1008-1020
Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc…
Authors:
Xiaohuang Huang, Dennis Lin, Sanjay J. Patel, J. Blackburn, Minh N. Do, Quang Nguyen, Christopher I. Rodrigues, Wen-mei W. Hwu, Thomas S. Huang
Published in:
IEEE Signal Processing Magazine. 26:103-112
In this article, we focus on the applicability of parallel computing architectures to video processing applications. We demonstrate different optimization strategies in detail using the 3-D convolution problem as an example, and show how they affect…
Published in:
Computing in Science & Engineering. 11:16-26
Graphics processing units (GPUs) can provide excellent speedups on some, but not all, general-purpose workloads. Using a set of computational GPU kernels as examples, the authors show how to adapt kernels to utilize the architectural features of a Ge…
Authors:
Shane Ryoo, Sain-Zee Ueng, Wen-mei W. Hwu, Sara S. Baghsorkhi, Christopher I. Rodrigues, John A. Stratton, Sam S. Stone
Published in:
Journal of Parallel and Distributed Computing. 68:1389-1401
Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a lo…
Published in:
MICRO
With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been intro…
Published in:
PPOPP
Functional algorithmic skeletons promise a high-level programming interface for distributed-memory clusters that frees developers from concerns of task decomposition, scheduling, and communication. Unfortunately, prior distributed functional skeleton…
Authors:
Nady Obeid, Nasser Anssari, Li-Wen Chang, John A. Stratton, I-Jui Sung, Christopher I. Rodrigues, Geng Daniel Liu, Wen-mei W. Hwu
Published in:
2012 Innovative Parallel Computing (InPar).
It is unquestionable that successive hardware generations have significantly improved GPU computing workload performance over the last several years. Moore's law and DRAM scaling have respectively increased single-chip peak instruction throughput by…