Performance engineering for HEVC transform and quantization kernel on GPUs

Authors: Igor Piljić, Mate Cobrnic, Mario Kovač, Leon Dragić, Alen Duspara
Language: English
Publication year: 2020
Subjects:
0209 industrial biotechnology
General Computer Science
Computer science
matrix multiplication
lcsh:Automation
ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
lcsh:Control engineering systems. Automatic machinery (General)
High resolution
02 engineering and technology
Integer discrete cosine transform (DCT)
high efficiency video coding (HEVC)
Graphics processor unit (GPU)
compute unified device architecture (CUDA)
lcsh:TJ212-225
020901 industrial engineering & automation
0202 electrical engineering
electronic engineering
information engineering
lcsh:T59.5
020208 electrical & electronic engineering
Kernel (image processing)
Computer engineering
Control and Systems Engineering
Performance engineering
Coding (social sciences)
Source: Automatika : časopis za automatiku, mjerenje, elektroniku, računarstvo i komunikacije
Volume 61
Issue 3
Automatika, Vol 61, Iss 3, Pp 325-333 (2020)
ISSN: 0005-1144
1848-3380
Description: Continuous growth of video traffic and video services, especially in the field of high-resolution and high-quality video content, places heavy demands on video coding and its implementations. The High Efficiency Video Coding (HEVC) standard doubles the compression efficiency of its predecessor, H.264/AVC, at the cost of high computational complexity. To address this computational cost, high-performance video processing takes advantage of heterogeneous multiprocessor platforms. In this paper, we present a highly performance-optimized HEVC transform and quantization kernel with all-zero-block (AZB) identification, designed for execution on a Graphics Processor Unit (GPU). The performance optimization strategy addresses all three aspects of parallel design: exposing as much of the application's intrinsic parallelism as possible, exploiting high-throughput memory, and using instructions efficiently. It combines efficient mapping of transform blocks to thread blocks with efficient vectorized access patterns to shared memory for all transform sizes supported in the standard. Two different GPUs of the same architecture were used to evaluate the proposed implementation. The achieved processing times are 6.03 ms and 23.94 ms for DCI 4K and 8K Full Format, respectively. Speedups over CPU, cuBLAS and AVX2 implementations are up to 80, 19 and 4 times, respectively. The proposed implementation outperforms previous work by a factor of 1.22.
Database: OpenAIRE