Clockwork: Resource-Efficient Static Scheduling for Multi-Rate Image Processing Applications on FPGAs
Author: | Dillon Huff, Pat Hanrahan, Steve Dai |
---|---|
Publication year: | 2021 |
Subject: |
Computer science; Dataflow; Software engineering; Image processing; Engineering and technology; Deadlock; Stencil; Computer hardware & architecture; Embedded system; Digital image processing; Electrical engineering, electronic engineering, information engineering; Overhead (computing); Hardware acceleration; Compiler; Field-programmable gate array; Throughput (business) |
Source: | FPGA FCCM |
Description: | Image processing applications can benefit tremendously from FPGA acceleration. However, hardware accelerators for these applications look very different from the programs that image processing algorithm designers are accustomed to writing. As a result, many image processing hardware compilers have been designed to generate hardware accelerators from high-level specifications of image processing algorithms. Unfortunately, all of these compilers either exclude crucial access patterns, do not scale to realistic-size applications, or rely on a compilation process in which each stage of the application is an independently scheduled module that sends data to its consumers through FIFOs, which adds resource and energy overhead while inhibiting synthesis optimizations. In this paper we present a new algorithm for compiling image processing applications, Clockwork, that uses a combination of techniques from polyhedral analysis and synchronous dataflow (SDF) to overcome these limitations. Clockwork compiles the entire application into one flat, statically scheduled module. As a result, accelerators produced by Clockwork have fixed latency, cannot deadlock, and have no resource overhead from inter-stage FIFOs. We show that designs generated by Clockwork achieve on average a 55% reduction in LUTs, a 30% reduction in flip-flops, and a 22% reduction in BRAMs compared to a state-of-the-art stencil compiler at the same throughput, while handling a wider range of access patterns. Clockwork scales to applications with more than 100,000 LUTs. For an application with dozens of stages, Clockwork achieves energy efficiency 260x that of an 8-thread Intel CPU, 17x that of an NVIDIA K80 GPU, and 2.4x that of an NVIDIA V100 GPU. |
Database: | OpenAIRE |
External link: |