Architecture supported register stash for GPGPU
Authors: | Yulong Pei, Tianzhou Chen, Licheng Yu, Minghui Wu |
---|---|
Year of publication: | 2016 |
Subject: | Speedup; Computer Networks and Communications; Computer science; Register file; Multiprocessing; 02 engineering and technology; Thread (computing); Parallel computing; computer.software_genre; 01 natural sciences; Theoretical Computer Science; Artificial Intelligence; Control register; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Memory type range register; Hardware_REGISTER-TRANSFER-LEVELIMPLEMENTATION; 010302 applied physics; Processor register; FLAGS register; Register renaming; 020202 computer hardware & architecture; Hardware and Architecture; Status register; Operating system; Compiler; Memory data register; computer; Software; Register allocation |
Source: | Journal of Parallel and Distributed Computing. 89:25-36 |
ISSN: | 0743-7315 |
Description: | GPGPUs provide abundant hardware resources to support a large number of lightweight threads, which are organized into blocks and run in warps. All threads of a block must be dispatched together to one streaming multiprocessor (SM). When the remaining resources of an SM cannot support one more block, all threads of that block are held back until earlier blocks retire from the SM. We found that the register file tends to be the most limiting of these resources, especially for SMs with fewer registers. We also characterized the dynamics of a thread's register requirement: only part of its pre-allocated registers are in use at any given instruction at run time, which results in considerable register underutilization. We propose the architecture supported register stash (ASRS), which removes the register limitation when dispatching blocks. Hardware registers are allocated at run time according to each instruction's live registers, which a compiler can determine statically. When the hardware registers cannot meet the requirements of all running warps, some warps are suspended, their registers are reclaimed temporarily, and the data in those registers are stashed to memory. Conversely, when spare hardware registers are available, the ASRS starts a new warp, or resumes a suspended warp once all of that warp's stashed register data have been loaded back from memory. Intra-block synchronization is also handled when some warps of a block are not schedulable because of the ASRS. The ASRS alleviates register underutilization and improves performance without modifying the current programming model or demanding extra effort from programmers. It also enables an SM whose register file is too small to hold even a single block to execute that block. In addition, it helps lower register file energy consumption and increases power efficiency. 
The ASRS achieved speedups of 1.59 and 1.14 when the registers of each SM were limited to 8K and 16K respectively, with an insignificant overhead. Relative to an infinite register file, the speedups are 0.84 and 0.98 with 8K and 16K registers respectively. Compared with the baseline 32K register file, the ASRS reduces register file energy consumption to 66.5% and 75.8% with 8K and 16K registers respectively. The corresponding power efficiencies (the ratio of performance to power) increase to 1.29x and 1.31x respectively. The register requirement of a GPGPU varies among kernels and during run time. Register file (RF) capacity limits the number of schedulable warps and thus performance. Reducing RF capacity lowers energy consumption and area. We proposed a method to support more warps with limited registers. It gains significant speedup and higher energy efficiency with a smaller RF. |
Database: | OpenAIRE |
External link: |
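
The allocation idea in the description can be illustrated with a small toy model. The sketch below is not the paper's hardware design: the function names, the one-instruction-per-warp-per-cycle timing, the fixed warp priority order, and the example live-register counts are all assumptions made for illustration, and the model ignores stash/restore latency, memory traffic, and intra-block barriers. It only contrasts reserving each warp's peak register count for its whole lifetime with granting registers per instruction according to that instruction's live-register count.

```python
# Toy model (hypothetical, for illustration only): each warp is a list of
# per-instruction live-register counts, as a compiler might report them.
# Every program must be non-empty.

def run_static(programs, capacity):
    """Reserve max(live) registers per warp for its whole lifetime."""
    pending = list(range(len(programs)))
    active, pc = [], {}
    free, cycles = capacity, 0
    while pending or active:
        # Admit waiting warps in order while their peak demand still fits.
        while pending and max(programs[pending[0]]) <= free:
            w = pending.pop(0)
            free -= max(programs[w])
            active.append(w)
            pc[w] = 0
        if not active:
            raise ValueError("a warp needs more registers than the file holds")
        cycles += 1
        for w in list(active):
            pc[w] += 1  # every admitted warp issues one instruction
            if pc[w] == len(programs[w]):
                active.remove(w)
                free += max(programs[w])  # reservation released on retire
    return cycles


def run_dynamic(programs, capacity):
    """Grant registers per instruction; suspend warps that do not fit."""
    n = len(programs)
    pc = [0] * n
    running = [False] * n  # True if the warp issued in the previous cycle
    cycles = stashes = 0
    while any(pc[w] < len(programs[w]) for w in range(n)):
        free, progressed = capacity, False
        for w in range(n):  # lower index = higher priority (assumption)
            if pc[w] >= len(programs[w]):
                continue
            need = programs[w][pc[w]]
            if need <= free:
                free -= need
                pc[w] += 1
                running[w] = pc[w] < len(programs[w])
                progressed = True
            else:
                if running[w]:
                    stashes += 1  # live registers would be stashed to memory
                running[w] = False
        if not progressed:
            raise ValueError("an instruction needs more registers than the file holds")
        cycles += 1
    return cycles, stashes


# Three identical warps whose register pressure peaks at one instruction.
programs = [[1, 1, 6, 1]] * 3
static_cycles = run_static(programs, capacity=8)
dynamic_cycles, stash_events = run_dynamic(programs, capacity=8)
```

In this toy setting the static policy admits only one warp at a time because each reserves its peak of 6 registers, while the per-instruction policy lets all three warps overlap outside their peak instruction. Real gains depend on the stash and reload costs that this sketch leaves out.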