35.3 Thread-Level Power Management for a Current- and Temperature-Limiting System in a 7nm Hexagon™ Processor

Authors: Vijay Kiran Kalyanam, Eric Mahurin, Keith Bowman, Suresh K. Venkumahanti
Year of publication: 2021
Subject:
Source: ISSCC
Description: The Hexagon™ compute DSP (CDSP) integrates a master VLIW scalar processor and a slave vector coprocessor to enable high-performance and energy-efficient computing for multimedia, voice, audio, vision, imaging, and machine-learning (ML) applications [1]. The master processor executes scalar instruction packets and issues vector instruction packets to the slave coprocessor. The vector coprocessor executes wide-data arithmetic and memory operations for significant processing at the cost of high power. The power delivery for a mobile system-on-chip (SoC) processor consists of a battery that drives a PMIC to generate the SoC supply voltage ($V_{DD}$) rails. The PMIC voltage regulator (VR) supplies $V_{DD}$ while operating below a peak-current specification (spec). If the CDSP exceeds the peak-current spec for a sustained duration, then the battery and/or PMIC VR may incur a brownout condition where $V_{DD}$ degrades, resulting in circuit failures. Thus, the CDSP requires a current-limiting system to prevent brownout. The latency requirement to detect the current exceeding the peak-current spec and then to respond by operating at a lower current is $\sim 1\,\mu\mathrm{s}$. Also, the SoC must operate within a target thermal design power and temperature with detection and response latencies in 100s of $\mu\mathrm{s}$ and 10s of ms, respectively. Prior current- and temperature-limiting systems lower the phase-locked loop (PLL) clock frequency ($F_{CLK}$) or reduce $V_{DD}$ and $F_{CLK}$ [2], [3] in response to exceeding current or temperature specs. Although these techniques are effective for response latencies above $\sim 10\,\mu\mathrm{s}$, the time to reduce the PLL $F_{CLK}$ or $V_{DD}$ and $F_{CLK}$ far exceeds the $1\,\mu\mathrm{s}$ latency spec. Alternative approaches for satisfying the $1\,\mu\mathrm{s}$ latency target include integrating an adaptive clocking circuit after the PLL [4] or throttling the instruction-issue rate [5] to quickly change performance.
These designs satisfy the latency spec by globally reducing performance without considering individual thread power or priority. This paper describes a thread-level power management (TPM) design to adapt the instruction-issue rate based on individual thread power and priority for a current- and temperature-limiting system in a 7nm [6] Hexagon CDSP. The TPM exploits low-power phases during thread execution to adjust the thread instruction-issue rate, achieving higher performance at a target power than global throttling.
Database: OpenAIRE