Author:
Scherer, Moritz; Macan, Luka; Jung, Victor J. B.; Wiese, Philip; Bompani, Luca; Burrello, Alessio; Conti, Francesco; Benini, Luca
Source:
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems; November 2024, Vol. 43, Issue 11, pp. 4009-4020, 12 pp.
Abstract:
With the rise of embodied foundation models (EFMs), most notably small language models (SLMs), adapting Transformers for edge applications has become a very active field of research. However, achieving end-to-end deployment of SLMs on microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this article, we demonstrate high-efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multidimensional memory-versus-computation tradeoffs involved in aggressive SLM deployment on heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel deep neural network (DNN) compiler that generates highly optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates end-to-end code for executing SLMs, fully exploiting the RV32 cores' instruction extensions and the NPU. We achieve leading-edge energy efficiency and throughput of $490\,\mu$J per token at 340 tokens per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without external memory.
Database:
Supplemental Index
External link:
|