Showing 1 - 10 of 453 for search: '"P. Whatmough"'
Author:
Federici, Marco, Belli, Davide, van Baalen, Mart, Jalalirad, Amir, Skliar, Andrii, Major, Bence, Nagel, Markus, Whatmough, Paul
While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural d… (sketch below)
External link:
http://arxiv.org/abs/2412.01380
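The memory-bound claim above follows from simple roofline arithmetic: during autoregressive decoding, every weight is streamed from DRAM roughly once per generated token, so bandwidth caps throughput. A minimal sketch; the model size and bandwidth figures are illustrative assumptions, not numbers from the paper:

```python
# Roofline ceiling for LLM token generation on a mobile device.
# All figures are illustrative assumptions, not taken from the paper.
model_params = 3e9        # 3B-parameter LLM
bytes_per_param = 2       # fp16 weights
dram_bandwidth = 50e9     # ~50 GB/s mobile LPDDR

# Decoding one token reads (roughly) every weight once from DRAM.
bytes_per_token = model_params * bytes_per_param
ceiling_tokens_per_s = dram_bandwidth / bytes_per_token

print(f"bandwidth-bound ceiling: {ceiling_tokens_per_s:.1f} tokens/s")  # ~8.3
```

Even with ample compute, this ceiling does not move unless fewer bytes are read per token, which is why sparsity and compression are the levers of interest.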
Author:
Skliar, Andrii, van Rozendaal, Ties, Lepert, Romain, Boinovski, Todor, van Baalen, Mart, Nagel, Markus, Whatmough, Paul, Bejnordi, Babak Ehteshami
Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challengi… (sketch below)
External link:
http://arxiv.org/abs/2412.00099
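As a concrete picture of "selectively engaging specialized subnetworks", here is a toy top-2 MoE layer in NumPy. The routing scheme (softmax over the two highest gate scores) is the common formulation, assumed here rather than taken from this entry:

```python
import numpy as np

# Minimal top-2 MoE layer: a router picks 2 of E experts per token, so only
# those experts' weights must be resident in memory for that token.
rng = np.random.default_rng(0)
E, d_in, d_out, k = 8, 16, 16, 2
experts = rng.normal(size=(E, d_in, d_out)) * 0.1   # expert weight matrices
router = rng.normal(size=(d_in, E)) * 0.1           # gating projection

def moe_forward(x):
    logits = x @ router                       # (E,) routing scores
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    w = np.exp(logits[top]); w /= w.sum()     # softmax over selected experts
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.normal(size=d_in))
print(y.shape)  # (16,)
```

The deployment difficulty the abstract alludes to is visible here: which experts are needed is only known after routing, so naive on-demand loading puts expert weights on the critical path.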
Author:
Bhardwaj, Kartikeya, Pandey, Nilesh Prasad, Priyadarshi, Sweta, Ganapathy, Viswanath, Esteves, Rafael, Kadambi, Shreya, Borse, Shubhankar, Whatmough, Paul, Garrepalli, Risheek, Van Baalen, Mart, Teague, Harris, Nagel, Markus
In this paper, we propose Sparse High Rank Adapters (SHiRA) that directly finetune 1-2% of the base model weights while leaving others unchanged, thus resulting in a highly sparse adapter. This high sparsity incurs no inference overhead, enables rap… (sketch below)
External link:
http://arxiv.org/abs/2407.16712
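One plausible reading of "directly finetune 1-2% of the base model weights" is to fix a sparse mask up front and let updates touch only the masked entries; the adapter is then just the sparse difference from the base weights. A NumPy sketch under that assumption (mask choice and learning rate are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))         # frozen base weight
mask = rng.random(W.shape) < 0.02     # ~2% trainable entries, fixed up front

def sgd_step(W, grad, lr=1e-2):
    # Update only the masked ~2%; everything else keeps its base value,
    # so the "adapter" is the sparse difference W - W_base.
    return W - lr * (grad * mask)

grad = rng.normal(size=W.shape)       # stand-in for a real gradient
W_new = sgd_step(W, grad)
print("changed entries:", int((W_new != W).sum()), "/", W.size)
```

Because the trained weights live at the same addresses as the base weights, applying the adapter is an overwrite, not an extra matmul, which is where the "no inference overhead" claim comes from.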
Author:
Bhardwaj, Kartikeya, Pandey, Nilesh Prasad, Priyadarshi, Sweta, Ganapathy, Viswanath, Esteves, Rafael, Kadambi, Shreya, Borse, Shubhankar, Whatmough, Paul, Garrepalli, Risheek, Van Baalen, Mart, Teague, Harris, Nagel, Markus
Low Rank Adaptation (LoRA) has gained massive attention in recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment… (sketch below)
External link:
http://arxiv.org/abs/2406.13175
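The zero-overhead fusion mentioned above is plain matrix addition: the low-rank product BA is folded into the base weight once, after which inference is a single dense matmul. A sketch using the standard LoRA parameterization (the alpha/r scaling is the usual convention, assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 8, 16
W = rng.normal(size=(d, d))            # pretrained weight
A = rng.normal(size=(r, d)) * 0.01     # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.01     # LoRA up-projection (zero-init in
                                       # practice; random here so the check
                                       # below is non-trivial)

W_fused = W + (alpha / r) * (B @ A)    # one-time fuse: no extra inference cost

x = rng.normal(size=d)
assert np.allclose(W_fused @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

The mobile-deployment tension the truncated sentence points at: once fused, swapping adapters means rewriting the full dense W, which is expensive on-device.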
As Neural Processing Units (NPUs) or accelerators are increasingly deployed in a variety of applications, including safety-critical ones such as autonomous vehicles and medical imaging, it is critical to understand the fault-tolerance nature of… (sketch below)
External link:
http://arxiv.org/abs/2404.09317
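A standard way to probe fault tolerance empirically (a generic technique, not necessarily this paper's methodology) is to inject single bit flips into stored weights and observe the output error; exponent-bit flips in float32 are the catastrophic case:

```python
import numpy as np

def flip_bit(weights: np.ndarray, idx: int, bit: int) -> np.ndarray:
    """Flip one bit of the float32 weight at flat index idx."""
    out = weights.copy()
    as_int = out.view(np.uint32)             # reinterpret bits in place
    as_int.flat[idx] ^= np.uint32(1 << bit)
    return out

w = np.array([0.5, -1.25, 3.0], dtype=np.float32)
w_faulty = flip_bit(w, idx=0, bit=30)        # hit a high exponent bit
print(w[0], "->", w_faulty[0])               # 0.5 -> ~1.7e38
```

Sweeping idx and bit over a model's weights, then measuring the task-accuracy drop per fault, yields the kind of fault-tolerance characterization such studies report.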
Author:
van Baalen, Mart, Kuzmin, Andrey, Nagel, Markus, Couperus, Peter, Bastoul, Cedric, Mahurin, Eric, Blankevoort, Tijmen, Whatmough, Paul
In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantizat… (sketch below)
External link:
http://arxiv.org/abs/2402.15319
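"Increasing the quantization dimensionality" means quantizing short vectors of weights against a shared codebook instead of rounding scalars, so each d-dimensional group is stored as one index. A toy plain-VQ baseline with a k-means codebook (GPTVQ itself uses a faster, data-aware procedure; this sketch only illustrates the storage format):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128)).astype(np.float32)

d, k = 2, 256                               # 2-dim vectors, 256-entry codebook
groups = W.reshape(-1, d)                   # split weights into d-dim vectors
km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(groups)
codes = km.labels_                          # one 8-bit index per group
codebook = km.cluster_centers_

W_hat = codebook[codes].reshape(W.shape)    # dequantized weights
bits_per_weight = 8 / d + (k * d * 32) / W.size  # index bits + codebook share
print(f"{bits_per_weight:.2f} bits/weight, MSE={np.mean((W - W_hat)**2):.4f}")
```

At d=1 this degenerates to scalar quantization; raising d lets the codebook capture correlations between neighboring weights, which is the trade-off improvement the abstract claims.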
Author:
Chai, Yuji, Tripathy, Devashree, Zhou, Chuteng, Gope, Dibakar, Fedorov, Igor, Matas, Ramon, Brooks, David, Wei, Gu-Yeon, Whatmough, Paul
The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN-based models. This ability i… (sketch below)
External link:
http://arxiv.org/abs/2301.10999
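A crude baseline for such a predictor (the paper's learned predictor is more sophisticated; the peak figures here are illustrative assumptions) is a per-layer roofline bound, max(compute time, memory time):

```python
# Roofline-style per-layer latency lower bound; peak figures are assumptions.
PEAK_FLOPS = 4e12   # 4 TFLOP/s accelerator
PEAK_BW = 50e9      # 50 GB/s DRAM bandwidth

def layer_latency(flops: float, bytes_moved: float) -> float:
    """A layer cannot finish faster than compute or memory traffic allows."""
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

# e.g. a 1024x1024 fp16 matvec during single-token decoding (memory-bound):
flops = 2 * 1024 * 1024
bytes_moved = 1024 * 1024 * 2
print(f"{layer_latency(flops, bytes_moved) * 1e6:.1f} us")  # ~41.9 us
```

Summing this bound over layers gives a first-order latency estimate; learned predictors exist precisely because real hardware deviates from it (caching, scheduling, fusion).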
As Deep Neural Networks (DNNs) are increasingly deployed in safety-critical and privacy-sensitive applications such as autonomous driving and biometric authentication, it is critical to understand the fault-tolerance nature of DNNs. Prior work primar…
External link:
http://arxiv.org/abs/2212.02649
Author:
Bhardwaj, Kartikeya, Ward, James, Tung, Caleb, Gope, Dibakar, Meng, Lingchuan, Fedorov, Igor, Chalfin, Alex, Whatmough, Paul, Loh, Danny
Is it possible to restructure the non-linear activation functions in a deep network to create hardware-efficient models? To address this question, we propose a new paradigm called Restructurable Activation Networks (RANs) that manipulate the amount o… (sketch below)
External link:
http://arxiv.org/abs/2208.08562
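The abstract is cut off, but the general knob it gestures at, trading non-linearity against hardware cost, can be illustrated by mechanically swapping activation modules (a generic PyTorch recipe, not the RAN method itself):

```python
import torch.nn as nn

def replace_activations(model: nn.Module, old=nn.GELU, new=nn.ReLU):
    """Recursively swap activation modules, e.g. GELU -> cheaper ReLU."""
    for name, child in model.named_children():
        if isinstance(child, old):
            setattr(model, name, new())   # re-registers the child module
        else:
            replace_activations(child, old, new)
    return model

net = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 4), nn.GELU())
print(replace_activations(net))           # both GELUs are now ReLUs
```

After such a swap the model generally needs finetuning to recover accuracy; restructuring methods aim to make that accuracy/hardware trade explicit rather than ad hoc.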
Author:
Fedorov, Igor, Matas, Ramon, Tann, Hokchhay, Zhou, Chuteng, Mattina, Matthew, Whatmough, Paul
Deploying TinyML models on low-cost IoT hardware is very challenging due to limited device memory capacity. Neural processing unit (NPU) hardware addresses the memory challenge by using model compression to exploit weight quantization and sparsity to… (sketch below)
External link:
http://arxiv.org/abs/2201.05842
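The footprint arithmetic behind quantization-plus-sparsity compression is straightforward. A sketch assuming int8 values and a 1-bit-per-weight presence bitmask (an illustrative storage scheme, not the NPU's actual format):

```python
# Footprint of a weight tensor under int8 quantization + unstructured sparsity.
# The 1-bit-per-weight bitmask scheme is an illustrative assumption.
n_weights = 1_000_000
fp32_bytes = n_weights * 4

sparsity = 0.75                                    # 75% of weights are zero
nonzero = int(n_weights * (1 - sparsity))
int8_sparse_bytes = nonzero * 1 + n_weights // 8   # values + presence bitmask

print(f"fp32: {fp32_bytes / 1e6:.2f} MB -> compressed: "
      f"{int8_sparse_bytes / 1e6:.2f} MB "
      f"({fp32_bytes / int8_sparse_bytes:.1f}x)")  # ~10.7x
```

On memory-capacity-limited IoT parts, this roughly 10x reduction is often the difference between a model fitting on-chip and not fitting at all.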