Showing 1 - 10 of 453 for search: '"P. Whatmough"'
Author:
Federici, Marco, Belli, Davide, van Baalen, Mart, Jalalirad, Amir, Skliar, Andrii, Major, Bence, Nagel, Markus, Whatmough, Paul
While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural d… (sketch below)
External link:
http://arxiv.org/abs/2412.01380
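The memory-bound claim above follows from simple roofline arithmetic: during autoregressive decoding, every weight is streamed from DRAM roughly once per generated token, so bandwidth caps throughput. A minimal sketch; the model size and bandwidth figures are illustrative assumptions, not numbers from the paper:

```python
# Roofline ceiling for LLM token generation on a mobile device.
# All figures are illustrative assumptions, not taken from the paper.
model_params = 3e9        # 3B-parameter LLM
bytes_per_param = 2       # fp16 weights
dram_bandwidth = 50e9     # ~50 GB/s mobile LPDDR

# Decoding one token reads (roughly) every weight once from DRAM.
bytes_per_token = model_params * bytes_per_param
ceiling_tokens_per_s = dram_bandwidth / bytes_per_token

print(f"bandwidth-bound ceiling: {ceiling_tokens_per_s:.1f} tokens/s")  # ~8.3
```

Even with ample compute, this ceiling does not move unless fewer bytes are read per token, which is why sparsity and compression are the levers of interest.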
Author:
Skliar, Andrii, van Rozendaal, Ties, Lepert, Romain, Boinovski, Todor, van Baalen, Mart, Nagel, Markus, Whatmough, Paul, Bejnordi, Babak Ehteshami
Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challengi… (sketch below)
External link:
http://arxiv.org/abs/2412.00099
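As a concrete picture of "selectively engaging specialized subnetworks", here is a toy top-2 MoE layer in NumPy. The routing scheme (softmax over the two highest gate scores) is the common formulation, assumed here rather than taken from this entry:

```python
import numpy as np

# Minimal top-2 MoE layer: a router picks 2 of E experts per token, so only
# those experts' weights must be resident in memory for that token.
rng = np.random.default_rng(0)
E, d_in, d_out, k = 8, 16, 16, 2
experts = rng.normal(size=(E, d_in, d_out)) * 0.1   # expert weight matrices
router = rng.normal(size=(d_in, E)) * 0.1           # gating projection

def moe_forward(x):
    logits = x @ router                       # (E,) routing scores
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    w = np.exp(logits[top]); w /= w.sum()     # softmax over selected experts
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.normal(size=d_in))
print(y.shape)  # (16,)
```

The deployment difficulty the abstract alludes to is visible here: which experts are needed is only known after routing, so naive on-demand loading puts expert weights on the critical path.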
Author:
Bhardwaj, Kartikeya, Pandey, Nilesh Prasad, Priyadarshi, Sweta, Ganapathy, Viswanath, Esteves, Rafael, Kadambi, Shreya, Borse, Shubhankar, Whatmough, Paul, Garrepalli, Risheek, Van Baalen, Mart, Teague, Harris, Nagel, Markus
In this paper, we propose Sparse High Rank Adapters (SHiRA) that directly finetune 1-2% of the base model weights while leaving others unchanged, thus resulting in a highly sparse adapter. This high sparsity incurs no inference overhead, enables rap… (sketch below)
External link:
http://arxiv.org/abs/2407.16712
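One plausible reading of "directly finetune 1-2% of the base model weights" is to fix a sparse mask up front and let updates touch only the masked entries; the adapter is then just the sparse difference from the base weights. A NumPy sketch under that assumption (mask choice and learning rate are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))         # frozen base weight
mask = rng.random(W.shape) < 0.02     # ~2% trainable entries, fixed up front

def sgd_step(W, grad, lr=1e-2):
    # Update only the masked ~2%; everything else keeps its base value,
    # so the "adapter" is the sparse difference W - W_base.
    return W - lr * (grad * mask)

grad = rng.normal(size=W.shape)       # stand-in for a real gradient
W_new = sgd_step(W, grad)
print("changed entries:", int((W_new != W).sum()), "/", W.size)
```

Because the trained weights live at the same addresses as the base weights, applying the adapter is an overwrite, not an extra matmul, which is where the "no inference overhead" claim comes from.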
Author:
Bhardwaj, Kartikeya, Pandey, Nilesh Prasad, Priyadarshi, Sweta, Ganapathy, Viswanath, Esteves, Rafael, Kadambi, Shreya, Borse, Shubhankar, Whatmough, Paul, Garrepalli, Risheek, Van Baalen, Mart, Teague, Harris, Nagel, Markus
Low Rank Adaptation (LoRA) has gained massive attention in recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment… (sketch below)
External link:
http://arxiv.org/abs/2406.13175
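The zero-overhead fusion mentioned above is plain matrix addition: the low-rank product BA is folded into the base weight once, after which inference is a single dense matmul. A sketch using the standard LoRA parameterization (the alpha/r scaling is the usual convention, assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 512, 8, 16
W = rng.normal(size=(d, d))            # pretrained weight
A = rng.normal(size=(r, d)) * 0.01     # LoRA down-projection
B = rng.normal(size=(d, r)) * 0.01     # LoRA up-projection (zero-init in
                                       # practice; random here so the check
                                       # below is non-trivial)

W_fused = W + (alpha / r) * (B @ A)    # one-time fuse: no extra inference cost

x = rng.normal(size=d)
assert np.allclose(W_fused @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

The mobile-deployment tension the truncated sentence points at: once fused, swapping adapters means rewriting the full dense W, which is expensive on-device.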
As Neural Processing Units (NPUs) or accelerators are increasingly deployed in a variety of applications, including safety-critical ones such as autonomous vehicles and medical imaging, it is critical to understand the fault-tolerance nature of… (sketch below)
External link:
http://arxiv.org/abs/2404.09317
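A standard way to probe fault tolerance empirically (a generic technique, not necessarily this paper's methodology) is to inject single bit flips into stored weights and observe the output error; exponent-bit flips in float32 are the catastrophic case:

```python
import numpy as np

def flip_bit(weights: np.ndarray, idx: int, bit: int) -> np.ndarray:
    """Flip one bit of the float32 weight at flat index idx."""
    out = weights.copy()
    as_int = out.view(np.uint32)             # reinterpret bits in place
    as_int.flat[idx] ^= np.uint32(1 << bit)
    return out

w = np.array([0.5, -1.25, 3.0], dtype=np.float32)
w_faulty = flip_bit(w, idx=0, bit=30)        # hit a high exponent bit
print(w[0], "->", w_faulty[0])               # 0.5 -> ~1.7e38
```

Sweeping idx and bit over a model's weights, then measuring the task-accuracy drop per fault, yields the kind of fault-tolerance characterization such studies report.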
Author:
van Baalen, Mart, Kuzmin, Andrey, Nagel, Markus, Couperus, Peter, Bastoul, Cedric, Mahurin, Eric, Blankevoort, Tijmen, Whatmough, Paul
In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantizat… (sketch below)
External link:
http://arxiv.org/abs/2402.15319
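"Increasing the quantization dimensionality" means quantizing short vectors of weights against a shared codebook instead of rounding scalars, so each d-dimensional group is stored as one index. A toy plain-VQ baseline with a k-means codebook (GPTVQ itself uses a faster, data-aware procedure; this sketch only illustrates the storage format):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128)).astype(np.float32)

d, k = 2, 256                               # 2-dim vectors, 256-entry codebook
groups = W.reshape(-1, d)                   # split weights into d-dim vectors
km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(groups)
codes = km.labels_                          # one 8-bit index per group
codebook = km.cluster_centers_

W_hat = codebook[codes].reshape(W.shape)    # dequantized weights
bits_per_weight = 8 / d + (k * d * 32) / W.size  # index bits + codebook share
print(f"{bits_per_weight:.2f} bits/weight, MSE={np.mean((W - W_hat)**2):.4f}")
```

At d=1 this degenerates to scalar quantization; raising d lets the codebook capture correlations between neighboring weights, which is the trade-off improvement the abstract claims.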
Author:
Chai, Yuji, Tripathy, Devashree, Zhou, Chuteng, Gope, Dibakar, Fedorov, Igor, Matas, Ramon, Brooks, David, Wei, Gu-Yeon, Whatmough, Paul
The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN-based models. This ability i… (sketch below)
External link:
http://arxiv.org/abs/2301.10999
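A crude baseline for such a predictor (the paper's learned predictor is more sophisticated; the peak figures here are illustrative assumptions) is a per-layer roofline bound, max(compute time, memory time):

```python
# Roofline-style per-layer latency lower bound; peak figures are assumptions.
PEAK_FLOPS = 4e12   # 4 TFLOP/s accelerator
PEAK_BW = 50e9      # 50 GB/s DRAM bandwidth

def layer_latency(flops: float, bytes_moved: float) -> float:
    """A layer cannot finish faster than compute or memory traffic allows."""
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

# e.g. a 1024x1024 fp16 matvec during single-token decoding (memory-bound):
flops = 2 * 1024 * 1024
bytes_moved = 1024 * 1024 * 2
print(f"{layer_latency(flops, bytes_moved) * 1e6:.1f} us")  # ~41.9 us
```

Summing this bound over layers gives a first-order latency estimate; learned predictors exist precisely because real hardware deviates from it (caching, scheduling, fusion).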
As Deep Neural Networks (DNNs) are increasingly deployed in safety-critical and privacy-sensitive applications such as autonomous driving and biometric authentication, it is critical to understand the fault-tolerance nature of DNNs. Prior work primar…
External link:
http://arxiv.org/abs/2212.02649
Author:
Bhardwaj, Kartikeya, Ward, James, Tung, Caleb, Gope, Dibakar, Meng, Lingchuan, Fedorov, Igor, Chalfin, Alex, Whatmough, Paul, Loh, Danny
Is it possible to restructure the non-linear activation functions in a deep network to create hardware-efficient models? To address this question, we propose a new paradigm called Restructurable Activation Networks (RANs) that manipulate the amount o… (sketch below)
External link:
http://arxiv.org/abs/2208.08562
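The abstract is cut off, but the general knob it gestures at, trading non-linearity against hardware cost, can be illustrated by mechanically swapping activation modules (a generic PyTorch recipe, not the RAN method itself):

```python
import torch.nn as nn

def replace_activations(model: nn.Module, old=nn.GELU, new=nn.ReLU):
    """Recursively swap activation modules, e.g. GELU -> cheaper ReLU."""
    for name, child in model.named_children():
        if isinstance(child, old):
            setattr(model, name, new())   # re-registers the child module
        else:
            replace_activations(child, old, new)
    return model

net = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 4), nn.GELU())
print(replace_activations(net))           # both GELUs are now ReLUs
```

After such a swap the model generally needs finetuning to recover accuracy; restructuring methods aim to make that accuracy/hardware trade explicit rather than ad hoc.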
Author:
Fedorov, Igor, Matas, Ramon, Tann, Hokchhay, Zhou, Chuteng, Mattina, Matthew, Whatmough, Paul
Deploying TinyML models on low-cost IoT hardware is very challenging due to limited device memory capacity. Neural processing unit (NPU) hardware addresses the memory challenge by using model compression to exploit weight quantization and sparsity to… (sketch below)
External link:
http://arxiv.org/abs/2201.05842
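The footprint arithmetic behind quantization-plus-sparsity compression is straightforward. A sketch assuming int8 values and a 1-bit-per-weight presence bitmask (an illustrative storage scheme, not the NPU's actual format):

```python
# Footprint of a weight tensor under int8 quantization + unstructured sparsity.
# The 1-bit-per-weight bitmask scheme is an illustrative assumption.
n_weights = 1_000_000
fp32_bytes = n_weights * 4

sparsity = 0.75                                    # 75% of weights are zero
nonzero = int(n_weights * (1 - sparsity))
int8_sparse_bytes = nonzero * 1 + n_weights // 8   # values + presence bitmask

print(f"fp32: {fp32_bytes / 1e6:.2f} MB -> compressed: "
      f"{int8_sparse_bytes / 1e6:.2f} MB "
      f"({fp32_bytes / int8_sparse_bytes:.1f}x)")  # ~10.7x
```

On memory-capacity-limited IoT parts, this roughly 10x reduction is often the difference between a model fitting on-chip and not fitting at all.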