Showing 1 - 10 of 2,305 results for search: '"Oliaro, A."'
We present SuffixDecoding, a novel model-free approach to accelerating large language model (LLM) inference through speculative decoding. Unlike existing methods that rely on draft models or specialized decoding heads, SuffixDecoding leverages suffix
External link:
http://arxiv.org/abs/2411.04975
We characterize, using time-frequency analysis, the continuity and compactness of the Weyl operator in global classes of ultradifferentiable functions $\mathcal{S}_\omega$, for weight functions $\omega$ in the sense of Braun, Meise and Taylor. As a c
External link:
http://arxiv.org/abs/2407.14990
Authors:
Hu, Muyan, Venkatram, Ashwin, Biswas, Shreyashri, Marimuthu, Balamurugan, Hou, Bohan, Oliaro, Gabriele, Wang, Haojie, Zheng, Liyan, Miao, Xupeng, Zhai, Jidong
Published in:
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (2024), pp. 755-769
Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applyin
External link:
http://arxiv.org/abs/2406.09465
The usefulness of time-frequency analysis methods in the study of quasicrystals was pointed out in a previous paper, where we proved that a tempered distribution $\mu$ on ${\mathbb R}^d$ whose Wigner transform is a measure supported on the cartesian
External link:
http://arxiv.org/abs/2405.01907
Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because e
External link:
http://arxiv.org/abs/2402.18789
In this paper we give different estimates between Lebesgue norms of quadratic time-frequency representations. We show that, in some cases, it is not possible to have such bounds in classical $L^p$ spaces, but the Lebesgue norm needs to be suitably we
External link:
http://arxiv.org/abs/2402.17578
Published in:
Mediterr. J. Math. 21, art. no. 153, 2024
We study and characterize the inclusion relations of global classes in the general weight matrix framework in terms of growth relations for the defining weight matrices. We consider the Roumieu and Beurling cases, and as a particular case we also tre
External link:
http://arxiv.org/abs/2401.11251
We give a simple construction of the log-convex minorant of a sequence $\{M_\alpha\}_{\alpha\in\mathbb{N}_0^d}$ and consequently extend to the $d$-dimensional case the well-known formula that relates a log-convex sequence $\{M_p\}_{p\in\mathbb{N}_0}$
External link:
http://arxiv.org/abs/2401.11245
Authors:
Miao, Xupeng, Oliaro, Gabriele, Zhang, Zhihao, Cheng, Xinhao, Jin, Hongyi, Chen, Tianqi, Jia, Zhihao
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying
External link:
http://arxiv.org/abs/2312.15234
Authors:
Miao, Xupeng, Oliaro, Gabriele, Zhang, Zhihao, Cheng, Xinhao, Wang, Zeyu, Zhang, Zhengxin, Wong, Rae Ying Yee, Zhu, Alan, Yang, Lijie, Shi, Xiaoxiang, Shi, Chunan, Chen, Zhuoming, Arfeen, Daiyaan, Abhyankar, Reyna, Jia, Zhihao
This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's
External link:
http://arxiv.org/abs/2305.09781