Showing 1 - 3 of 3 results for the search: '"Duanmu, Haojie"'
Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks such as book understanding and the generation of lengthy novels. However, the key-value (KV) cache that LLMs require consumes substantial memory as context length increases…
External link:
http://arxiv.org/abs/2405.06219
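The memory pressure described in the abstract above can be illustrated with a back-of-envelope estimate. The sketch below is not from the paper; the model configuration (a generic 7B-class transformer in fp16) is an illustrative assumption:

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                   batch=1, dtype_bytes=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Each token stores one key and one value vector (hence the factor 2)
    of size head_dim per attention head, per layer.
    """
    return 2 * num_layers * num_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed 7B-class configuration: 32 layers, 32 heads, head_dim 128, fp16.
# At a 128k-token context, the cache alone needs roughly 67 GB:
gb = kv_cache_bytes(32, 32, 128, seq_len=128_000) / 1e9  # ≈ 67 GB
```

The linear growth in `seq_len` is why long-context serving quickly becomes memory-bound rather than compute-bound.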
Authors:
Duan, Jiangfei; Lu, Runyu; Duanmu, Haojie; Li, Xiuhong; Zhang, Xingcheng; Lin, Dahua; Stoica, Ion; Zhang, Hao
Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use cases such as chat, programming, and search. However, efficiently serving multiple LLMs poses significant challenges…
External link:
http://arxiv.org/abs/2404.02015
Large language models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization…
External link:
http://arxiv.org/abs/2402.12065
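To make the quantization theme of the abstract above concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, the simplest variant of the technique. It is a generic illustration, not the method proposed in the paper:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: a single scale maps max |x| to 127.
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original fp32 values.
    return q.astype(np.float32) * scale

# Example: a small weight vector round-trips with bounded error.
x = np.array([0.5, -1.0, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
```

Storing int8 instead of fp16 halves memory; real KV-cache or weight quantization schemes add per-channel or per-group scales to control the rounding error seen in `x_hat`.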