Showing 1 - 1 of 1 for search: '"Sankaralingam, Ananth"'
Author:
Liu, Minghui, Rabbani, Tahseen, O'Halloran, Tony, Sankaralingam, Ananth, Hartley, Mary-Anne, Gravelle, Brian, Huang, Furong, Fermüller, Cornelia, Aloimonos, Yiannis
Transformer-based large language models (LLMs) use the key-value (KV) cache to significantly accelerate inference by storing the key and value embeddings of past tokens. However, this cache consumes considerable GPU memory. In this work, we introduce … (see the KV-cache sketch after this record).
External link:
http://arxiv.org/abs/2412.16187
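
For context on the mechanism the abstract describes, below is a minimal, illustrative Python sketch of a single-head KV cache during autoregressive decoding. It is not code from the paper: the class ToyKVCache, its random projection weights, and the dimensions are hypothetical, chosen only to show why the cache grows linearly with the number of past tokens and therefore consumes GPU memory.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ToyKVCache:
    """Minimal single-head attention with a KV cache: past key/value
    embeddings are stored so each decoding step projects only the
    newest token instead of recomputing K and V for the whole prefix."""

    def __init__(self, d_model: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random projections stand in for trained weight matrices.
        self.W_q = rng.standard_normal((d_model, d_model))
        self.W_k = rng.standard_normal((d_model, d_model))
        self.W_v = rng.standard_normal((d_model, d_model))
        self.keys = []    # cached key rows, one per past token
        self.values = []  # cached value rows, one per past token
        self.scale = np.sqrt(d_model)

    def decode_step(self, x: np.ndarray) -> np.ndarray:
        """Attend from the new token embedding x over all cached tokens."""
        q = x @ self.W_q
        # Only the new token is projected; earlier K/V come from the cache.
        self.keys.append(x @ self.W_k)
        self.values.append(x @ self.W_v)
        K = np.stack(self.keys)    # shape (t, d): grows with sequence length,
        V = np.stack(self.values)  # which is the memory cost the paper targets
        attn = softmax(q @ K.T / self.scale)
        return attn @ V

cache = ToyKVCache(d_model=16)
for token_embedding in np.random.default_rng(1).standard_normal((5, 16)):
    out = cache.decode_step(token_embedding)
print(f"cached tokens: {len(cache.keys)}, output dim: {out.shape}")

The trade-off the sketch makes visible: each step is cheap because K and V for past tokens are reused, but the two cached lists grow by one row per generated token, so for long contexts the cache itself becomes the dominant GPU memory consumer.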