Showing 1 - 10 of 780
for search: '"Kim, Sehoon"'
Long context inference presents challenges at the system level with increased compute and memory requirements, as well as from an accuracy perspective in being able to reason over long contexts. Recently, several methods have been proposed to compres…
External link:
http://arxiv.org/abs/2407.08892
Author:
Lee, Nicholas, Wattanawong, Thanakul, Kim, Sehoon, Mangalam, Karttikeya, Shen, Sheng, Anumanchipalli, Gopala, Mahoney, Michael W., Keutzer, Kurt, Gholami, Amir
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many…
External link:
http://arxiv.org/abs/2403.15042
The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increas…
External link:
http://arxiv.org/abs/2403.14123
Author:
Hooper, Coleman, Kim, Sehoon, Mohammadzadeh, Hiva, Mahoney, Michael W., Shao, Yakun Sophia, Keutzer, Kurt, Gholami, Amir
LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during i…
External link:
http://arxiv.org/abs/2401.18079
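As rough context for the claim above that KV cache activations dominate memory at long context lengths, the sketch below estimates KV cache size with a back-of-the-envelope formula. The model configuration (32 layers, 32 KV heads, head dimension 128, fp16) is a hypothetical LLaMA-7B-like example chosen for illustration, not a figure taken from the paper.

```python
# Back-of-the-envelope KV cache size for a hypothetical transformer.
# Numbers are illustrative assumptions, not values from the cited work.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    # Factor of 2 accounts for storing both keys and values;
    # bytes_per_elem=2 assumes fp16 activations.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# A LLaMA-7B-like config at a 128k-token context:
gb = kv_cache_bytes(32, 32, 128, seq_len=128_000) / 1e9
print(f"{gb:.0f} GB")  # -> 67 GB, far exceeding the model weights in fp16
```

The linear growth in `seq_len` is why long-context serving work (such as the KV cache quantization described in this entry) targets these activations rather than the weights.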
Many applications must provide low-latency LLM service to users or risk unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort…
External link:
http://arxiv.org/abs/2401.07886
Author:
Kim, Sehoon, Moon, Suhong, Tabrizi, Ryan, Lee, Nicholas, Mahoney, Michael W., Keutzer, Kurt, Gholami, Amir
The reasoning capabilities of the recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LL…
External link:
http://arxiv.org/abs/2312.04511
Author:
Hooper, Coleman, Kim, Sehoon, Mohammadzadeh, Hiva, Genc, Hasan, Keutzer, Kurt, Gholami, Amir, Shao, Sophia
Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been…
External link:
http://arxiv.org/abs/2310.12072
Author:
Kim, Sehoon, Hooper, Coleman, Gholami, Amir, Dong, Zhen, Li, Xiuyu, Shen, Sheng, Mahoney, Michael W., Keutzer, Kurt
Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced e…
External link:
http://arxiv.org/abs/2306.07629
Author:
Kim, Sehoon, Hooper, Coleman, Wattanawong, Thanakul, Kang, Minwoo, Yan, Ruohan, Genc, Hasan, Dinh, Grace, Huang, Qijing, Keutzer, Kurt, Mahoney, Michael W., Shao, Yakun Sophia, Gholami, Amir
Published in:
Presented in Workshop on Architecture and System Support for Transformer Models (ASSYST) at ISCA 2023
Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Trans…
External link:
http://arxiv.org/abs/2302.14017
Author:
Kim, Sehoon, Mangalam, Karttikeya, Moon, Suhong, Malik, Jitendra, Mahoney, Michael W., Gholami, Amir, Keutzer, Kurt
The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and ma…
External link:
http://arxiv.org/abs/2302.07863