Showing 1 - 10 of 805 for search: '"Kim Sehoon"'
Author:
Moon, Suhong, Jha, Siddharth, Erdogan, Lutfi Eren, Kim, Sehoon, Lim, Woosang, Keutzer, Kurt, Gholami, Amir
Recent advancements in function calling and tool use have significantly enhanced the capabilities of large language models (LLMs) by enabling them to interact with external information sources and execute complex tasks. However, the limited context window…
External link:
http://arxiv.org/abs/2409.02141
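The function-calling pattern this abstract refers to reduces to the model emitting a structured call that a runtime dispatches to real code. A minimal sketch in Python; the tool name, registry, and JSON shape are illustrative assumptions, not this paper's interface:

import json

# Hypothetical tool and registry; names are illustrative only.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub standing in for a real API call

TOOLS = {"get_weather": get_weather}

def run_tool_call(model_output: str) -> str:
    """Dispatch a model-emitted call of the form
    {"name": ..., "arguments": {...}} and return the tool's result."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])

# A function-calling model would emit something like this string:
print(run_tool_call('{"name": "get_weather", "arguments": {"city": "Berlin"}}'))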
Author:
Erdogan, Lutfi Eren, Lee, Nicholas, Jha, Siddharth, Kim, Sehoon, Tabrizi, Ryan, Moon, Suhong, Hooper, Coleman, Anumanchipalli, Gopala, Keutzer, Kurt, Gholami, Amir
Recent large language models (LLMs) have enabled the development of advanced agentic systems that can integrate various tools and APIs to fulfill user queries through function calling. However, the deployment of these LLMs on the edge has not been explored…
External link:
http://arxiv.org/abs/2409.00608
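In an agentic system like the one described, the tools and APIs are typically exposed to the model as declarative schemas it can select from and fill in. A hedged sketch of that wiring; the schema loosely follows the common OpenAI-style convention and the calendar tool is made up:

import json

# Illustrative only: a tool schema an agentic system might hand to a
# small on-device model so it can pick a function and fill its arguments.
TOOL_SCHEMAS = [{
    "name": "create_calendar_event",
    "description": "Add an event to the user's calendar.",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "start": {"type": "string", "description": "ISO 8601 start time"},
        },
        "required": ["title", "start"],
    },
}]

def build_prompt(query: str) -> str:
    # Prepend the available tool schemas so the model can select and fill one.
    return f"Tools:\n{json.dumps(TOOL_SCHEMAS, indent=2)}\n\nUser: {query}"

print(build_prompt("Schedule lunch with Anna tomorrow at noon"))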
Long context inference presents challenges at the system level with increased compute and memory requirements, as well as from an accuracy perspective in being able to reason over long contexts. Recently, several methods have been proposed to compress…
External link:
http://arxiv.org/abs/2407.08892
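As a rough illustration of what "compressing the context" means at the interface level, here is a naive extractive baseline in Python; the methods this paper actually characterizes are far more sophisticated:

def compress_context(context: str, query: str, keep: int = 3) -> str:
    """Naive extractive compression: keep the `keep` sentences that share
    the most words with the query, preserving their original order."""
    qwords = set(query.lower().split())
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    top = sorted(sentences, key=lambda s: -len(qwords & set(s.lower().split())))[:keep]
    kept = set(top)
    return ". ".join(s for s in sentences if s in kept) + "."

doc = "The cat sat. GPUs have limited memory. KV caches grow with context. Paris is in France."
print(compress_context(doc, "How does memory grow with context length?", keep=2))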
Author:
Lee, Nicholas, Wattanawong, Thanakul, Kim, Sehoon, Mangalam, Karttikeya, Shen, Sheng, Anumanchipalli, Gopala, Mahoney, Michael W., Keutzer, Kurt, Gholami, Amir
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many…
External link:
http://arxiv.org/abs/2403.15042
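Work in this vein addresses the low-data fine-tuning regime with iterative, targeted data augmentation. A schematic loop under that assumption, with student_eval and teacher_generate as hypothetical callables standing in for a fine-tuned student model and a stronger teacher:

def iterative_augmentation(train_set, student_eval, teacher_generate, rounds=3):
    """Schematic loop: after each round, synthesize new training examples
    targeted at the cases the student still gets wrong."""
    data = list(train_set)
    for _ in range(rounds):
        hard = [ex for ex in data if not student_eval(ex)]
        if not hard:
            break
        data.extend(teacher_generate(ex) for ex in hard)
    return data

# Toy run: the "student" fails on short strings; the "teacher" pads them.
grown = iterative_augmentation(
    ["ab", "abcdef"],
    student_eval=lambda ex: len(ex) > 3,
    teacher_generate=lambda ex: ex * 2,
)
print(grown)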
The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly…
External link:
http://arxiv.org/abs/2403.14123
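A back-of-envelope calculation shows why memory bandwidth, not compute, bounds autoregressive decoding; all hardware numbers below are illustrative round figures, not measurements from the paper:

# In decoding, every weight is read once per generated token, so token
# time is bounded by how fast the weights can be streamed from memory.
params = 7e9            # 7B-parameter model
bytes_per_param = 2     # fp16 weights
bandwidth = 2e12        # ~2 TB/s HBM on a modern datacenter GPU (illustrative)
peak_flops = 300e12     # ~300 fp16 TFLOP/s (illustrative)

t_memory = params * bytes_per_param / bandwidth   # time to stream the weights
t_compute = 2 * params / peak_flops               # ~2 FLOPs per weight per token
print(f"memory: {t_memory*1e3:.2f} ms/token, compute: {t_compute*1e3:.3f} ms/token")
# memory: 7.00 ms/token vs compute: 0.047 ms/token, i.e. bandwidth-bound by ~150x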
Author:
Hooper, Coleman, Kim, Sehoon, Mohammadzadeh, Hiva, Mahoney, Michael W., Shao, Yakun Sophia, Keutzer, Kurt, Gholami, Amir
LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference…
External link:
http://arxiv.org/abs/2401.18079
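The memory claim is easy to check: KV cache size grows linearly with context length while the weights stay fixed. A quick calculation with LLaMA-7B-like shapes (illustrative numbers, not taken from the paper):

layers, kv_heads, head_dim = 32, 32, 128
seq_len, bytes_fp16 = 128_000, 2

# keys + values, per layer, per position
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16
print(f"fp16 KV cache at 128k tokens: {kv_bytes / 2**30:.1f} GiB")   # ~62.5 GiB
print(f"same cache at 3 bits:         {kv_bytes * 3/16 / 2**30:.1f} GiB")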
Many applications must provide low-latency LLM service to users or risk unacceptable user experience. However, over-provisioning resources to serve fluctuating request patterns is often prohibitively expensive. In this work, we present a best-effort…
External link:
http://arxiv.org/abs/2401.07886
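The best-effort idea, in caricature: when the backlog would violate the latency target, degrade service instead of over-provisioning. A toy admission policy; the thresholds and the small-model fallback are assumptions, not this paper's system:

def admit(queue_len: int, slo_ms: float, est_ms_per_req: float) -> str:
    """Toy best-effort admission: if the projected wait would blow the
    latency target, degrade rather than keep spare GPUs warm."""
    projected = queue_len * est_ms_per_req
    if projected <= slo_ms:
        return "full_model"
    if projected <= 4 * slo_ms:
        return "small_model"    # degraded but still served
    return "reject"

for q in (2, 10, 40):
    print(q, "->", admit(queue_len=q, slo_ms=200, est_ms_per_req=50))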
Author:
Kim, Sehoon, Moon, Suhong, Tabrizi, Ryan, Lee, Nicholas, Mahoney, Michael W., Keutzer, Kurt, Gholami, Amir
The reasoning capabilities of the recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs…
External link:
http://arxiv.org/abs/2312.04511
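Once a model can emit several independent function calls, they need not execute one LLM round-trip at a time. A sketch of parallel dispatch over stub tools; the planning step that identifies which calls are independent is omitted here:

from concurrent.futures import ThreadPoolExecutor

# Stub tools; a real planner would first check the calls are independent.
def search(q): return f"results for {q!r}"
def calculator(expr): return eval(expr, {"__builtins__": {}})  # toy arithmetic only

plan = [(search, ("weather in Prague",)), (calculator, ("2 * 21",))]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fn, *args) for fn, args in plan]
    print([f.result() for f in futures])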
Author:
Hooper, Coleman, Kim, Sehoon, Mohammadzadeh, Hiva, Genc, Hasan, Keutzer, Kurt, Gholami, Amir, Shao, Sophia
Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been…
External link:
http://arxiv.org/abs/2310.12072
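One common family of latency fixes here is speculation: draft tokens cheaply, then verify them in a single pass of the full model. The sketch below is generic draft-then-verify decoding with toy stand-ins for both models; this paper's pipelined scheme differs in its details:

# Toy stand-ins: the draft proposes cheaply; the target verifies a whole
# block in one pass and "disagrees" on multiples of 5.
def draft_model(ctx):
    return ctx[-1] + 1

def target_model(ctx, proposal):
    return [t if t % 5 else t + 1 for t in proposal]

def speculative_decode(tokens, k=4, max_new=8):
    """Accept the prefix where draft and target agree; at the first
    mismatch, take the target's token instead."""
    while max_new > 0:
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        verified = target_model(tokens, draft)
        n = next((i for i, (d, v) in enumerate(zip(draft, verified)) if d != v), k)
        tokens = tokens + draft[:n] + (verified[n:n + 1] if n < k else [])
        max_new -= n + 1
    return tokens

print(speculative_decode([1]))  # several tokens accepted per target pass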
Author:
Kim, Sehoon, Hooper, Coleman, Gholami, Amir, Dong, Zhen, Li, Xiuyu, Shen, Sheng, Mahoney, Michael W., Keutzer, Kurt
Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing…
External link:
http://arxiv.org/abs/2306.07629
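A standard answer to the resource problem is low-bit weight quantization with outliers carried separately. A minimal dense-and-sparse decomposition sketch; note that the paper pairs this with sensitivity-based non-uniform codebooks, while the uniform rounding below only keeps the example short:

import numpy as np

def dense_and_sparse(w, bits=3, outlier_frac=0.005):
    """Pull the largest-magnitude weights into a sparse full-precision part
    and uniformly quantize the dense remainder."""
    flat = np.asarray(w, np.float32).ravel()
    k = max(1, int(outlier_frac * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # outlier positions
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    dense = flat - sparse
    scale = np.abs(dense).max() / (2 ** (bits - 1) - 1)
    q = np.round(dense / scale).astype(np.int8)    # 3-bit codes stored in int8
    return q, scale, sparse.reshape(np.shape(w))

w = np.random.randn(64, 64).astype(np.float32)
q, scale, sparse = dense_and_sparse(w)
recon = (q.astype(np.float32) * scale).reshape(w.shape) + sparse
print("max abs reconstruction error:", float(np.abs(recon - w).max()))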