Výsledky vyhledávání

Report

HashAttention: Semantic Sparsity for Faster Inference

Autor: Desai, Aditya, Yang, Shuo, Cuadron, Alejandro, Klimovic, Ana, Zaharia, Matei, Gonzalez, Joseph E., Stoica, Ion

Utilizing longer contexts is increasingly essential to power better AI systems. However, the cost of attending to long contexts is high due to the involved softmax computation. While the scaled dot-product attention (SDPA) exhibits token sparsity, wi

Externí odkaz: http://arxiv.org/abs/2412.14468

Zobrazit plný text záznamu

Report

VisionArena: 230K Real World User-VLM Conversations with Preference Labels

Autor: Chou, Christopher, Dunlap, Lisa, Mashita, Koki, Mandal, Krishna, Darrell, Trevor, Stoica, Ion, Gonzalez, Joseph E., Chiang, Wei-Lin

With the growing adoption and capabilities of vision-language models (VLMs) comes the need for benchmarks that capture authentic user-VLM interactions. In response, we create VisionArena, a dataset of 230K real-world conversations between users and V

Externí odkaz: http://arxiv.org/abs/2412.08687

Zobrazit plný text záznamu

Report

GameArena: Evaluating LLM Reasoning through Live Computer Games

Autor: Hu, Lanxiang, Li, Qiyu, Xie, Anze, Jiang, Nan, Stoica, Ion, Jin, Haojian, Zhang, Hao

Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing benchmarks often depend on static datasets, which are vulnerable to data contamination and may get saturated over time, or on binary live human feedback that

Externí odkaz: http://arxiv.org/abs/2412.06394

Zobrazit plný text záznamu

Report

FogROS2-FT: Fault Tolerant Cloud Robotics

Autor: Chen, Kaiyuan, Hari, Kush, Chung, Trinity, Wang, Michael, Tian, Nan, Juette, Christian, Ichnowski, Jeffrey, Ren, Liu, Kubiatowicz, John, Stoica, Ion, Goldberg, Ken

Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance and ease of management. However, cloud compute can be costly, cloud services can suffer occasional downtime, and connectivity between the robot and

Externí odkaz: http://arxiv.org/abs/2412.05408

Zobrazit plný text záznamu

Report

Specifications: The missing link to making the development of LLM systems an engineering discipline

Autor: Stoica, Ion, Zaharia, Matei, Gonzalez, Joseph, Goldberg, Ken, Sen, Koushik, Zhang, Hao, Angelopoulos, Anastasios, Patil, Shishir G., Chen, Lingjiao, Chiang, Wei-Lin, Davis, Jared Q.

Despite the significant strides made by generative AI in just a few short years, its future progress is constrained by the challenge of building modular and robust systems. This capability has been a cornerstone of past technological revolutions, whi

Externí odkaz: http://arxiv.org/abs/2412.05299

Zobrazit plný text záznamu

Report

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Autor: Zhao, Yilong, Yang, Shuo, Zhu, Kan, Zheng, Lianmin, Kasikci, Baris, Zhou, Yang, Xing, Jiarong, Stoica, Ion

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality make

Externí odkaz: http://arxiv.org/abs/2411.16102

Zobrazit plný text záznamu

Report

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

Autor: Cao, Shiyi, Liu, Shu, Griggs, Tyler, Schafhalter, Peter, Liu, Xiaoxuan, Sheng, Ying, Gonzalez, Joseph E., Zaharia, Matei, Stoica, Ion

Efficient deployment of large language models, particularly Mixture of Experts (MoE), on resource-constrained platforms presents significant challenges, especially in terms of computational efficiency and memory utilization. The MoE architecture, ren

Externí odkaz: http://arxiv.org/abs/2411.11217

Zobrazit plný text záznamu

Report

Pie: Pooling CPU Memory for LLM Inference

Autor: Xu, Yi, Mao, Ziming, Mo, Xiangxi, Liu, Shu, Stoica, Ion

The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memor

Externí odkaz: http://arxiv.org/abs/2411.09317

Zobrazit plný text záznamu

Report

SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

Autor: Mao, Ziming, Xia, Tian, Wu, Zhanghao, Chiang, Wei-Lin, Griggs, Tyler, Bhardwaj, Romil, Yang, Zongheng, Shenker, Scott, Stoica, Ion

Recent years have witnessed an explosive growth of AI models. The high cost of hosting AI services on GPUs and their demanding service requirements, make it timely and challenging to lower service costs and guarantee service quality. While spot insta

Externí odkaz: http://arxiv.org/abs/2411.01438

Zobrazit plný text záznamu

Report

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Autor: Jiang, Xuanlin, Zhou, Yang, Cao, Shiyi, Stoica, Ion, Yu, Minlan

Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on

Externí odkaz: http://arxiv.org/abs/2411.01142

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání