Showing 1 - 10 of 461 for search: '"STOICA, ION"'
Author:
Desai, Aditya, Yang, Shuo, Cuadron, Alejandro, Klimovic, Ana, Zaharia, Matei, Gonzalez, Joseph E., Stoica, Ion
Utilizing longer contexts is increasingly essential to power better AI systems. However, the cost of attending to long contexts is high due to the involved softmax computation. While the scaled dot-product attention (SDPA) exhibits token sparsity, …
External link:
http://arxiv.org/abs/2412.14468
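As background for the abstract above: scaled dot-product attention computes softmax(QK^T / sqrt(d))V, so every query attends to every key even when most attention weights are near zero. A minimal NumPy sketch with illustrative names (not the paper's code):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # SDPA: softmax(Q K^T / sqrt(d)) V. The (n_q, n_k) score matrix
    # is what makes long contexts expensive, even under token sparsity.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V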
Author:
Chou, Christopher, Dunlap, Lisa, Mashita, Koki, Mandal, Krishna, Darrell, Trevor, Stoica, Ion, Gonzalez, Joseph E., Chiang, Wei-Lin
With the growing adoption and capabilities of vision-language models (VLMs) comes the need for benchmarks that capture authentic user-VLM interactions. In response, we create VisionArena, a dataset of 230K real-world conversations between users and VLMs …
External link:
http://arxiv.org/abs/2412.08687
Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing benchmarks often depend on static datasets, which are vulnerable to data contamination and may get saturated over time, or on binary live human feedback that …
External link:
http://arxiv.org/abs/2412.06394
Author:
Chen, Kaiyuan, Hari, Kush, Chung, Trinity, Wang, Michael, Tian, Nan, Juette, Christian, Ichnowski, Jeffrey, Ren, Liu, Kubiatowicz, John, Stoica, Ion, Goldberg, Ken
Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance and ease of management. However, cloud compute can be costly, cloud services can suffer occasional downtime, and connectivity between the robot and the cloud …
External link:
http://arxiv.org/abs/2412.05408
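One common pattern behind the tradeoff this abstract describes (a sketch under assumed planner interfaces, not necessarily the paper's design) is to race the cloud against a deadline and fall back to a cheaper on-robot model:

import concurrent.futures

# One shared pool so a slow cloud call does not block the control loop.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def plan(observation, cloud_planner, local_planner, timeout_s=0.5):
    # Try the cloud planner first; fall back to the on-robot planner
    # if the cloud is slow or unreachable. Planner names are hypothetical.
    future = _pool.submit(cloud_planner, observation)
    try:
        return future.result(timeout=timeout_s)
    except (concurrent.futures.TimeoutError, OSError):
        return local_planner(observation)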
Author:
Stoica, Ion, Zaharia, Matei, Gonzalez, Joseph, Goldberg, Ken, Sen, Koushik, Zhang, Hao, Angelopoulos, Anastasios, Patil, Shishir G., Chen, Lingjiao, Chiang, Wei-Lin, Davis, Jared Q.
Despite the significant strides made by generative AI in just a few short years, its future progress is constrained by the challenge of building modular and robust systems. This capability has been a cornerstone of past technological revolutions, …
External link:
http://arxiv.org/abs/2412.05299
Author:
Zhao, Yilong, Yang, Shuo, Zhu, Kan, Zheng, Lianmin, Kasikci, Baris, Zhou, Yang, Xing, Jiarong, Stoica, Ion
Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes …
External link:
http://arxiv.org/abs/2411.16102
Author:
Cao, Shiyi, Liu, Shu, Griggs, Tyler, Schafhalter, Peter, Liu, Xiaoxuan, Sheng, Ying, Gonzalez, Joseph E., Zaharia, Matei, Stoica, Ion
Efficient deployment of large language models, particularly Mixture of Experts (MoE), on resource-constrained platforms presents significant challenges, especially in terms of computational efficiency and memory utilization. The MoE architecture, …
External link:
http://arxiv.org/abs/2411.11217
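For context on the tension this abstract names: an MoE layer routes each token to only its top-k experts, so activated compute stays small while total parameters (and thus memory footprint) stay large. A minimal top-k routing sketch in NumPy, with all names illustrative:

import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    # x: (n_tokens, d); gate_w: (d, n_experts); experts: list of
    # per-expert functions, e.g. [lambda v, W=W: v @ W for W in weights].
    logits = x @ gate_w                         # router scores per token
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        probs = np.exp(logits[t, topk[t]])
        probs /= probs.sum()                    # softmax over selected experts
        for p, e in zip(probs, topk[t]):
            out[t] += p * experts[e](x[t])      # only k experts run per token
    return out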
The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory …
External link:
http://arxiv.org/abs/2411.09317
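The generic spill pattern this abstract alludes to, sketched in PyTorch (not the paper's system): keep tensors in pinned CPU memory and prefetch them to the GPU with an asynchronous copy on a side stream, overlapping the transfer with compute.

import torch

if torch.cuda.is_available():
    stream = torch.cuda.Stream()
    w_cpu = torch.randn(4096, 4096).pin_memory()     # pinned host memory
    with torch.cuda.stream(stream):
        # non_blocking copy overlaps with work on the default stream
        w_gpu = w_cpu.to("cuda", non_blocking=True)
    torch.cuda.current_stream().wait_stream(stream)  # sync before first use
    y = w_gpu.sum()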
Author:
Mao, Ziming, Xia, Tian, Wu, Zhanghao, Chiang, Wei-Lin, Griggs, Tyler, Bhardwaj, Romil, Yang, Zongheng, Shenker, Scott, Stoica, Ion
Recent years have witnessed explosive growth of AI models. The high cost of hosting AI services on GPUs and their demanding service requirements make it timely and challenging to lower service costs and guarantee service quality. While spot instances …
External link:
http://arxiv.org/abs/2411.01438
Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on …
External link:
http://arxiv.org/abs/2411.01142
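A generic dynamic-batching loop of the kind such engines rely on (a sketch, not any specific engine's implementation): block for the first request, then fill the batch until it is full or a small deadline expires, and run the model once over the whole batch.

import queue
import threading
import time

def batching_loop(requests, run_model, max_batch=8, max_wait_s=0.01):
    # requests: queue.Queue of incoming requests; run_model: hypothetical
    # function that executes one forward pass over a list of requests.
    while True:
        batch = [requests.get()]                 # block for the first request
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        run_model(batch)                         # one pass amortizes the cost

# Usage (model_fn is hypothetical):
# q = queue.Queue()
# threading.Thread(target=batching_loop, args=(q, model_fn), daemon=True).start()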