Showing 1 - 10 of 930 for search: '"KRISHNAMURTHY, ARVIND"'
Author:
Ye, Zihao, Chen, Lequn, Lai, Ruihang, Lin, Wuwei, Zhang, Yineng, Wang, Stephanie, Chen, Tianqi, Kasikci, Baris, Grover, Vinod, Krishnamurthy, Arvind, Ceze, Luis
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand …
External link:
http://arxiv.org/abs/2501.01005
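The attention mechanism this entry's abstract refers to can be illustrated with a minimal NumPy sketch of scaled dot-product attention; the shapes and names below are illustrative only and are not the paper's kernel API:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Minimal attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # weighted sum of value rows

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Efficient GPU kernels fuse these steps and tile them over memory, but the math they implement is this small.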
Author:
Zhu, Kan, Zhao, Yilong, Zhao, Liangyu, Zuo, Gefei, Gu, Yile, Xie, Dedong, Gao, Yufei, Xu, Qinyu, Tang, Tian, Ye, Zihao, Kamahori, Keisuke, Lin, Chien-Yu, Wang, Stephanie, Krishnamurthy, Arvind, Kasikci, Baris
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency …
External link:
http://arxiv.org/abs/2408.12757
Author:
Xu, Xieyang, Yuan, Yifei, Kincaid, Zachary, Krishnamurthy, Arvind, Mahajan, Ratul, Walker, David, Zhai, Ennan
Relational network verification is a new approach to validating network changes. In contrast to traditional network verification, which analyzes specifications for a single network snapshot, relational network verification analyzes specifications …
External link:
http://arxiv.org/abs/2403.17277
Load balancers are pervasively used inside today's clouds to scalably distribute network requests across data center servers. Given the extensive use of load balancers and their associated operating costs, several efforts have focused on improving …
External link:
http://arxiv.org/abs/2403.11411
Author:
Zhao, Liangyu, Maleki, Saeed, Shah, Aashaka, Yang, Ziyue, Pourreza, Hossein, Krishnamurthy, Arvind
As modern DNN models grow ever larger, collective communications between the accelerators (allreduce, etc.) emerge as a significant performance bottleneck. Designing efficient communication schedules is challenging, given today's highly diverse and …
External link:
http://arxiv.org/abs/2402.06787
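For context on the entry above: allreduce combines one buffer per accelerator into their elementwise sum on every accelerator. A toy single-process simulation of the classic ring schedule is sketched below; it is purely illustrative and is not the paper's schedule-synthesis method:

```python
import copy

def ring_allreduce(inputs):
    """inputs: n lists of n values (chunk c held by rank r).
    Returns n buffers, each the elementwise sum, via the ring schedule."""
    n = len(inputs)
    buf = copy.deepcopy(inputs)
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, buf[r][(r - step) % n]) for r in range(n)]
        for r, c, val in sends:          # each rank adds into its ring successor
            buf[(r + 1) % n][c] += val
    # Allgather: circulate each fully reduced chunk around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, buf[r][(r + 1 - step) % n]) for r in range(n)]
        for r, c, val in sends:
            buf[(r + 1) % n][c] = val
    return buf

inputs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
out = ring_allreduce(inputs)
print(out[0])  # [111, 222, 333]
```

Each rank talks only to its ring neighbor and moves about 2(n-1)/n of the data per link; choosing such schedules well on diverse real topologies is exactly the hard part the abstract points at.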
Author:
Narayan, Akshay, Panda, Aurojit, Alizadeh, Mohammad, Balakrishnan, Hari, Krishnamurthy, Arvind, Shenker, Scott
Reconfiguring the network stack allows applications to specialize the implementations of communication libraries depending on where they run, the requests they serve, and the performance they need to provide. Specializing applications in this way is …
External link:
http://arxiv.org/abs/2311.07753
Author:
Zhao, Yilong, Lin, Chien-Yu, Zhu, Kan, Ye, Zihao, Chen, Lequn, Zheng, Size, Ceze, Luis, Krishnamurthy, Arvind, Chen, Tianqi, Kasikci, Baris
The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput …
External link:
http://arxiv.org/abs/2310.19102
Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows …
External link:
http://arxiv.org/abs/2310.18547
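As background for the entry above: LoRA freezes a pretrained weight matrix W and learns a low-rank update BA on top of it. A minimal NumPy sketch follows; the names and the alpha/r scaling convention come from the original LoRA formulation, not from Punica's batched kernels:

```python
import numpy as np

class LoRALinear:
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trained."""
    def __init__(self, weight, r=4, alpha=8, rng=None):
        rng = rng or np.random.default_rng(0)
        out_dim, in_dim = weight.shape
        self.weight = weight                               # frozen pretrained matrix
        self.scale = alpha / r
        self.A = rng.standard_normal((r, in_dim)) * 0.01   # trainable down-projection
        self.B = np.zeros((out_dim, r))                    # trainable up-projection, zero-init

    def __call__(self, x):
        base = x @ self.weight.T
        delta = (x @ self.A.T) @ self.B.T                  # rank-r update; BA never materialized
        return base + self.scale * delta

W = np.eye(6)
layer = LoRALinear(W, r=2)
x = np.ones((1, 6))
print(layer(x))  # B starts at zero, so the output initially equals the base layer's
```

Serving many such adapters is cheap in weights (only A and B per tenant), but sharing one base model across tenants efficiently requires batched GPU kernels, which is the problem Punica targets.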
Author:
Basu, Prithwish, Zhao, Liangyu, Fantl, Jason, Pal, Siddharth, Krishnamurthy, Arvind, Khoury, Joud
The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging …
External link:
http://arxiv.org/abs/2309.13541
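To make the primitive in the entry above concrete: in all-to-all, rank i delivers its j-th chunk to rank j, so the result is a transpose of the chunk matrix. A tiny single-process model (illustrative only, not an HPC implementation):

```python
def all_to_all(chunks):
    """chunks[i][j] is the data rank i wants to deliver to rank j.
    Returns received, where received[j][i] is what rank j got from rank i."""
    n = len(chunks)
    return [[chunks[i][j] for i in range(n)] for j in range(n)]

sent = [[f"r{i}->r{j}" for j in range(3)] for i in range(3)]
received = all_to_all(sent)
print(received[2])  # everything addressed to rank 2
```

Unlike allreduce, every pair of ranks exchanges distinct data, so the traffic pattern is dense and bandwidth-bound, which is part of why scheduling all-to-all on real topologies is hard.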
Secure container runtimes serve as the foundational layer for creating and running containers, which is the bedrock of emerging computing paradigms like microservices and serverless computing. Although existing secure container runtimes indeed enhance …
External link:
http://arxiv.org/abs/2309.12624