Showing 1 - 10 of 930 for search: '"KRISHNAMURTHY, ARVIND"'
Author:
Ye, Zihao, Chen, Lequn, Lai, Ruihang, Lin, Wuwei, Zhang, Yineng, Wang, Stephanie, Chen, Tianqi, Kasikci, Baris, Grover, Vinod, Krishnamurthy, Arvind, Ceze, Luis
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand …
External link:
http://arxiv.org/abs/2501.01005
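The attention mechanism this entry's abstract refers to can be illustrated with a minimal NumPy sketch of scaled dot-product attention; the shapes and names below are illustrative only and are not the paper's kernel API:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Minimal attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # weighted sum of value rows

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Efficient GPU kernels fuse these steps and tile them over memory, but the math they implement is this small.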
Author:
Zhu, Kan, Zhao, Yilong, Zhao, Liangyu, Zuo, Gefei, Gu, Yile, Xie, Dedong, Gao, Yufei, Xu, Qinyu, Tang, Tian, Ye, Zihao, Kamahori, Keisuke, Lin, Chien-Yu, Wang, Stephanie, Krishnamurthy, Arvind, Kasikci, Baris
The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency …
External link:
http://arxiv.org/abs/2408.12757
Author:
Xu, Xieyang, Yuan, Yifei, Kincaid, Zachary, Krishnamurthy, Arvind, Mahajan, Ratul, Walker, David, Zhai, Ennan
Relational network verification is a new approach to validating network changes. In contrast to traditional network verification, which analyzes specifications for a single network snapshot, relational network verification analyzes specifications …
External link:
http://arxiv.org/abs/2403.17277
Load balancers are pervasively used inside today's clouds to scalably distribute network requests across data center servers. Given the extensive use of load balancers and their associated operating costs, several efforts have focused on improving …
External link:
http://arxiv.org/abs/2403.11411
Author:
Zhao, Liangyu, Maleki, Saeed, Shah, Aashaka, Yang, Ziyue, Pourreza, Hossein, Krishnamurthy, Arvind
As modern DNN models grow ever larger, collective communications between the accelerators (allreduce, etc.) emerge as a significant performance bottleneck. Designing efficient communication schedules is challenging, given today's highly diverse and …
External link:
http://arxiv.org/abs/2402.06787
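For context on the entry above: allreduce combines one buffer per accelerator into their elementwise sum on every accelerator. A toy single-process simulation of the classic ring schedule is sketched below; it is purely illustrative and is not the paper's schedule-synthesis method:

```python
import copy

def ring_allreduce(inputs):
    """inputs: n lists of n values (chunk c held by rank r).
    Returns n buffers, each the elementwise sum, via the ring schedule."""
    n = len(inputs)
    buf = copy.deepcopy(inputs)
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, buf[r][(r - step) % n]) for r in range(n)]
        for r, c, val in sends:          # each rank adds into its ring successor
            buf[(r + 1) % n][c] += val
    # Allgather: circulate each fully reduced chunk around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, buf[r][(r + 1 - step) % n]) for r in range(n)]
        for r, c, val in sends:
            buf[(r + 1) % n][c] = val
    return buf

inputs = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
out = ring_allreduce(inputs)
print(out[0])  # [111, 222, 333]
```

Each rank talks only to its ring neighbor and moves about 2(n-1)/n of the data per link; choosing such schedules well on diverse real topologies is exactly the hard part the abstract points at.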
Author:
Narayan, Akshay, Panda, Aurojit, Alizadeh, Mohammad, Balakrishnan, Hari, Krishnamurthy, Arvind, Shenker, Scott
Reconfiguring the network stack allows applications to specialize the implementations of communication libraries depending on where they run, the requests they serve, and the performance they need to provide. Specializing applications in this way is …
External link:
http://arxiv.org/abs/2311.07753
Author:
Zhao, Yilong, Lin, Chien-Yu, Zhu, Kan, Ye, Zihao, Chen, Lequn, Zheng, Size, Ceze, Luis, Krishnamurthy, Arvind, Chen, Tianqi, Kasikci, Baris
The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput …
External link:
http://arxiv.org/abs/2310.19102
Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows …
External link:
http://arxiv.org/abs/2310.18547
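As background for the entry above: LoRA freezes a pretrained weight matrix W and learns a low-rank update BA on top of it. A minimal NumPy sketch follows; the names and the alpha/r scaling convention come from the original LoRA formulation, not from Punica's batched kernels:

```python
import numpy as np

class LoRALinear:
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trained."""
    def __init__(self, weight, r=4, alpha=8, rng=None):
        rng = rng or np.random.default_rng(0)
        out_dim, in_dim = weight.shape
        self.weight = weight                               # frozen pretrained matrix
        self.scale = alpha / r
        self.A = rng.standard_normal((r, in_dim)) * 0.01   # trainable down-projection
        self.B = np.zeros((out_dim, r))                    # trainable up-projection, zero-init

    def __call__(self, x):
        base = x @ self.weight.T
        delta = (x @ self.A.T) @ self.B.T                  # rank-r update; BA never materialized
        return base + self.scale * delta

W = np.eye(6)
layer = LoRALinear(W, r=2)
x = np.ones((1, 6))
print(layer(x))  # B starts at zero, so the output initially equals the base layer's
```

Serving many such adapters is cheap in weights (only A and B per tenant), but sharing one base model across tenants efficiently requires batched GPU kernels, which is the problem Punica targets.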
Author:
Basu, Prithwish, Zhao, Liangyu, Fantl, Jason, Pal, Siddharth, Krishnamurthy, Arvind, Khoury, Joud
The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a particularly challenging …
External link:
http://arxiv.org/abs/2309.13541
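To make the primitive in the entry above concrete: in all-to-all, rank i delivers its j-th chunk to rank j, so the result is a transpose of the chunk matrix. A tiny single-process model (illustrative only, not an HPC implementation):

```python
def all_to_all(chunks):
    """chunks[i][j] is the data rank i wants to deliver to rank j.
    Returns received, where received[j][i] is what rank j got from rank i."""
    n = len(chunks)
    return [[chunks[i][j] for i in range(n)] for j in range(n)]

sent = [[f"r{i}->r{j}" for j in range(3)] for i in range(3)]
received = all_to_all(sent)
print(received[2])  # everything addressed to rank 2
```

Unlike allreduce, every pair of ranks exchanges distinct data, so the traffic pattern is dense and bandwidth-bound, which is part of why scheduling all-to-all on real topologies is hard.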
Secure container runtimes serve as the foundational layer for creating and running containers, which is the bedrock of emerging computing paradigms like microservices and serverless computing. Although existing secure container runtimes indeed enhance …
External link:
http://arxiv.org/abs/2309.12624