Showing 1 - 10 of 71
for search: '"Miao, Xupeng"'
Author:
Nie, Xiaonan, Liu, Qibin, Fu, Fangcheng, Zhu, Shenhan, Miao, Xupeng, Li, Xiaoyang, Zhang, Yang, Liu, Shouda, Cui, Bin
Larger transformer models always perform better on various tasks but incur greater costs to scale up the model size. To efficiently enlarge models, the mixture-of-experts (MoE) architecture is widely adopted, which consists of a gate network and a ser…
External link:
http://arxiv.org/abs/2411.08446
Author:
Wang, Yujie, Zhu, Shenhan, Fu, Fangcheng, Miao, Xupeng, Zhang, Jie, Zhu, Juan, Hong, Fan, Li, Yong, Cui, Bin
Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with the unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-…
External link:
http://arxiv.org/abs/2409.03365
This paper presents techniques for theoretically and practically efficient and scalable Schr\"odinger-style quantum circuit simulation. Our approach partitions a quantum circuit into a hierarchy of subcircuits and simulates the subcircuits on multi-n…
External link:
http://arxiv.org/abs/2408.09055
Author:
Zhang, Hailin, Ji, Xiaodong, Chen, Yilin, Fu, Fangcheng, Miao, Xupeng, Nie, Xiaonan, Chen, Weipeng, Cui, Bin
As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), a crucial component in LLM inference, has now become the primary memory bottleneck due to limited GPU memory…
External link:
http://arxiv.org/abs/2407.12820
Author:
Jeon, Byungsoo, Wu, Mengdi, Cao, Shiyi, Kim, Sunghyun, Park, Sunghyun, Aggarwal, Neeraj, Unger, Colin, Arfeen, Daiyaan, Liao, Peiyuan, Miao, Xupeng, Alizadeh, Mohammad, Ganger, Gregory R., Chen, Tianqi, Jia, Zhihao
Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages…
External link:
http://arxiv.org/abs/2406.17145
Author:
Hu, Muyan, Venkatram, Ashwin, Biswas, Shreyashri, Marimuthu, Balamurugan, Hou, Bohan, Oliaro, Gabriele, Wang, Haojie, Zheng, Liyan, Miao, Xupeng, Zhai, Jidong
Published in:
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 3 (2024) 755-769
Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying…
External link:
http://arxiv.org/abs/2406.09465
This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and net…
External link:
http://arxiv.org/abs/2406.01566
Author:
Duan, Jiangfei, Song, Ziang, Miao, Xupeng, Xi, Xiaoli, Lin, Dahua, Xu, Harry, Zhang, Minjia, Jia, Zhihao
Deep neural networks (DNNs) are becoming progressively large and costly to train. This paper aims to reduce DNN training costs by leveraging preemptible instances on modern clouds, which can be allocated at a much lower price when idle but may be preempted…
External link:
http://arxiv.org/abs/2403.14097
Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because e…
External link:
http://arxiv.org/abs/2402.18789
Author:
Yuan, Peiwen, Wang, Xinglin, Feng, Shaoxiong, Pan, Boyuan, Li, Yiwei, Wang, Heda, Miao, Xupeng, Li, Kan
Published in:
EACL 2024 main
Generative Retrieval (GR), autoregressively decoding relevant document identifiers given a query, has been shown to perform well under the setting of small-scale corpora. By memorizing the document corpus with model parameters, GR implicitly achieves…
External link:
http://arxiv.org/abs/2401.10487