Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU

Authors: Lingnan Xia, Hua Ma
Language: English
Year of publication: 2024
Subject:
Source: IEEE Access, Vol 12, Pp 160441-160449 (2024)
Document type: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3483250
Description: Low-Rank Adaptation (LoRA) has attracted growing attention as an effective way to fine-tune large language models (LLMs) with limited resources. However, conventional approaches that serve multiple LoRA models independently lead to redundant computation and suboptimal GPU utilization. This study addresses these inefficiencies with Dynamic Operator Optimization, an automated optimization method that dynamically tunes the Segmented Gather Matrix-Vector Multiplication (SGMV) operator for specific contexts. SGMV's design allows GPU operations for different LoRA models to be batched, which notably improves computational efficiency. The approach uses a Search Space Constructor to build a structured search space, dividing the program space into high-level structural outlines and low-level implementation details so that operator implementations are varied and adaptable. An Optimization Engine then refines these implementations through evolutionary search guided by a performance-estimation cost model. This iterative optimization lets SGMV implementations adapt to varying scenarios while maintaining high performance. The results show that the design can raise throughput by up to 1.46 times in state-of-the-art multi-tenant LoRA deployments.
Database: Directory of Open Access Journals
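
Illustrative note: the description says SGMV batches GPU operations for different LoRA models. The record does not give the operator's definition, so the following is only a minimal CPU reference sketch, assuming the commonly described semantics: requests in a batch are grouped into contiguous segments, and each segment is multiplied by its own adapter weights. The names `sgmv_reference`, `lora_delta`, and `seg` are hypothetical and not taken from the paper, and nothing here reflects the paper's tuned GPU kernels.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product: a (n x k) times b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def sgmv_reference(x: Matrix, weights: Sequence[Matrix],
                   seg: Sequence[int]) -> Matrix:
    """Multiply rows seg[i] .. seg[i+1]-1 of x by weights[i].

    Assumption: `seg` holds len(weights) + 1 increasing offsets with
    seg[0] == 0 and seg[-1] == len(x), so requests that share an
    adapter sit next to each other in the batch.
    """
    if len(seg) != len(weights) + 1 or seg[0] != 0 or seg[-1] != len(x):
        raise ValueError("segment offsets do not cover the batch")
    out: Matrix = []
    for i, w in enumerate(weights):
        out.extend(matmul(x[seg[i]:seg[i + 1]], w))
    return out


def lora_delta(x: Matrix, a_list: Sequence[Matrix],
               b_list: Sequence[Matrix], seg: Sequence[int]) -> Matrix:
    """LoRA update for a mixed batch: project down with A, then up with B."""
    return sgmv_reference(sgmv_reference(x, a_list, seg), b_list, seg)


# Three requests: the first two use adapter 0, the last uses adapter 1.
x = [[1, 2], [3, 4], [5, 6]]
a_list = [[[1], [0]], [[0], [1]]]   # two rank-1 "A" matrices (2 x 1)
b_list = [[[1, 1]], [[2, 0]]]       # two rank-1 "B" matrices (1 x 2)
delta = lora_delta(x, a_list, b_list, seg=[0, 2, 3])
```

In a real serving system this loop over segments would be a single GPU kernel launch, and choices such as tiling and thread mapping are the kind of implementation detail that the described search would tune per scenario.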