FA2

Authors: Kamran Razavi, Manisha Luthra, Boris Koldehofe, Max Muhlhauser, Lin Wang
Contributors: Distributed Systems, Computer Systems, Network Institute
Language: English
Publication year: 2022
Subject:
Source: Razavi, K, Luthra, M, Koldehofe, B, Muhlhauser, M & Wang, L 2022, FA2: Fast, Accurate Autoscaling for Serving Deep Learning Inference with SLA Guarantees. In 2022 IEEE 28th Real-Time and Embedded Technology and Applications Symposium (RTAS): [Proceedings]. Proceedings of the IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS, vol. 2022-May, Institute of Electrical and Electronics Engineers Inc., pp. 146-159, 28th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2022, Milan, Italy, 4/05/22. https://doi.org/10.1109/RTAS54340.2022.00020
DOI: 10.1109/RTAS54340.2022.00020
Description: Deep learning (DL) inference has become an essential building block in modern intelligent applications. Due to the high computational intensity of DL, it is critical to scale DL inference serving systems in response to fluctuating workloads to achieve resource efficiency. Meanwhile, intelligent applications often require strict service level agreements (SLAs), which need to be guaranteed when the system is scaled. The problem is complex and has been tackled only in simple scenarios so far. This paper describes FA2, a fast and accurate autoscaler concept for DL inference serving systems. In contrast to related works, FA2 adopts a general, contrived two-phase approach. Specifically, it starts by capturing the autoscaling challenges in a comprehensive graph-based model. Then, FA2 applies targeted graph transformation and makes autoscaling decisions with an efficient algorithm based on dynamic programming. We implemented FA2 and built and evaluated a prototype. Compared with state-of-the-art autoscaling solutions, our experiments showed FA2 to achieve significant resource reduction (19% under CPUs and 25% under GPUs, on average) in combination with low SLA violations (less than 1.5%). FA2 performed close to the theoretical optimum, matching exactly the optimal decisions (with the least required resources) in 96.8% of all the cases in our evaluation.
Database: OpenAIRE