Showing 1 - 1 of 1 for search: '"Kumar, Abhishek Vijaya"'
Modern model serving engines infer prompts on large language models in batches. While batch processing prompts leads to high inference throughput, it delays responding to requests that do not fit in a batch, potentially starving them. We propose that
External link:
http://arxiv.org/abs/2407.21255