Showing 1 - 1 of 1 for search: '"Kim, Hongbeen"'
Recent large language models (LLMs) with enormous model sizes use many GPUs to meet memory capacity requirements, incurring substantial costs for token generation. To provide cost-effective LLM inference with relaxed latency constraints, extensive res…
External link:
http://arxiv.org/abs/2501.01792