Showing 1 - 10 of 47
for the search query: '"Cai, Tianle"'
Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhib…
External link:
http://arxiv.org/abs/2408.14690
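The idea in the snippet above can be illustrated with a toy NumPy sketch: when the activation vector is sparse, a matrix-vector product only needs the weight columns paired with nonzero activations. This is purely illustrative (the names are hypothetical, and real systems exploit sparsity in GPU kernels, not with fancy indexing):

```python
import numpy as np

def sparse_matvec(W, x):
    """Toy activation-sparsity matvec: touch only the columns of W
    that are paired with nonzero entries of the activation vector x."""
    nz = np.nonzero(x)[0]        # indices of active neurons
    return W[:, nz] @ x[nz]      # skips columns multiplied by zero

W = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 0.0, 0.0, 2.0])   # 50% of activations are zero
dense = W @ x                        # full computation
sparse = sparse_matvec(W, x)         # same result, half the columns
```

With 50% sparsity the sparse path reads half the weight columns, which is where the compute and memory-movement savings come from.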
Author:
Li, Junyan, Chen, Delin, Cai, Tianle, Chen, Peihao, Hong, Yining, Chen, Zhenfang, Shen, Yikang, Gan, Chuang
Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively take all these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAt…
External link:
http://arxiv.org/abs/2407.20228
Author:
Li, Andrew, Feng, Xianle, Narang, Siddhant, Peng, Austin, Cai, Tianle, Shah, Raj Sanjay, Varma, Sashank
When reading temporarily ambiguous garden-path sentences, misinterpretations sometimes linger past the point of disambiguation. This phenomenon has traditionally been studied in psycholinguistic experiments using online measures such as reading times…
External link:
http://arxiv.org/abs/2405.16042
Author:
Li, Yuhong, Huang, Yingbing, Yang, Bowen, Venkitesh, Bharat, Locatelli, Acyr, Ye, Hanchen, Cai, Tianle, Lewis, Patrick, Chen, Deming
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length…
External link:
http://arxiv.org/abs/2404.14469
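The KV-cache growth mentioned in the snippet above can be sketched in a few lines: each decoding step appends one key and one value vector per layer, so memory grows linearly with sequence length. This is a minimal sketch with hypothetical shapes (real caches also carry batch, layer, and head dimensions):

```python
import numpy as np

class KVCache:
    """Minimal per-layer Key-Value cache: one (key, value) pair of
    vectors is appended per generated token."""
    def __init__(self, head_dim):
        self.keys, self.values = [], []
        self.head_dim = head_dim

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def size_floats(self):
        # 2 tensors (K and V) * seq_len * head_dim
        return 2 * len(self.keys) * self.head_dim

cache = KVCache(head_dim=64)
for _ in range(128):                         # decode 128 tokens
    cache.append(np.zeros(64), np.zeros(64))
total = cache.size_floats()                  # 2 * 128 * 64 = 16384 floats
```

Doubling the generated length doubles `total`, which is why long inputs make the KV cache a dominant memory cost.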
Large Language Models (LLMs) have achieved remarkable results, but their increasing resource demand has become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM traine…
External link:
http://arxiv.org/abs/2404.07413
Author:
Zhao, Yiran, Zheng, Wenyue, Cai, Tianle, Do, Xuan Long, Kawaguchi, Kenji, Goyal, Anirudh, Shieh, Michael
Safety of Large Language Models (LLMs) has become a critical issue given their rapid progress. Greedy Coordinate Gradient (GCG) is shown to be effective in constructing adversarial prompts to break aligned LLMs, but optimization of GCG is time-…
External link:
http://arxiv.org/abs/2403.01251
Author:
Li, Muyang, Cai, Tianle, Cao, Jiaxin, Zhang, Qinsheng, Cai, Han, Bai, Junjie, Jia, Yangqing, Liu, Ming-Yu, Li, Kai, Han, Song
Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive latency for in…
External link:
http://arxiv.org/abs/2402.19481
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it is intuitive to assume that fine-tuning ad…
External link:
http://arxiv.org/abs/2402.10193
Author:
Cai, Tianle, Li, Yuhong, Geng, Zhengyang, Peng, Hongwu, Lee, Jason D., Chen, Deming, Dao, Tri
Large Language Models (LLMs) employ auto-regressive decoding, which requires sequential computation, with each step dependent on the previous one's output. This creates a bottleneck, as each step necessitates moving the full model parameters from High-Ban…
External link:
http://arxiv.org/abs/2401.10774
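The sequential dependence described in the snippet above can be illustrated with a toy decoding loop, where the `next_token` callable is a hypothetical stand-in for a full model forward pass (a sketch only, not this paper's method):

```python
def greedy_decode(next_token, prompt, n_steps):
    """Auto-regressive decoding sketch: each step consumes the full
    sequence so far and emits one token, so the n_steps calls cannot
    be run in parallel — step t needs step t-1's output."""
    seq = list(prompt)
    for _ in range(n_steps):
        seq.append(next_token(seq))   # sequential dependency
    return seq

# toy "model": next token is the sum of the last two tokens, mod 10
out = greedy_decode(lambda s: (s[-1] + s[-2]) % 10, [1, 1], 5)
```

Because every iteration re-reads the model state, each of the `n_steps` calls pays the full parameter-movement cost, which is the bottleneck the abstract refers to.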
We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain…
External link:
http://arxiv.org/abs/2311.08252
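The retrieval idea behind the snippet above can be sketched in miniature: find the longest suffix of the current context that also occurs in a reference token sequence, and propose the tokens that followed it there as a draft. This is a toy illustration only; REST's actual datastore construction and the verification of drafts by the target model are omitted, and all names here are hypothetical:

```python
def retrieve_draft(datastore, context, max_suffix=4, draft_len=3):
    """Find the longest suffix of `context` (up to max_suffix tokens)
    that occurs in `datastore`, and return the next draft_len tokens
    that followed it there as a speculative draft."""
    for n in range(min(max_suffix, len(context)), 0, -1):
        suffix = context[-n:]
        for i in range(len(datastore) - n + 1):
            if datastore[i:i + n] == suffix:
                return datastore[i + n:i + n + draft_len]
    return []   # no match: nothing to speculate

store = list("the cat sat on the mat")
draft = retrieve_draft(store, list("on the"))
# drafts the characters that followed " the" in the datastore: " ma"
```

The drafted tokens would then be checked in a single forward pass of the target model, accepting the matching prefix, which is where the speedup over token-by-token decoding comes from.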