Výsledky vyhledávání

Report

Just read twice: closing the recall gap for recurrent language models

Autor: Arora, Simran, Timalsina, Aman, Singhal, Aaryan, Spector, Benjamin, Eyuboglu, Sabri, Zhao, Xinyi, Rao, Ashish, Rudra, Atri, Ré, Christopher

Recurrent large language models that compete with Transformers in language modeling perplexity are emerging at a rapid rate (e.g., Mamba, RWKV). Excitingly, these architectures use a constant amount of memory during inference. However, due to the lim

Externí odkaz: http://arxiv.org/abs/2407.05483

Zobrazit plný text záznamu

Report

Simple linear attention language models balance the recall-throughput tradeoff

Autor: Arora, Simran, Eyuboglu, Sabri, Zhang, Michael, Timalsina, Aman, Alberti, Silas, Zinsley, Dylan, Zou, James, Rudra, Atri, Ré, Christopher

Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottle-necked during inference by the KV-cache's

Externí odkaz: http://arxiv.org/abs/2402.18668

Zobrazit plný text záznamu

Report

Zoology: Measuring and Improving Recall in Efficient Language Models

Autor: Arora, Simran, Eyuboglu, Sabri, Timalsina, Aman, Johnson, Isys, Poli, Michael, Zou, James, Rudra, Atri, Ré, Christopher

Attention-free language models that combine gating and convolutions are growing in popularity due to their efficiency and increasingly competitive performance. To better understand these architectures, we pretrain a suite of 17 attention and "gated-c

Externí odkaz: http://arxiv.org/abs/2312.04927

Zobrazit plný text záznamu

Report

Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions

Autor: Massaroli, Stefano, Poli, Michael, Fu, Daniel Y., Kumbong, Hermann, Parnichkun, Rom N., Timalsina, Aman, Romero, David W., McIntyre, Quinn, Chen, Beidi, Rudra, Atri, Zhang, Ce, Re, Christopher, Ermon, Stefano, Bengio, Yoshua

Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in many domains,

Externí odkaz: http://arxiv.org/abs/2310.18780

Zobrazit plný text záznamu

Report

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Autor: Fu, Daniel Y., Arora, Simran, Grogan, Jessica, Johnson, Isys, Eyuboglu, Sabri, Thomas, Armin W., Spector, Benjamin, Poli, Michael, Rudra, Atri, Ré, Christopher

Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. However, existing architectures such as Transformers scale quadratically along both these axes. We ask:

Externí odkaz: http://arxiv.org/abs/2310.12109

Zobrazit plný text záznamu

Report

Simple Hardware-Efficient Long Convolutions for Sequence Modeling

Autor: Fu, Daniel Y., Epstein, Elliot L., Nguyen, Eric, Thomas, Armin W., Zhang, Michael, Dao, Tri, Rudra, Atri, Ré, Christopher

State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SS

Externí odkaz: http://arxiv.org/abs/2302.06646

Zobrazit plný text záznamu

Report

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Autor: Fu, Daniel Y., Dao, Tri, Saab, Khaled K., Thomas, Armin W., Rudra, Atri, Ré, Christopher

State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSM

Externí odkaz: http://arxiv.org/abs/2212.14052

Zobrazit plný text záznamu

Report

Arithmetic Circuits, Structured Matrices and (not so) Deep Learning

Autor: Rudra, Atri

This survey presents a necessarily incomplete (and biased) overview of results at the intersection of arithmetic circuit complexity, structured matrices and deep learning. Recently there has been some research activity in replacing unstructured weigh

Externí odkaz: http://arxiv.org/abs/2206.12490

Zobrazit plný text záznamu

Report

How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections

Autor: Gu, Albert, Johnson, Isys, Timalsina, Aman, Rudra, Atri, Ré, Christopher

Linear time-invariant state space models (SSM) are a classical model from engineering and statistics, that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S

Externí odkaz: http://arxiv.org/abs/2206.12037

Zobrazit plný text záznamu

Report

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Autor: Dao, Tri, Fu, Daniel Y., Ermon, Stefano, Rudra, Atri, Ré, Christopher

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to r

Externí odkaz: http://arxiv.org/abs/2205.14135

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání