Simple linear attention language models balance the recall-throughput tradeoff
Abstract
Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottlenecked during inference by the KV-cache's aggressive memory consumption. In this work, we explore whether we can improve language model efficiency (e.g. by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED, a simple architecture combining linear and sliding window attention. By varying the BASED window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention alternatives on the other. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points. Implementations of linear attention are often less efficient than optimized standard attention implementations. To make BASED competitive, we develop IO-aware algorithms that enable 24x higher throughput on language generation than FlashAttention-2 when generating 1024 tokens with 1.3b parameter models. Code for this work is provided at: https://github.com/HazyResearch/based.
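As a rough aid to intuition, below is a minimal PyTorch sketch of the two ingredients the abstract describes: a causal linear-attention branch whose recurrent state has a fixed size, and an exact sliding-window attention branch. The elu+1 feature map, the window size, and the naive recurrent loop are illustrative assumptions, not the paper's feature map or its optimized IO-aware implementation.

```python
import torch
import torch.nn.functional as F


def causal_linear_attention(q, k, v, feature_map=lambda x: F.elu(x) + 1):
    """Recurrent (fixed-state) formulation of causal linear attention.

    q, k, v: (batch, seq_len, dim). The elu+1 feature map is a common choice
    from the linear-attention literature and is used here only for
    illustration; BASED's feature map and feature dimension differ.
    """
    q, k = feature_map(q), feature_map(k)
    b, n, d = q.shape
    kv_state = torch.zeros(b, d, v.size(-1), device=q.device)  # running sum of outer products k v^T
    k_state = torch.zeros(b, d, device=q.device)                # running sum of k (normalizer)
    outs = []
    for t in range(n):
        kv_state = kv_state + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
        k_state = k_state + k[:, t]
        num = torch.einsum("bd,bde->be", q[:, t], kv_state)
        den = torch.einsum("bd,bd->b", q[:, t], k_state).unsqueeze(-1) + 1e-6
        outs.append(num / den)
    # The state (kv_state, k_state) is O(d * d_v) regardless of sequence length,
    # which is what keeps memory constant during generation.
    return torch.stack(outs, dim=1)


def sliding_window_attention(q, k, v, window=64):
    """Exact softmax attention restricted to the most recent `window` tokens."""
    b, n, d = q.shape
    out = torch.zeros_like(v)
    for t in range(n):
        lo = max(0, t - window + 1)
        scores = (q[:, t : t + 1] @ k[:, lo : t + 1].transpose(1, 2)) / d ** 0.5
        out[:, t] = (scores.softmax(dim=-1) @ v[:, lo : t + 1]).squeeze(1)
    return out
```

In a BASED-style model these two branches would be combined across layers, with the window size and the linear-attention feature dimension together controlling the total recurrent state size, and hence the position on the recall-memory tradeoff curve.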
Community
Could it be that such an attention mechanism mostly works because instruction-following GPTs use attention as a redundant helper pattern for their feed-forward nets?
A visualization of this pattern: https://github.com/jessevig/bertviz/issues/128
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference (2024)
- Scaling Sparse Fine-Tuning to Large Language Models (2024)
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models (2024)
- Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers (2024)
- Long-Context Language Modeling with Parallel Context Encoding (2024)
Hello my friend, I read this paper and it was really great. Can I ask for your help to finish my paper?
Hi folks, I see the "attention" pattern is 5:5:17 or 7:7:22 for global-linear:64-SWA:BaseConv layers. How are these different layer types organized? Are they stacked in blocks (global:SWA:Conv, or another permutation) or interleaved? Was the optimal pattern of global/local/conv layers analyzed?
Hi! The layer mixtures and orders are specified in the reference config provided here: https://github.com/HazyResearch/based/blob/e2834d89d1b23d4b3beb13389881b84601a95db6/train/configs/experiment/reference/based-360m.yaml#L53. The layers are stacked.
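For readers who want a concrete picture of what "stacked" means here, the following is a hypothetical Python sketch. The layer names, counts, and order are placeholders; the authoritative mixture and ordering are the ones in the linked based-360m.yaml config.

```python
import torch.nn as nn

# Hypothetical ordering for illustration only; the real 5:5:17 (or 7:7:22)
# mixture and its exact order are defined in the linked based-360m.yaml config.
layer_pattern = ["base_conv", "base_conv", "linear_attn",
                 "base_conv", "sliding_window", "base_conv", "global_attn"]

def make_layer(kind: str, d_model: int = 1024) -> nn.Module:
    # Stand-in modules for the actual BaseConv, linear-attention,
    # sliding-window-attention, and global-attention blocks.
    stubs = {
        "base_conv": nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        "linear_attn": nn.Linear(d_model, d_model),
        "sliding_window": nn.MultiheadAttention(d_model, num_heads=8, batch_first=True),
        "global_attn": nn.MultiheadAttention(d_model, num_heads=8, batch_first=True),
    }
    return stubs[kind]

# "Stacked" means the layers are applied one after another in the order the
# config specifies; whether types appear in contiguous blocks or interleaved
# is determined entirely by that config.
layers = nn.ModuleList(make_layer(kind) for kind in layer_pattern)
```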