Daily Papers
Simple linear attention language models balance the recall-throughput tradeoff
Paper • 2402.18668 • Published • 18
Note:
- Focus: recall-throughput tradeoff in attention-based LMs
- Proposed: "Based", combining linear attention and sliding-window attention (sketched below)
- Results: matches Mamba in perplexity; improves recall-intensive tasks by 6.22 accuracy points
- Efficiency: IO-aware algorithms achieve 24× higher throughput than FlashAttention-2 for 1.3B models generating 1024 tokens
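For orientation, here is a minimal sketch of the two attention primitives Based interleaves. This is not the paper's implementation: the `phi` feature map is a generic stand-in (ELU + 1) rather than Based's Taylor-series approximation of exp, and the real model combines the two mechanisms at the layer level rather than inside one function.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, phi=lambda x: F.elu(x) + 1):
    """Causal linear attention via running prefix sums (O(n * d^2) overall).
    `phi` is a stand-in feature map, not Based's Taylor approximation of exp."""
    n, d = q.shape
    out = torch.zeros_like(v)
    kv = torch.zeros(d, d)   # running sum of outer products phi(k) x v
    z = torch.zeros(d)       # running sum of phi(k), used as the normalizer
    for i in range(n):
        fk = phi(k[i])
        kv += fk[:, None] * v[i]
        z += fk
        fq = phi(q[i])
        out[i] = (fq @ kv) / (fq @ z + 1e-6)
    return out

def sliding_window_attention(q, k, v, window=64):
    """Exact causal softmax attention restricted to the last `window` tokens."""
    n, d = q.shape
    out = torch.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = (q[i] @ k[lo:i + 1].T) / d**0.5
        out[i] = torch.softmax(scores, dim=-1) @ v[lo:i + 1]
    return out
```

The point of the combination: the sliding-window layers give precise local recall while the linear-attention layers keep a fixed-size summary of the whole prefix, which is what enables the high-throughput generation numbers above.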
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Paper • 2402.10644 • Published • 78
Note:
- Introduced the ReBased model with a learnable kernel; outperforms Based on the MQAR task and language modeling on the Pile dataset.
- Incorporates Layer Normalization into the kernel function.
- Significant improvement demonstrated across sequence lengths [128, 256, 512, 1024, 2048].
- Attention-matrix analysis shows closer resemblance to vanilla attention compared to Based.
- Chosen kernel: ReBased, phi(x) = (gamma * norm(x) + beta)^2 (see the sketch below).
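A minimal sketch of the kernel quoted above, assuming per-feature scale `gamma` and shift `beta` applied after layer normalization; the exact parameterization and how it plugs into the linear-attention layer may differ in the paper's code.

```python
import torch
import torch.nn as nn

class ReBasedFeatureMap(nn.Module):
    """phi(x) = (gamma * LayerNorm(x) + beta)^2, applied element-wise.

    Used as the kernel of a linear-attention layer: attention weights are
    proportional to phi(q) @ phi(k).T instead of softmax(q @ k.T).
    """
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift

    def forward(self, x):
        return (self.gamma * self.norm(x) + self.beta) ** 2
```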
Repeat After Me: Transformers are Better than State Space Models at Copying
Paper • 2402.01032 • Published • 22
Note:
- Transformers outperform GSSMs on copying tasks, which fundamentally depend on retrieving the input context.
- Empirical tests show Transformers' superiority on synthetic, shuffled, and natural-language strings, preserving efficiency across varying input lengths.
- GSSMs struggle with memory-intensive tasks; the fixed-size state limits practicality despite the promise of state space models.
- Evaluations involve models with ~160M parameters, leveraging positional-encoding variations.
Zoology: Measuring and Improving Recall in Efficient Language Models
Paper • 2312.04927 • Published • 2
Note:
- Attention-free language models with gating and convolutions are gaining popularity.
- Gated-convolution architectures underperform attention models by up to 2.1 perplexity points on the Pile.
- A 70M-parameter attention model outclasses a 1.4-billion-parameter gated-convolution model on associative recall.
- New task, multi-query associative recall (MQAR), formulated to explain and close the gap (toy example below).
- Convolution-attention hybrids with input-dependent sparse attention patterns can close 97.4% of the gap.
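To make the MQAR setup concrete, here is a toy generator in the spirit of the task: key-value pairs appear once in the context, and several keys are queried later, so solving it requires recalling tokens seen earlier. Vocabulary, lengths, and formatting are illustrative assumptions, not the paper's exact configuration.

```python
import random

def make_mqar_example(num_pairs=4, num_queries=3, seed=0):
    """Build one multi-query associative recall example.

    Context lists key-value pairs; the model must later return the value
    associated with each queried key.
    """
    rng = random.Random(seed)
    keys = rng.sample([f"k{i}" for i in range(100)], num_pairs)
    values = rng.sample([f"v{i}" for i in range(100)], num_pairs)
    kv = dict(zip(keys, values))

    context = [tok for k in keys for tok in (k, kv[k])]   # k0 v0 k1 v1 ...
    queried = rng.sample(keys, num_queries)
    targets = [kv[k] for k in queried]
    return context + queried, targets

tokens, targets = make_mqar_example()
print(tokens)   # e.g. ['k52', 'v7', ..., then the queried keys]
print(targets)  # the values the model should produce for the queried keys
```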
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
Paper • 2402.04248 • Published • 30
Note:
- SSMs like Mamba and Transformers compared for in-context learning (ICL) capabilities.
- The Mamba + Transformer hybrid, MambaFormer, outperforms in tasks challenging for either model alone.
- Experimented across tasks like sparse parity and vector-valued MQAR; Mamba struggles on retrieval tasks.
- MambaFormer showcases best-of-both-worlds behavior in ICL tasks, suggesting the potential of hybrid architectures.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Paper • 2205.14135 • Published • 11
Note:
- FlashAttention introduces IO-aware exact attention, optimizing GPU HBM/SRAM access.
- Achieved a 15% speedup over the MLPerf 1.1 record on BERT-large, 3× on GPT-2 (1K seq. length), 2.4× on Long-Range Arena (1K-4K seq. length).
- Enabled Transformers on the Path-X (16K seq. length, 61.4% accuracy) and Path-256 (64K seq. length, 63.1% accuracy) challenges.
- Employs tiling and recomputation for efficiency (online-softmax sketch below).
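The tiling idea rests on computing softmax blockwise with running max/sum statistics, so the full n×n score matrix is never materialized. Below is a minimal NumPy sketch of that online-softmax accumulation; causal masking, the GPU memory hierarchy, and the backward-pass recomputation are all omitted.

```python
import numpy as np

def tiled_attention(q, k, v, block=128):
    """Exact softmax attention computed one key/value block at a time,
    keeping only running row-wise max (m) and normalizer (l) statistics."""
    n, d = q.shape
    acc = np.zeros_like(v, dtype=np.float64)   # unnormalized output accumulator
    m = np.full(n, -np.inf)                    # running max score per query row
    l = np.zeros(n)                            # running softmax normalizer per row
    for start in range(0, n, block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) / np.sqrt(d)            # scores against this key block
        m_new = np.maximum(m, s.max(axis=1))   # updated row-wise max
        scale = np.exp(m - m_new)              # rescales previously accumulated sums
        p = np.exp(s - m_new[:, None])         # shifted exponentials for this block
        acc = acc * scale[:, None] + p @ vb
        l = l * scale + p.sum(axis=1)
        m = m_new
    return acc / l[:, None]
```

On a GPU the q/k/v tiles live in SRAM while the running statistics stay in registers; the backward pass recomputes the block scores instead of storing the attention matrix, which is where the memory savings come from.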
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Paper • 2310.12109 • Published • 1
Note:
- Introduced the Monarch Mixer (M2) architecture, using Monarch matrices for sub-quadratic scaling in both sequence length and model dimension.
- Matched or surpassed baseline performance with fewer parameters: BERT-base (-27%), BERT-large (-24%), ViT-b (+1% accuracy, half the parameters).
- Developed a causality-enforcement strategy enabling causal sequence mixing, applicable to GPT-style models, with a 0.2 PPL improvement on the Pile.
Lost in the Middle: How Language Models Use Long Contexts
Paper • 2307.03172 • Published • 36
Note:
- Language models show a U-shaped performance curve on long-context tasks: highest when relevant information is at the start/end of the context, dropping in the middle.
- GPT-3.5-Turbo's multi-document QA performance drops by over 20% when the relevant information is mid-context.
- Encoder-decoder models are robust within their training-length limit, less so beyond it.
- Query-aware contextualization improves key-value retrieval, with minimal effect on multi-document QA.
Never Lost in the Middle: Improving Large Language Models via Attention Strengthening Question Answering
Paper • 2311.09198 • Published • 3
Note:
- The "lost in the middle" issue is tackled with ASM QA, boosting LLMs on multi-document QA.
- Ziya-Reader outperforms SOTA by up to 21.5% in passage retrieval and 13.7% in shuffled settings.
- Employs Attention-Strengthening Multi-doc QA (ASM QA) for enhanced focus in long contexts.
- Benchmarks: multi-document QA, synthesis tasks, and summarization on LongBench.
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Paper • 2401.04081 • Published • 70
Note:
- MoE-Mamba outperforms Mamba and Transformer-MoE; reaches the same performance as Mamba in 2.2× fewer training steps (Fig. 1).
- Demonstrates the efficiency gains of combining SSMs with MoE.
- Scales well with the number of experts; best result with 32 experts.
- Training setup detailed in Table 3; alternative designs explored but found suboptimal.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Paper • 2101.03961 • Published • 14
Note:
- **Switch Transformer:** improved MoE model addressing drawbacks of complexity, communication cost, and training instability.
- Achieves 7× pre-training speedups, and a 4× speedup over T5-XXL with trillion-parameter models on the Colossal Clean Crawled Corpus.
- Demonstrates superior scaling and fine-tuning benefits; significant improvements in multilingual settings across 101 languages.
- Achieved through simplified (top-1) routing, reduced communication, and enhanced training techniques (routing sketch below).
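A minimal sketch of Switch-style top-1 routing with the auxiliary load-balancing loss, assuming a dense batch of token embeddings and simple linear experts; capacity factors, expert parallelism, and the paper's selective-precision tricks are left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchLayer(nn.Module):
    """Each token is sent to exactly one expert (top-1 routing)."""
    def __init__(self, dim, num_experts=8, aux_weight=0.01):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.aux_weight = aux_weight

    def forward(self, x):                                  # x: (tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)          # (tokens, experts)
        gate, idx = probs.max(dim=-1)                      # top-1 gate value and expert id
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # Scale each expert output by its gate so the router gets gradients.
                out[mask] = gate[mask, None] * expert(x[mask])
        # Load-balancing loss: fraction of tokens per expert * mean router prob per expert.
        frac_tokens = F.one_hot(idx, probs.shape[-1]).float().mean(dim=0)
        mean_probs = probs.mean(dim=0)
        aux_loss = self.aux_weight * probs.shape[-1] * (frac_tokens * mean_probs).sum()
        return out, aux_loss
```

Routing each token to a single expert is what keeps the compute per token constant while the parameter count grows with the number of experts.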
Accelerating LLM Inference with Staged Speculative Decoding
Paper • 2308.04623 • Published • 23
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Paper • 2402.05099 • Published • 18
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 42
Scaling Laws for Fine-Grained Mixture of Experts
Paper • 2402.07871 • Published • 11
Note:
- Introduced "granularity" as a hyperparameter; adjusting it enhances MoE model efficiency.
- Proposed new scaling laws incorporating granularity, model size, and training tokens.
- Showed that an optimal granularity (G) enhances compute-optimal MoE performance over dense Transformers.
- Empirical finding: a compute-optimal MoE at 10^20 FLOPs achieves performance equivalent to a dense Transformer trained with 20× the FLOPs.
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Paper • 2310.05736 • Published • 4
Mixtral of Experts
Paper • 2401.04088 • Published • 157
Note:
- Introduced Mixtral 8x7B, a sparse mixture-of-experts (SMoE) language model; outperforms Llama 2 70B and GPT-3.5 on benchmarks such as mathematics and code generation.
- Mixtral 8x7B - Instruct surpasses GPT-3.5 Turbo and Claude-2.1 on human benchmarks; achieves 70.6% on MMLU.
- Routing analysis shows no expert specialization across domains; high temporal locality in expert assignment.
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
Paper • 2310.15961 • Published • 1
Note:
- Mixture of Tokens (MoT) addresses MoE challenges: training instability and load imbalance.
- MoT is fully differentiable, mixing tokens across examples before the experts, avoiding discrete routing operations.
- Results show a 3× reduction in training time vs. a vanilla Transformer, promising for larger models.
- Experiments: GPT-style models on the C4 dataset, 250k steps; significant reduction in training steps/time noted.
- Future focus: MoT-to-MoE transition and privacy considerations in autoregressive decoding via a temperature parameter.
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 104
Note:
- Mixture-of-Depths (MoD) dynamically allocates transformer FLOPs per token, improving efficiency.
- Dynamic vs. static compute: top-k routing per block; "expert-choice" routing for load balance (sketch below).
- MoD matches or beats vanilla transformers in isoFLOP settings; up to 50% fewer FLOPs and 60% faster step times.
- Empirical analysis suggests the sweet spot is routing every other block with 12.5% capacity.
- To get an optimal model for a budget of X FLOPs, train a MoD model with 12.5% capacity at that same budget X.
- MoDE: Mixture-of-Depths-and-Experts, combining MoD with MoE.
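A minimal sketch of the per-block token routing described above, assuming a scalar router and expert-choice selection (the block picks its top-k tokens); selected tokens go through the block, the rest skip it via the residual path. The sigmoid gating here is an illustrative choice, and the paper's handling of causality for autoregressive sampling is omitted.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Wrap a block so only the top-k scoring tokens are processed by it."""
    def __init__(self, block, dim, capacity=0.125):
        super().__init__()
        self.block = block                 # the block's residual branch: (tokens, dim) -> (tokens, dim)
        self.router = nn.Linear(dim, 1)    # scalar score per token
        self.capacity = capacity           # fraction of tokens the block processes

    def forward(self, x):                  # x: (seq, dim)
        seq = x.shape[0]
        k = max(1, int(seq * self.capacity))
        scores = self.router(x).squeeze(-1)            # (seq,)
        topk = torch.topk(scores, k).indices           # expert-choice: the block picks its tokens
        out = x.clone()                                # unselected tokens take the skip path
        # Gate the block output with the router score so routing stays differentiable.
        processed = self.block(x[topk]) * torch.sigmoid(scores[topk]).unsqueeze(-1)
        out[topk] = x[topk] + processed
        return out
```

With capacity 0.125 and routing applied to every other block, the model spends full compute on only a fraction of tokens, which is where the FLOP and step-time savings come from.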
BlackMamba: Mixture of Experts for State-Space Models
Paper • 2402.01771 • Published • 23
Note:
- BlackMamba combines the Mamba SSM with MoE for linear-complexity generation and fast inference.
- Open source: 340M/1.5B and 630M/2.8B models, trained on 300B tokens.
- Outperforms transformer baselines in inference and training FLOPs.
- Introduces a Sinkhorn-based innovation for MoE routing, significantly reducing convergence iterations (sketch below).
- Evaluation: competitive against pretrained LLMs; superior scaling evident on downstream tasks.
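For context on the routing note above, here is a generic Sinkhorn normalization sketch: alternately normalizing the rows and columns of the token-expert score matrix pushes the assignment toward a balanced one. This is the standard algorithm, not BlackMamba's specific variant, which the paper modifies so that it converges in far fewer iterations.

```python
import torch

def sinkhorn(scores, n_iters=3, eps=1e-9):
    """Balance a (tokens x experts) score matrix by alternating normalizations.

    Returns soft assignment weights whose expert columns receive roughly
    equal total mass, which is what MoE load balancing wants.
    """
    a = torch.exp(scores)                            # positive matrix
    for _ in range(n_iters):
        a = a / (a.sum(dim=1, keepdim=True) + eps)   # normalize over experts (rows)
        a = a / (a.sum(dim=0, keepdim=True) + eps)   # normalize over tokens (columns)
    return a

# Example: route each token to the expert with the highest balanced weight.
scores = torch.randn(16, 4)                          # 16 tokens, 4 experts
assignment = sinkhorn(scores).argmax(dim=1)
```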
DoRA: Weight-Decomposed Low-Rank Adaptation
Paper • 2402.09353 • Published • 26
HyperAttention: Long-context Attention in Near-Linear Time
Paper • 2310.05869 • Published • 2
Ring Attention with Blockwise Transformers for Near-Infinite Context
Paper • 2310.01889 • Published • 10
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 602
Yi: Open Foundation Models by 01.AI
Paper • 2403.04652 • Published • 62
Note:
- Yi model family by 01.AI, extending 6B and 34B pretrained LMs to various applications, including chat and vision-language models.
- Data engineering drives strong human preference rates on AlpacaEval and Chatbot Arena.
- Performance gains attributed to a high-quality 3.1-trillion-token pretraining corpus and iteration on the finetuning dataset.
- Highlights depth-upscaled models and 200K context extension, showing notable benchmark performance.
sDPO: Don't Use Your Data All at Once
Paper • 2403.19270 • Published • 39
Note:
- sDPO proposed for LLM alignment; outperforms other models in H4 score (74.31 vs. 72.67 for DPO on SOLAR 10.7B).
- Employs preference datasets stepwise, using previously aligned models as references.
- Demonstrated on datasets like Ultrafeedback Cleaned and OpenOrca; benchmarks include ARC, HellaSwag, MMLU, TruthfulQA.
- Challenges: optimal data segmentation and expanding model scope.
Long Range Arena: A Benchmark for Efficient Transformers
Paper • 2011.04006 • Published
Note:
- Focus on the quadratic self-attention complexity of Transformers.
- Introduced the Long-Range Arena benchmark for evaluating efficient Transformers under long-context scenarios.
- Tasks include ListOps, document classification/retrieval, image classification, and Pathfinder, with sequences of 1K-16K tokens.
- Extensive comparison of ten models; BigBird shows consistent performance across tasks.
- No "one-size-fits-all"; trade-offs in model quality and speed/memory.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Paper • 2304.01373 • Published • 8
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Paper • 2402.19427 • Published • 52
Effective Long-Context Scaling of Foundation Models
Paper • 2309.16039 • Published • 30
Note tl;dr:
- Increase the RoPE base frequency from 10k to 1M+ (see the sketch below).
- Used by Yi for a 200K context window (they used 10M).
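A minimal sketch of what adjusting the RoPE base changes: the per-dimension rotation frequencies, which get slower (longer wavelength) as the base grows, so positions far apart remain distinguishable. Shapes and the even/odd pairing convention here are illustrative.

```python
import torch

def rope_frequencies(head_dim, base=10_000.0):
    """Per-pair rotation frequencies used by rotary position embeddings.
    Raising `base` (e.g. to 1e6 or 1e7 for long context) slows the rotations."""
    return base ** (-torch.arange(0, head_dim, 2).float() / head_dim)

def apply_rope(x, base=10_000.0):
    """Rotate consecutive feature pairs of x (seq, head_dim) by position-dependent angles."""
    seq, head_dim = x.shape
    freqs = rope_frequencies(head_dim, base)                 # (head_dim / 2,)
    angles = torch.arange(seq).float()[:, None] * freqs      # (seq, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                          # even / odd feature pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Larger base -> lower frequencies -> distant positions stay separable.
print(rope_frequencies(128, base=10_000)[:4])
print(rope_frequencies(128, base=10_000_000)[:4])
```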
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Paper • 2404.07143 • Published • 103
Note:
1. Infini-attention = linear long-term compressive memory plus local causal attention, for efficiently modeling both long- and short-range contextual dependencies (memory sketch below).
2. Minimal change to standard scaled dot-product attention; supports plug-and-play continual pre-training and long-context adaptation by design.
3. Infinitely long context with bounded memory - streaming.
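A minimal sketch of the compressive-memory half, assuming the linear-attention-style update and retrieval with a sigma = ELU + 1 feature map; the per-segment local attention, the delta-rule variant of the update, and the learned gate that blends the two read-outs are left out.

```python
import torch
import torch.nn.functional as F

def sigma(x):
    return F.elu(x) + 1              # nonlinearity used for memory reads/writes

class CompressiveMemory:
    """Fixed-size associative memory carried across segments of a long stream."""
    def __init__(self, d_key, d_value):
        self.M = torch.zeros(d_key, d_value)   # memory matrix
        self.z = torch.zeros(d_key)            # normalization term

    def retrieve(self, q):                     # q: (seq, d_key)
        s = sigma(q)
        return (s @ self.M) / (s @ self.z + 1e-6).unsqueeze(-1)

    def update(self, k, v):                    # k: (seq, d_key), v: (seq, d_value)
        s = sigma(k)
        self.M = self.M + s.T @ v
        self.z = self.z + s.sum(dim=0)

# Per segment: read long-term context from memory, then write this segment in.
mem = CompressiveMemory(d_key=64, d_value=64)
q = k = v = torch.randn(128, 64)
long_term = mem.retrieve(q)      # blended (via a learned gate) with local attention output
mem.update(k, v)
```

Because the memory has a fixed size regardless of how many segments have been seen, the context length is effectively unbounded while memory stays constant, which is the "streaming" property in the note.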
GLU Variants Improve Transformer
Paper • 2002.05202 • Published • 1
Note: ablation study of various GLU variants in the Transformer FFN; GeGLU wins (variants sketched below).
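A minimal sketch of the gated FFN compared in the paper, assuming no bias terms: each variant replaces FFN(x) = W2·act(W1·x) with a gated form act(xW) * (xV) followed by the output projection, differing only in the activation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUFFN(nn.Module):
    """Gated FFN: W2 @ (act(x W) * (x V)); with GELU this is GEGLU."""
    def __init__(self, dim, hidden, act=F.gelu):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)   # activated branch
        self.v = nn.Linear(dim, hidden, bias=False)   # linear gate branch
        self.w2 = nn.Linear(hidden, dim, bias=False)
        self.act = act

    def forward(self, x):
        return self.w2(self.act(self.w(x)) * self.v(x))

# Other variants just swap the activation:
#   GLU -> torch.sigmoid, ReGLU -> F.relu, SwiGLU -> F.silu, Bilinear -> identity.
ffn = GLUFFN(dim=512, hidden=1365)   # hidden is typically shrunk ~2/3 to keep parameter count fixed
y = ffn(torch.randn(4, 512))
```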
Thinking Like Transformers
Paper • 2106.06981 • Published
HGRN2: Gated Linear RNNs with State Expansion
Paper • 2404.07904 • Published • 17