- xLSTM: Extended Long Short-Term Memory
  Paper • 2405.04517 • Published • 11
- You Only Cache Once: Decoder-Decoder Architectures for Language Models
  Paper • 2405.05254 • Published • 9
- Understanding the performance gap between online and offline alignment algorithms
  Paper • 2405.08448 • Published • 14
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 126
Collections including paper arxiv:2404.14994

- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings
  Paper • 2305.13571 • Published • 2
- Transformers Can Represent n-gram Language Models
  Paper • 2404.14994 • Published • 18
- Are Sixteen Heads Really Better than One?
  Paper • 1905.10650 • Published • 2
- Reasoning in Large Language Models: A Geometric Perspective
  Paper • 2407.02678 • Published • 1

- MLP Can Be A Good Transformer Learner
  Paper • 2404.05657 • Published • 1
- Toward a Better Understanding of Fourier Neural Operators: Analysis and Improvement from a Spectral Perspective
  Paper • 2404.07200 • Published • 1
- An inclusive review on deep learning techniques and their scope in handwriting recognition
  Paper • 2404.08011 • Published • 1
- Long-form music generation with latent diffusion
  Paper • 2404.10301 • Published • 24

- I-Design: Personalized LLM Interior Designer
  Paper • 2404.02838 • Published • 2
- Scaling MLPs: A Tale of Inductive Bias
  Paper • 2306.13575 • Published • 14
- Fast Feedforward Networks
  Paper • 2308.14711 • Published • 2
- How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
  Paper • 2404.14047 • Published • 44

- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  Paper • 2403.09611 • Published • 124
- Evolutionary Optimization of Model Merging Recipes
  Paper • 2403.13187 • Published • 50
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  Paper • 2402.03766 • Published • 12
- LLM Agent Operating System
  Paper • 2403.16971 • Published • 65

- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 78
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 5