Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models • arXiv:2411.04996 • Published Nov 2024
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders • arXiv:2410.22366 • Published Oct 2024
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching • arXiv:2410.06885 • Published Oct 9, 2024
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation • arXiv:2410.01912 • Published Oct 2, 2024
Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models • arXiv:2410.02416 • Published Oct 3, 2024
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models • arXiv:2409.17066 • Published Sep 25, 2024
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models • arXiv:2409.17146 • Published Sep 25, 2024
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think • arXiv:2409.11355 • Published Sep 17, 2024
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • arXiv:2409.12191 • Published Sep 18, 2024
Moshi v0.1 Release (collection) • MLX, Candle & PyTorch model checkpoints released as part of the Moshi release from Kyutai; run inference via https://github.com/kyutai-labs/moshi • 13 items • Updated Sep 18, 2024
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation • arXiv:2409.04410 • Published Sep 6, 2024