arxiv:2403.03206

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Published on Mar 5 · Submitted by akhaliq on Mar 6

#2 Paper of the day

Authors: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Kyle Lacey, Alex Goodwin, Robin Rombach

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
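To make the "straight line" idea concrete, here is a minimal sketch of how a rectified flow training pair could be built. The interpolation `z_t = (1 - t) * x0 + t * noise` and the constant velocity target `noise - x0` follow from the straight-line formulation. The logit-normal timestep sampler is an illustrative assumption: the abstract only says that noise sampling is biased towards perceptually relevant scales and does not name a distribution here. All function names are hypothetical.

```python
import math
import random


def sample_timestep_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid.

    Assumption for illustration: this concentrates samples on intermediate
    noise levels instead of sampling t uniformly.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


def rectified_flow_pair(x0, noise, t):
    """Return the point at time t on the straight line from data to noise,
    together with the (constant) velocity along that line."""
    z_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]
    return z_t, velocity


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                      # toy "data" vector
noise = [rng.gauss(0.0, 1.0) for _ in x0]  # Gaussian noise sample
t = sample_timestep_logit_normal(rng)
z_t, v = rectified_flow_pair(x0, noise, t)
```

In a training loop, a network would receive `z_t` and `t` and be regressed onto `v`; because the path is straight, following `v` from `z_t` for the remaining time `1 - t` lands exactly on `noise`.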

Community

multimodalart

Mar 6

The Stable Diffusion 3 research paper broken down, including some overlooked details! 📝

Model
📏 2 base model variants mentioned: 2B and 8B sizes

📐 New architecture at all abstraction levels:

🔽 UNet out; ⬆️ Multimodal Diffusion Transformer in, bye cross attention 👋
🆕 Rectified flows for the diffusion process
🧩 Still a Latent Diffusion Model

📄 3 text-encoders: 2 CLIPs, one T5-XXL; plug-and-play: removing the larger one maintains competitiveness

🗃️ Dataset was deduplicated with SSCD, which helped reduce memorization (no more details about the dataset tho)
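For the "bye cross attention" point above, here is a rough sketch of one way to read "separate weights for the two modalities" with a "bidirectional flow of information": each modality gets its own query/key/value projections, then both token sequences are joined and attend over each other in a single attention step. This is a single-head toy in plain Python, not the paper's implementation; `joint_attention` and the weight layout are hypothetical names chosen for illustration.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text_tokens, image_tokens, text_w, image_w):
    """Attend over the concatenation of text and image tokens.

    text_w / image_w are dicts with 'q', 'k', 'v' matrices, one set per
    modality, so the projections are not shared between text and image.
    """
    q, k, v = [], [], []
    for tokens, w in ((text_tokens, text_w), (image_tokens, image_w)):
        for tok in tokens:
            q.append(matvec(w["q"], tok))
            k.append(matvec(w["k"], tok))
            v.append(matvec(w["v"], tok))

    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        # Every token scores against every token of both modalities.
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(wt * vj[d] for wt, vj in zip(weights, v))
                    for d in range(len(v[0]))])

    n_text = len(text_tokens)
    return out[:n_text], out[n_text:]


identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]
text_w = {"q": identity, "k": identity, "v": identity}
image_w = {"q": half, "k": half, "v": half}

text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
text_out, image_out = joint_attention(text, image, text_w, image_w)
```

Because the attention runs over the joined sequence, text tokens are updated using image tokens and vice versa, which is the difference from a cross-attention layer where only the image side reads from the text.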

Variants
🔁 A DPO fine-tuned model showed great improvement in prompt understanding and aesthetics
✏️ An Instruct Edit 2B model was trained, and learned how to do text-replacement

Results
✅ State of the art in automated evals for composition and prompt understanding
✅ Best win rate in human preference evaluation for prompt understanding, aesthetics and typography (missing some details on how many participants and the design of the experiment)