Image / Video Gen - a Norm Collection

Note 1. Introduce v_pred (velocity prediction) for the DDPM noise scheduler.
1.1 Definition: v = \sqrt{\bar{\alpha}_t} \epsilon - \sqrt{1-\bar{\alpha}_t} x_0
1.2 Conversion between epsilon prediction and velocity prediction: \epsilon_{pred} = \sqrt{\bar{\alpha}_t} v_{pred} + \sqrt{1-\bar{\alpha}_t} x_t
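The conversion in 1.2 can be checked by substitution. The short derivation below assumes the standard DDPM forward process x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, which the note does not state explicitly; the last line (recovering x_0) is an extra identity that follows from the same assumption.

```latex
\begin{aligned}
x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,
\qquad
v = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1-\bar{\alpha}_t}\, x_0 \\[4pt]
\sqrt{\bar{\alpha}_t}\, v + \sqrt{1-\bar{\alpha}_t}\, x_t
 &= \bar{\alpha}_t\, \epsilon
  - \sqrt{\bar{\alpha}_t (1-\bar{\alpha}_t)}\, x_0
  + \sqrt{\bar{\alpha}_t (1-\bar{\alpha}_t)}\, x_0
  + (1-\bar{\alpha}_t)\, \epsilon
  = \epsilon \\[4pt]
\sqrt{\bar{\alpha}_t}\, x_t - \sqrt{1-\bar{\alpha}_t}\, v
 &= \bar{\alpha}_t\, x_0
  + \sqrt{\bar{\alpha}_t (1-\bar{\alpha}_t)}\, \epsilon
  - \sqrt{\bar{\alpha}_t (1-\bar{\alpha}_t)}\, \epsilon
  + (1-\bar{\alpha}_t)\, x_0
  = x_0
\end{aligned}
```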

Flow Matching for Generative Modeling

Paper • 2210.02747 • Published Oct 6, 2022 • 1

simple diffusion: End-to-end diffusion for high resolution images

Paper • 2301.11093 • Published Jan 26, 2023 • 2

Note 1. Train with v-prediction and an epsilon loss: the network outputs v, which is converted to epsilon before the loss is computed.
v_pred = uvit(z_t, logsnr_t)
eps_pred = sigma_t * z_t + alpha_t * v_pred
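A minimal scalar sketch of the v-prediction with epsilon loss setup. It assumes the variance-preserving parameterization alpha_t^2 = sigmoid(logsnr_t) and sigma_t^2 = sigmoid(-logsnr_t); `predict_v` is a hypothetical stand-in for the `uvit` network, and the helper names are illustrative only.

```python
import math

def alpha_sigma(logsnr):
    """Assumed VP coefficients: alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr)."""
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-logsnr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(logsnr)))
    return alpha, sigma

def eps_from_v(z_t, v_pred, logsnr):
    """Convert a velocity prediction into an epsilon prediction."""
    alpha, sigma = alpha_sigma(logsnr)
    return sigma * z_t + alpha * v_pred

def epsilon_loss(x0, eps, logsnr, predict_v):
    """Squared error in epsilon space for a model that outputs v."""
    alpha, sigma = alpha_sigma(logsnr)
    z_t = alpha * x0 + sigma * eps          # forward process
    v_pred = predict_v(z_t, logsnr)         # hypothetical network call
    eps_pred = eps_from_v(z_t, v_pred, logsnr)
    return (eps_pred - eps) ** 2

# An oracle that returns the true velocity should give (numerically) zero loss.
x0, eps, logsnr = 0.3, -1.2, 0.5
alpha, sigma = alpha_sigma(logsnr)
oracle = lambda z_t, l: alpha * eps - sigma * x0
loss = epsilon_loss(x0, eps, logsnr, oracle)
```

The oracle check only confirms that the conversion is consistent with the definition of v; it says nothing about how a trained network behaves.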

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Paper • 2209.03003 • Published Sep 7, 2022 • 1

MAGVIT: Masked Generative Video Transformer

Paper • 2212.05199 • Published Dec 10, 2022

Note 1. Inflation
1.1 Use a central inflation method for the convolution layers, where the corresponding 2D kernel fills in the temporally central slice of a zero-filled 3D kernel.
1.2 Replace the same (zero) padding in the convolution layers with reflect padding.
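The two inflation steps can be sketched with plain nested lists. This is an illustration under simple assumptions (odd temporal kernel size, 1D padding shown for brevity), not the MAGVIT code; both function names are hypothetical.

```python
def inflate_central(kernel2d, kt):
    """Put a 2D kernel (kh x kw) into the central temporal slice of a zero 3D kernel (kt x kh x kw)."""
    kh, kw = len(kernel2d), len(kernel2d[0])
    kernel3d = [[[0.0] * kw for _ in range(kh)] for _ in range(kt)]
    kernel3d[kt // 2] = [row[:] for row in kernel2d]  # central slice (kt assumed odd)
    return kernel3d

def reflect_pad_1d(seq, pad):
    """Reflect padding that mirrors around the edge element without repeating it."""
    seq = list(seq)
    left = seq[1:pad + 1][::-1]
    right = seq[-pad - 1:-1][::-1]
    return left + seq + right

kernel2d = [[1.0, 2.0], [3.0, 4.0]]
kernel3d = inflate_central(kernel2d, kt=3)
padded = reflect_pad_1d([1, 2, 3, 4], pad=2)
```

With this initialization the 3D convolution reproduces the 2D convolution frame by frame, since the non-central temporal slices contribute nothing until they are trained.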

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Paper • 2310.05737 • Published Oct 9, 2023 • 4

Note 1. Known as MAGVIT-2. Growing the vocabulary size can benefit the generation quality.
2. Both reconstruction and generation consistently improve as the vocabulary size increases.
The lookup-free quantizer (LFQ) decomposes the vocabulary into single-dimensional binary variables: each latent dimension is quantized by its sign, and the resulting bits form the token index.
For example, with latent feature z \in R^{4}:
[-1, 1, -2, 3] --> [0, 1, 0, 1] --> sum([0, 2^1, 0, 2^3]) --> 10
[ 1, 1, 1, 3] --> [1, 1, 1, 1] --> sum([2^0, 2^1, 2^2, 2^3]) --> 15
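The sign-to-index mapping in the example can be written in a few lines. This is a sketch of the index computation only (no training-time losses), and the function name is illustrative.

```python
def lfq_token_index(z):
    """Quantize each latent dimension by its sign and read the bits as an integer."""
    bits = [1 if value > 0 else 0 for value in z]
    return sum(bit << position for position, bit in enumerate(bits))

index_a = lfq_token_index([-1, 1, -2, 3])   # bits [0, 1, 0, 1]
index_b = lfq_token_index([1, 1, 1, 3])     # bits [1, 1, 1, 1]
vocab_size = 2 ** 4                         # one bit per latent dimension
```

Because each dimension contributes one bit, a d-dimensional latent gives a vocabulary of 2^d tokens without storing an embedding table for lookup.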

Scalable Diffusion Models with Transformers

Paper • 2212.09748 • Published Dec 19, 2022 • 16

Note 1. adaLN-Zero: in addition to γ and β, DiT regresses dimension-wise scaling parameters α that are applied immediately before each residual connection within the DiT block. Following the U-Net initialization strategy of zero-initializing the final convolutional layer in each block before the residual connection, the MLP is initialized to output zeros for all α, so every DiT block starts as the identity function.
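A small sketch of why zero-initialized α makes a block the identity at the start of training. It assumes the usual modulation form h * (1 + γ) + β after a parameter-free layer norm; `sublayer` is a hypothetical stand-in for attention or the MLP, and all names are illustrative.

```python
import math

def layer_norm(x, eps=1e-6):
    """Parameter-free layer normalization over a single feature vector."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def adaln_zero_branch(x, gamma, beta, alpha, sublayer):
    """x + alpha * sublayer(modulate(layer_norm(x), gamma, beta))."""
    h = layer_norm(x)
    h = [hi * (1.0 + g) + b for hi, g, b in zip(h, gamma, beta)]  # scale and shift
    h = sublayer(h)
    return [xi + a * hi for xi, a, hi in zip(x, alpha, h)]        # gated residual

x = [1.0, -2.0, 0.5]
zeros = [0.0, 0.0, 0.0]
toy_sublayer = lambda h: [2.0 * v + 1.0 for v in h]  # arbitrary stand-in
out_at_init = adaln_zero_branch(x, zeros, zeros, zeros, toy_sublayer)
```

With α = 0 the residual branch is switched off regardless of what the sublayer computes, so `out_at_init` equals `x`.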

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Paper • 2401.08740 • Published Jan 16 • 12

Note 1. Generation process: the stochastic interpolant framework decouples the formulation of x_t from the forward SDE.
2. Model prediction: learn the velocity field v(x, t) and use it to express the score s(x, t) when using an SDE for sampling.
3. The optimal choice of w_t always depends on both the model prediction and the interpolant.
4. From a DiT model (discrete, score prediction, VP interpolant) to a SiT model (continuous, velocity prediction, linear interpolant).
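A sketch of expressing the score through the velocity. It assumes the interpolant x_t = α_t x_* + σ_t ε with data at t = 0 and noise at t = 1, for which s = (α_t v - α̇_t x) / (σ_t (α̇_t σ_t - α_t σ̇_t)); the linear interpolant α_t = 1 - t, σ_t = t is used for the check. Function and variable names are illustrative.

```python
def score_from_velocity(x, v, t, alpha, sigma, d_alpha, d_sigma):
    """Score s(x, t) written in terms of the velocity v(x, t) for x_t = alpha_t x_* + sigma_t eps."""
    a, s = alpha(t), sigma(t)
    da, ds = d_alpha(t), d_sigma(t)
    return (a * v - da * x) / (s * (da * s - a * ds))

# Linear interpolant: alpha_t = 1 - t, sigma_t = t (assumed convention).
linear = dict(
    alpha=lambda t: 1.0 - t,
    sigma=lambda t: t,
    d_alpha=lambda t: -1.0,
    d_sigma=lambda t: 1.0,
)

# Single data point: the exact velocity is eps - x_star and the exact score is -eps / t.
x_star, eps, t = 0.7, -0.4, 0.3
x_t = (1.0 - t) * x_star + t * eps
v = eps - x_star
score = score_from_velocity(x_t, v, t, **linear)
```

The formula divides by σ_t, so it is only usable for t > 0 under this convention.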

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Paper • 2408.12590 • Published Aug 22 • 33

Note 1. Extend the 2D image-based VAE into a 3D VideoVAE with CausalConv3D.
2. Encode a long video with a divide-and-merge strategy.
3. Caption model:
3.1 The temporal encoder is implemented with [Token Turing Machines](https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing).

Classifier-Free Diffusion Guidance

Paper • 2207.12598 • Published Jul 26, 2022 • 2

Note 1. Follow-up work: APG (https://arxiv.org/pdf/2410.02416)
1.1 Leaning more on the orthogonal component of the guidance update significantly attenuates the oversaturation side effect of high guidance scales while maintaining the quality-boosting benefits of CFG.
1.2 APG performs best when applied to the denoised predictions rather than the noise prediction.
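A simplified sketch of the parallel/orthogonal split, assuming the convention in which w = 1 means no guidance and the difference between conditional and unconditional predictions is projected onto the conditional prediction. The rescaling and momentum terms of APG are omitted, and `eta` (the weight kept on the parallel part) and the function names are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cfg(cond, uncond, w):
    """Classifier-free guidance: cond + (w - 1) * (cond - uncond)."""
    return [c + (w - 1.0) * (c - u) for c, u in zip(cond, uncond)]

def apg_projection(cond, uncond, w, eta=0.0):
    """Down-weight the component of the update that is parallel to the conditional prediction."""
    diff = [c - u for c, u in zip(cond, uncond)]
    scale = dot(diff, cond) / dot(cond, cond)
    parallel = [scale * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [c + (w - 1.0) * (o + eta * p) for c, o, p in zip(cond, orthogonal, parallel)]

cond, uncond, w = [1.0, 2.0, 0.5], [0.5, 1.0, 1.5], 5.0
plain = cfg(cond, uncond, w)
same_as_cfg = apg_projection(cond, uncond, w, eta=1.0)   # keeps the full update
orth_only = apg_projection(cond, uncond, w, eta=0.0)     # orthogonal part only
```

Following item 1.2, `cond` and `uncond` would be denoised predictions rather than noise predictions.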

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Paper • 2310.00426 • Published Sep 30, 2023 • 61

Note 1. Training recipe
- Initialize the T2I model from a low-cost class-conditional model;
- Pretrain on text-image pair data rich in information density;
- Fine-tune on data with superior aesthetic quality.
2. adaLN-single
- One global set of shifts and scales is computed only at the first block and shared across all blocks, denoted shared_adaln_cond;
- A layer-specific trainable embedding, denoted adaln_cond, adaptively adjusts the scale and shift parameters in each block.
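A toy sketch of adaLN-single, assuming the per-block parameters are the sum of the shared, time-dependent vector and the block's own trainable embedding. The six slots stand for the shifts, scales, and gates of one block; the values and the function name are made up for illustration.

```python
def block_modulation(shared_adaln_cond, adaln_cond):
    """Per-block modulation = shared (computed once) + layer-specific embedding."""
    return [s + e for s, e in zip(shared_adaln_cond, adaln_cond)]

# Computed once per forward pass from the time embedding (illustrative values).
shared_adaln_cond = [0.5, -0.25, 1.0, 0.0, 0.75, -1.0]

# One trainable embedding per block (illustrative values, 2 blocks).
layer_embeddings = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.25, 0.25, -0.5, 1.0, 0.0, 0.5],
]

per_block = [block_modulation(shared_adaln_cond, e) for e in layer_embeddings]
```

Compared with a separate MLP in every block, only the small per-block embeddings differ across layers, which is where the parameter savings come from.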

FreeInit: Bridging Initialization Gap in Video Diffusion Models

Paper • 2312.07537 • Published Dec 12, 2023 • 26

Note 1. Gap between training and inference: the initial noise obtained by corrupting real videos remains temporally correlated in the low-frequency band, whereas inference starts from independent Gaussian noise.
2. FreeInit procedure
2.1 Initialize an independent Gaussian noise;
2.2 Run DDIM denoising to generate a clean video latent;
2.3 Obtain a noisy version of the video latent through forward diffusion;
2.4 Combine the low-frequency components of this noisy latent with the high-frequency components of random Gaussian noise;
2.5 Repeat.
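The loop can be sketched on a 1D signal along the time axis. This is a simplification under stated assumptions: an ideal low-pass mask on a naive 1D DFT replaces the spatio-temporal filter, `ddim_sample` and `add_noise` are hypothetical callables standing in for the sampler and the forward diffusion, and reusing the current initial noise in the forward diffusion is also an assumption.

```python
import cmath
import random

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def idft(spectrum):
    """Inverse DFT; the real part is returned (the spectra here are conjugate-symmetric)."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)).real / n
            for k in range(n)]

def mix_frequencies(noisy_latent, fresh_noise, cutoff):
    """Low frequencies (|f| <= cutoff) from the latent, the rest from fresh noise."""
    n = len(noisy_latent)
    lat, noi = dft(noisy_latent), dft(fresh_noise)
    mixed = [lat[f] if min(f, n - f) <= cutoff else noi[f] for f in range(n)]
    return idft(mixed)

def freeinit(ddim_sample, add_noise, num_frames, num_iters, cutoff, rng):
    noise = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]       # 2.1
    for _ in range(num_iters):                                      # 2.5
        latent = ddim_sample(noise)                                 # 2.2
        noisy = add_noise(latent, noise)                            # 2.3 (assumed noise choice)
        fresh = [rng.gauss(0.0, 1.0) for _ in range(num_frames)]
        noise = mix_frequencies(noisy, fresh, cutoff)               # 2.4
    return ddim_sample(noise)

rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(8)]
fresh = [rng.gauss(0.0, 1.0) for _ in range(8)]
mixed = mix_frequencies(latent, fresh, cutoff=1)
```

The symmetric mask `min(f, n - f)` keeps positive and negative frequencies together, so the mixed signal stays real-valued.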

black-forest-labs/FLUX.1-schnell

Text-to-Image • Updated Aug 16 • 1.72M • 2.76k

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Paper • 2403.03206 • Published Mar 5 • 56

Note Known as SD-3.
1. Change the distribution over t from uniform to one that gives more weight to intermediate timesteps by sampling them more frequently.
2. Use a ratio of 50% original and 50% synthetic captions.
3. MM-DiT
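One way to put more mass on intermediate timesteps is logit-normal sampling, one of the densities studied in the paper. The sketch below assumes location 0 and scale 1; the function name and the sample count are illustrative.

```python
import math
import random

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw u ~ N(mean, std^2) and map it to (0, 1) with a sigmoid."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(0)
timesteps = [sample_logit_normal(rng) for _ in range(10000)]

# Fraction of samples in the middle half of [0, 1]; uniform sampling would give about 0.5.
mid_fraction = sum(0.25 <= t <= 0.75 for t in timesteps) / len(timesteps)
```

Shifting `mean` biases sampling toward one end of the interval, and `std` controls how concentrated the samples are around the middle.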

On the Importance of Noise Scheduling for Diffusion Models

Paper • 2301.10972 • Published Jan 26, 2023 • 1

Note 1. When increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels). This is more important in video generation.
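One simple way to make a schedule noisier is to scale the input by a factor b < 1, so that x_t = sqrt(γ(t)) b x_0 + sqrt(1 - γ(t)) ε. The sketch below assumes a cosine schedule γ(t) = cos²(πt/2) and shows that the scaling shifts the log signal-to-noise ratio by 2 log b at every t; the function name and the value of b are illustrative.

```python
import math

def logsnr(t, b=1.0):
    """log-SNR of x_t = sqrt(gamma) * b * x_0 + sqrt(1 - gamma) * eps with a cosine gamma(t), 0 < t <= 1."""
    gamma = math.cos(0.5 * math.pi * t) ** 2
    return math.log(b * b * gamma / (1.0 - gamma))

baseline = logsnr(0.5)          # about 0 for the cosine schedule at t = 0.5
noisier = logsnr(0.5, b=0.5)    # same t, lower log-SNR
shift = noisier - baseline      # 2 * log(0.5)
```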

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Paper • 2402.14797 • Published Feb 22 • 19

Note 1. Argue that treating spatial and temporal modeling in a separable way causes motion artifacts, temporal inconsistencies, or generation of dynamic images rather than videos with vivid motion.

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

Paper • 2312.03641 • Published Dec 6, 2023 • 20

Note 1. Motion Brush?

Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models

Paper • 2404.07724 • Published Apr 11 • 12

Note 1. guidance is harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle.
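A minimal sketch of interval-limited guidance, assuming the convention in which a weight of 1 disables guidance and the noise level σ indexes the sampling chain. The interval bounds and function names are hypothetical.

```python
def guidance_weight(sigma, w, sigma_lo, sigma_hi):
    """Use weight w only for noise levels inside (sigma_lo, sigma_hi]; otherwise no guidance."""
    return w if sigma_lo < sigma <= sigma_hi else 1.0

def guided_prediction(cond, uncond, sigma, w, sigma_lo, sigma_hi):
    """Blend conditional and unconditional predictions with the interval-limited weight."""
    w_eff = guidance_weight(sigma, w, sigma_lo, sigma_hi)
    return [w_eff * c + (1.0 - w_eff) * u for c, u in zip(cond, uncond)]

cond, uncond = [1.0, 2.0], [0.5, 1.0]
early = guided_prediction(cond, uncond, sigma=10.0, w=3.0, sigma_lo=0.3, sigma_hi=2.0)
middle = guided_prediction(cond, uncond, sigma=1.0, w=3.0, sigma_lo=0.3, sigma_hi=2.0)
```

Outside the interval the unconditional model does not influence the result, so its evaluation can be skipped there.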

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Paper • 2410.06940 • Published 27 days ago • 4

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Paper • 2407.21705 • Published Jul 31 • 25

MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Paper • 2410.20280 • Published 10 days ago • 20

Note 1. For spatio-temporal attention, 2D RoPE is used for both the spatial and temporal positions. For the spatial part, inserting a learnable [NEXT] token to differentiate image patches across different rows is enough, so there is no need for 3D RoPE.
2. Dynamic-resolution training is not included in the main training stages. Instead, after convergence, fine-tuning the model for a few steps (10K-20K) with dynamic resolutions enables it.

Finite Scalar Quantization: VQ-VAE Made Simple

Paper • 2309.15505 • Published Sep 27, 2023 • 21

Note 1. Known as FSQ.
2.1 Achieves high codebook utilization by design (almost 100%).
2.2 Before FSQ, most of the literature used unbounded scalar quantization, in which the range of integers is not limited by the encoder but only by constraining the representation's entropy.
2.3 Vocab size: |C| = \prod_{i=1}^{d} L_i (= L^d when all d channels use the same number of levels L).
2.4 A simple heuristic that performs well in all considered tasks: use L_i ≥ 5 ∀i.
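A sketch of the bounded quantizer for odd numbers of levels: each channel is squashed with tanh, scaled to the integer range, and rounded, and the per-channel codes are combined into one index in mixed radix. Even level counts need an extra offset that is omitted here, the straight-through gradient is not shown, and the function names are illustrative.

```python
import math

def fsq_quantize(z, levels):
    """Round each bounded channel to one of L_i integer levels in [-(L_i // 2), L_i // 2] (odd L_i only)."""
    return [round((L // 2) * math.tanh(zi)) for zi, L in zip(z, levels)]

def fsq_index(codes, levels):
    """Combine per-channel codes into a single token index (mixed-radix encoding)."""
    index, base = 0, 1
    for code, L in zip(codes, levels):
        index += (code + L // 2) * base
        base *= L
    return index

levels = [5, 5, 5]
codes = fsq_quantize([0.0, 3.0, -3.0], levels)
token = fsq_index(codes, levels)
codebook_size = math.prod(levels)
```

Every combination of per-channel levels is a valid code, which is why the codebook size is simply the product of the level counts.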

In-Context LoRA for Diffusion Transformers

Paper • 2410.23775 • Published 6 days ago • 9

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Paper • 2410.13863 • Published 19 days ago • 35

Note 1. validation loss is a proxy for generation quality.