Diffusers

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

T2I-Adapter

T2I-Adapter is a lightweight adapter model that provides an additional conditioning input image (line art, canny, sketch, depth, pose) to better control image generation. It is similar to a ControlNet, but it is a lot smaller (~77M parameters and ~300MB file size) because its only inserts weights into the UNet instead of copying and training it.

The T2I-Adapter is only available for training with the Stable Diffusion XL (SDXL) model.

This guide will explore the train_t2i_adapter_sdxl.py training script to help you become familiar with it, and how you can adapt it for your own use-case.

Before running the script, make sure you install the library from source:

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

Then navigate to the example folder containing the training script and install the required dependencies for the script you’re using:

cd examples/t2i_adapter
pip install -r requirements.txt

🤗 Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It’ll automatically configure your training setup based on your hardware and environment. Take a look at the 🤗 Accelerate Quick tour to learn more.

Initialize an 🤗 Accelerate environment:

accelerate config

To setup a default 🤗 Accelerate environment without choosing any configurations:

accelerate config default

Or if your environment doesn’t support an interactive shell, like a notebook, you can use:

from accelerate.utils import write_basic_config

write_basic_config()

Lastly, if you want to train a model on your own dataset, take a look at the Create a dataset for training guide to learn how to create a dataset that works with the training script.

The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn’t cover every aspect of the script in detail. If you’re interested in learning more, feel free to read through the script and let us know if you have any questions or concerns.

Script parameters

The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the parse_args() function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you’d like.

For example, to activate gradient accumulation, add the --gradient_accumulation_steps parameter to the training command:

accelerate launch train_t2i_adapter_sdxl.py \
  ----gradient_accumulation_steps=4

Many of the basic and important parameters are described in the Text-to-image training guide, so this guide just focuses on the relevant T2I-Adapter parameters:

--pretrained_vae_model_name_or_path: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify a better VAE
--crops_coords_top_left_h and --crops_coords_top_left_w: height and width coordinates to include in SDXL’s crop coordinate embeddings
--conditioning_image_column: the column of the conditioning images in the dataset
--proportion_empty_prompts: the proportion of image prompts to replace with empty strings

Training script

As with the script parameters, a walkthrough of the training script is provided in the Text-to-image training guide. Instead, this guide takes a look at the T2I-Adapter relevant parts of the script.

The training script begins by preparing the dataset. This incudes tokenizing the prompt and applying transforms to the images and conditioning images.

conditioning_image_transforms = transforms.Compose(
    [
        transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(args.resolution),
        transforms.ToTensor(),
    ]
)

Within the main() function, the T2I-Adapter is either loaded from a pretrained adapter or it is randomly initialized:

if args.adapter_model_name_or_path:
    logger.info("Loading existing adapter weights.")
    t2iadapter = T2IAdapter.from_pretrained(args.adapter_model_name_or_path)
else:
    logger.info("Initializing t2iadapter weights.")
    t2iadapter = T2IAdapter(
        in_channels=3,
        channels=(320, 640, 1280, 1280),
        num_res_blocks=2,
        downscale_factor=16,
        adapter_type="full_adapter_xl",
    )

The optimizer is initialized for the T2I-Adapter parameters:

params_to_optimize = t2iadapter.parameters()
optimizer = optimizer_class(
    params_to_optimize,
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)

Lastly, in the training loop, the adapter conditioning image and the text embeddings are passed to the UNet to predict the noise residual:

t2iadapter_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
down_block_additional_residuals = t2iadapter(t2iadapter_image)
down_block_additional_residuals = [
    sample.to(dtype=weight_dtype) for sample in down_block_additional_residuals
]

model_pred = unet(
    inp_noisy_latents,
    timesteps,
    encoder_hidden_states=batch["prompt_ids"],
    added_cond_kwargs=batch["unet_added_conditions"],
    down_block_additional_residuals=down_block_additional_residuals,
).sample

If you want to learn more about how the training loop works, check out the Understanding pipelines, models and schedulers tutorial which breaks down the basic pattern of the denoising process.

Launch the script

Now you’re ready to launch the training script! 🚀

For this example training, you’ll use the fusing/fill50k dataset. You can also create and use your own dataset if you want (see the Create a dataset for training guide).

Set the environment variable MODEL_DIR to a model id on the Hub or a path to a local model and OUTPUT_DIR to where you want to save the model.

Download the following images to condition your training with:

wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png

To monitor training progress with Weights & Biases, add the --report_to=wandb parameter to the training command. You’ll also need to add the --validation_image, --validation_prompt, and --validation_steps to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results.

export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
export OUTPUT_DIR="path to save model"

accelerate launch train_t2i_adapter_sdxl.py \
 --pretrained_model_name_or_path=$MODEL_DIR \
 --output_dir=$OUTPUT_DIR \
 --dataset_name=fusing/fill50k \
 --mixed_precision="fp16" \
 --resolution=1024 \
 --learning_rate=1e-5 \
 --max_train_steps=15000 \
 --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \
 --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
 --validation_steps=100 \
 --train_batch_size=1 \
 --gradient_accumulation_steps=4 \
 --report_to="wandb" \
 --seed=42 \
 --push_to_hub

Once training is complete, you can use your T2I-Adapter for inference:

from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, EulerAncestralDiscreteSchedulerTest
from diffusers.utils import load_image
import torch

adapter = T2IAdapter.from_pretrained("path/to/adapter", torch_dtype=torch.float16)
pipeline = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16
)

pipeline.scheduler = EulerAncestralDiscreteSchedulerTest.from_config(pipe.scheduler.config)
pipeline.enable_xformers_memory_efficient_attention()
pipeline.enable_model_cpu_offload()

control_image = load_image("./conditioning_image_1.png")
prompt = "pale golden rod circle with old lace background"

generator = torch.manual_seed(0)
image = pipeline(
    prompt, image=control_image, generator=generator
).images[0]
image.save("./output.png")

Next steps

Congratulations on training a T2I-Adapter model! 🎉 To learn more:

Read the Efficient Controllable Generation for SDXL with T2I-Adapters blog post to learn more details about the experimental results from the T2I-Adapter team.

< > Update on GitHub

←ControlNet InstructPix2Pix→