BLIP-Diffusion
BLIP-Diffusion was proposed in BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing. It enables zero-shot subject-driven generation and control-guided zero-shot generation.
The abstract from the paper is:
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at this https URL.
The original codebase can be found at salesforce/LAVIS. You can find the official BLIP-Diffusion checkpoints under the hf.co/SalesForce organization.
BlipDiffusionPipeline and BlipDiffusionControlNetPipeline were contributed by ayushtues.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
BlipDiffusionPipeline
class diffusers.BlipDiffusionPipeline
( tokenizer: CLIPTokenizer text_encoder: ContextCLIPTextModel vae: AutoencoderKL unet: UNet2DConditionModel scheduler: PNDMScheduler qformer: Blip2QFormerModel image_processor: BlipImageProcessor ctx_begin_pos: int = 2 mean: typing.List[float] = None std: typing.List[float] = None )
Parameters
- tokenizer (CLIPTokenizer): Tokenizer for the text encoder.
- text_encoder (ContextCLIPTextModel): Text encoder to encode the text prompt.
- vae (AutoencoderKL): VAE model to map the latents to the image.
- unet (UNet2DConditionModel): Conditional U-Net architecture to denoise the image embedding.
- scheduler (PNDMScheduler): A scheduler to be used in combination with unet to generate image latents.
- qformer (Blip2QFormerModel): QFormer model to get multi-modal embeddings from the text and image.
- image_processor (BlipImageProcessor): Image processor to preprocess and postprocess the image.
- ctx_begin_pos (int, optional, defaults to 2): Position of the context token in the text encoder.
Pipeline for zero-shot subject-driven generation using BLIP-Diffusion.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
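To make the role of ctx_begin_pos more concrete, the sketch below shows one plausible way that subject embeddings produced by the Q-Former could be spliced into a prompt's token embeddings at that position. It is an illustration under stated assumptions, not the library's implementation: the helper name splice_context is hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List

Embedding = List[float]


def splice_context(
    token_embeds: List[Embedding],
    ctx_embeds: List[Embedding],
    ctx_begin_pos: int = 2,
) -> List[Embedding]:
    """Insert ctx_embeds into token_embeds starting at ctx_begin_pos.

    Hypothetical helper for illustration only; lists stand in for tensors.
    """
    if not 0 <= ctx_begin_pos <= len(token_embeds):
        raise ValueError("ctx_begin_pos must lie within the token sequence")
    prefix = token_embeds[:ctx_begin_pos]  # tokens before the subject context
    suffix = token_embeds[ctx_begin_pos:]  # remaining prompt tokens
    return prefix + ctx_embeds + suffix


# Toy 1-dimensional "embeddings" for four prompt tokens.
tokens = [[0.0], [1.0], [2.0], [3.0]]
# Two toy subject embeddings, standing in for Q-Former outputs.
subject = [[9.0], [9.5]]

spliced = splice_context(tokens, subject, ctx_begin_pos=2)
print(spliced)  # [[0.0], [1.0], [9.0], [9.5], [2.0], [3.0]]
```

With the default of 2, the subject embeddings land after the first two token positions, so the text that follows them is encoded with the subject representation already in context.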
__call__
( prompt: typing.List[str] reference_image: Image source_subject_category: typing.List[str] target_subject_category: typing.List[str] latents: typing.Optional[torch.Tensor] = None guidance_scale: float = 7.5 height: int = 512 width: int = 512 num_inference_steps: int = 50 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None neg_prompt: typing.Optional[str] = '' prompt_strength: float = 1.0 prompt_reps: int = 20 output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) → ImagePipelineOutput or tuple
Parameters
- prompt (List[str]): The prompt or prompts to guide the image generation.
- reference_image (PIL.Image.Image): The reference image to condition the generation on.
- source_subject_category (List[str]): The source subject category.
- target_subject_category (List[str]): The target subject category.
- latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by random sampling.
- guidance_scale (float, optional, defaults to 7.5): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
- height (int, optional, defaults to 512): The height of the generated image.
- width (int, optional, defaults to 512): The width of the generated image.
- num_inference_steps (int, optional, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
- neg_prompt (str, optional, defaults to ""): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- prompt_strength (float, optional, defaults to 1.0): The strength of the prompt. Together with prompt_reps, it determines the number of times the prompt is repeated to amplify it.
- prompt_reps (int, optional, defaults to 20): The number of times the prompt is repeated, together with prompt_strength, to amplify it.
- output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
- return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.
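The guidance_scale description above can be made concrete with a small sketch of the classifier-free guidance combination step, where w is the guidance scale. The helper name apply_guidance and the toy noise values are assumptions made for illustration; the pipeline itself works on tensors inside its denoising loop.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_text: List[float],
    guidance_scale: float = 7.5,
) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions.

    Uses the common classifier-free guidance form:
    uncond + w * (text - uncond).
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


# Toy noise predictions (hypothetical values).
uncond = [0.0, 1.0]
text = [1.0, 3.0]

# w = 1 reproduces the text-conditioned prediction (no extra guidance).
print(apply_guidance(uncond, text, guidance_scale=1.0))  # [1.0, 3.0]
# w = 2 pushes the prediction further along the text direction.
print(apply_guidance(uncond, text, guidance_scale=2.0))  # [2.0, 5.0]
```

This also shows why guidance only takes effect for guidance_scale > 1: at exactly 1, the result equals the text-conditioned prediction.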
Returns
ImagePipelineOutput or tuple
Function invoked when calling the pipeline for generation.
Examples:
>>> from diffusers.pipelines import BlipDiffusionPipeline
>>> from diffusers.utils import load_image
>>> import torch
>>> blip_diffusion_pipe = BlipDiffusionPipeline.from_pretrained(
... "Salesforce/blipdiffusion", torch_dtype=torch.float16
... ).to("cuda")
>>> cond_subject = "dog"
>>> tgt_subject = "dog"
>>> text_prompt_input = "swimming underwater"
>>> cond_image = load_image(
... "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 25
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
>>> output = blip_diffusion_pipe(
... text_prompt_input,
... cond_image,
... cond_subject,
... tgt_subject,
... guidance_scale=guidance_scale,
... num_inference_steps=num_inference_steps,
... neg_prompt=negative_prompt,
... height=512,
... width=512,
... ).images
>>> output[0].save("image.png")
BlipDiffusionControlNetPipeline
class diffusers.BlipDiffusionControlNetPipeline
( tokenizer: CLIPTokenizer text_encoder: ContextCLIPTextModel vae: AutoencoderKL unet: UNet2DConditionModel scheduler: PNDMScheduler qformer: Blip2QFormerModel controlnet: ControlNetModel image_processor: BlipImageProcessor ctx_begin_pos: int = 2 mean: typing.List[float] = None std: typing.List[float] = None )
Parameters
- tokenizer (CLIPTokenizer): Tokenizer for the text encoder.
- text_encoder (ContextCLIPTextModel): Text encoder to encode the text prompt.
- vae (AutoencoderKL): VAE model to map the latents to the image.
- unet (UNet2DConditionModel): Conditional U-Net architecture to denoise the image embedding.
- scheduler (PNDMScheduler): A scheduler to be used in combination with unet to generate image latents.
- qformer (Blip2QFormerModel): QFormer model to get multi-modal embeddings from the text and image.
- controlnet (ControlNetModel): ControlNet model to get the conditioning image embedding.
- image_processor (BlipImageProcessor): Image processor to preprocess and postprocess the image.
- ctx_begin_pos (int, optional, defaults to 2): Position of the context token in the text encoder.
Pipeline for Canny-edge-based controlled subject-driven generation using BLIP-Diffusion.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
__call__
( prompt: typing.List[str] reference_image: Image condtioning_image: Image source_subject_category: typing.List[str] target_subject_category: typing.List[str] latents: typing.Optional[torch.Tensor] = None guidance_scale: float = 7.5 height: int = 512 width: int = 512 num_inference_steps: int = 50 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None neg_prompt: typing.Optional[str] = '' prompt_strength: float = 1.0 prompt_reps: int = 20 output_type: typing.Optional[str] = 'pil' return_dict: bool = True ) → ImagePipelineOutput or tuple
Parameters
- prompt (List[str]): The prompt or prompts to guide the image generation.
- reference_image (PIL.Image.Image): The reference image to condition the generation on.
- condtioning_image (PIL.Image.Image): The Canny edge image to condition the generation on.
- source_subject_category (List[str]): The source subject category.
- target_subject_category (List[str]): The target subject category.
- latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by random sampling.
- guidance_scale (float, optional, defaults to 7.5): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
- height (int, optional, defaults to 512): The height of the generated image.
- width (int, optional, defaults to 512): The width of the generated image.
- seed (int, optional, defaults to 42): The seed to use for random generation. Note that seed does not appear in the call signature above, which exposes generator for controlling randomness.
- num_inference_steps (int, optional, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
- neg_prompt (str, optional, defaults to ""): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- prompt_strength (float, optional, defaults to 1.0): The strength of the prompt. Together with prompt_reps, it determines the number of times the prompt is repeated to amplify it.
- prompt_reps (int, optional, defaults to 20): The number of times the prompt is repeated, together with prompt_strength, to amplify it.
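As a rough illustration of how prompt_strength and prompt_reps might work together, the sketch below prefixes the prompt with the target subject category and repeats it int(prompt_strength * prompt_reps) times. The helper build_prompt and the exact string format are assumptions made for illustration, not a statement of the pipeline's internals.

```python
def build_prompt(
    prompt: str,
    target_subject: str,
    prompt_strength: float = 1.0,
    prompt_reps: int = 20,
) -> str:
    """Repeat a subject-prefixed prompt to amplify its influence.

    Hypothetical helper: the repetition count is assumed to be
    int(prompt_strength * prompt_reps).
    """
    base = f"a {target_subject} {prompt.strip()}"
    repeats = int(prompt_strength * prompt_reps)
    return ", ".join([base] * repeats)


# With prompt_reps=2 the amplified prompt is short enough to read.
amplified = build_prompt("on a marble table", "teapot", prompt_reps=2)
print(amplified)
# a teapot on a marble table, a teapot on a marble table

# Halving prompt_strength halves the number of repetitions.
weaker = build_prompt("on a marble table", "teapot", prompt_strength=0.5)
print(weaker.count("a teapot"))  # 10
```

Under this reading, the two parameters only matter through their product, so lowering prompt_strength is a convenient way to soften the text prompt relative to the subject without changing prompt_reps.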
Returns
ImagePipelineOutput or tuple
Function invoked when calling the pipeline for generation.
Examples:
>>> from diffusers.pipelines import BlipDiffusionControlNetPipeline
>>> from diffusers.utils import load_image
>>> from controlnet_aux import CannyDetector
>>> import torch
>>> blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
... "Salesforce/blipdiffusion-controlnet", torch_dtype=torch.float16
... ).to("cuda")
>>> style_subject = "flower"
>>> tgt_subject = "teapot"
>>> text_prompt = "on a marble table"
>>> cldm_cond_image = load_image(
... "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
... ).resize((512, 512))
>>> canny = CannyDetector()
>>> cldm_cond_image = canny(cldm_cond_image, 30, 70, output_type="pil")
>>> style_image = load_image(
... "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 50
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"
>>> output = blip_diffusion_pipe(
... text_prompt,
... style_image,
... cldm_cond_image,
... style_subject,
... tgt_subject,
... guidance_scale=guidance_scale,
... num_inference_steps=num_inference_steps,
... neg_prompt=negative_prompt,
... height=512,
... width=512,
... ).images
>>> output[0].save("image.png")