Text-to-Video
Diffusers
longlian committed
Commit 25346f3 (0 parents)
.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,65 @@
+ ---
+ tags:
+ - text-to-video
+ duplicated_from: cerspense/zeroscope_v2_576w
+ ---
+
+ # LLM-grounded Video Diffusion Models
+ [Long Lian](https://tonylian.com/), [Baifeng Shi](https://bfshi.github.io/), [Adam Yala](https://www.adamyala.org/), [Trevor Darrell](https://people.eecs.berkeley.edu/~trevor/), [Boyi Li](https://sites.google.com/site/boyilics/home) at UC Berkeley/UCSF. **ICLR 2024**.
+
+ [Project Page](https://llm-grounded-video-diffusion.github.io/) | [Related Project: LMD](https://llm-grounded-diffusion.github.io/) | [Citation](https://llm-grounded-video-diffusion.github.io/#citation)
+
+ This model is based on [zeroscope](https://huggingface.co/cerspense/zeroscope_v2_576w), with additional bounding-box conditioning in a [GLIGEN](https://gligen.github.io/) fashion.
+
+ Similar to [LLM-grounded Diffusion (LMD)](https://llm-grounded-diffusion.github.io/), the boxes-to-video stage of LLM-grounded Video Diffusion (LVD) supports cross-attention-based bounding box conditioning, which uses Zeroscope off-the-shelf. This Hugging Face model offers an alternative: we train GLIGEN adapters (i.e., gated transformer adapters) on [SA-1B](https://ai.meta.com/datasets/segment-anything/) using Zeroscope's weights without the temporal transformer blocks, treating it as an SD v2.1 model fine-tuned to 256x256 resolution. We then merge the adapters back into Zeroscope to provide box conditioning; the merged weights are what this repository contains. The resulting model can be combined with cross-attention-based conditioning or used on its own, similar to [LMD+](https://github.com/TonyLianLong/LLM-groundedDiffusion), and it can be driven by the LLM-based text-to-dynamic-scene-layout generator in LVD or used standalone as a video version of GLIGEN.
+
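+ ## Usage sketch with 🧨 Diffusers
+
+ Below is a minimal, hedged sketch of how the grounded pipeline shipped in this repo (`lvd_pipeline.py`, `GroundedTextToVideoSDPipeline`) can be invoked. The repo id is a placeholder, the loading path assumes Diffusers' remote-code mechanism resolves the custom pipeline and UNet classes declared in `model_index.json`, and the per-frame box trajectory is made up for illustration. Boxes are `[xmin, ymin, xmax, ymax]` in `[0, 1]`, with one list of boxes and one list of phrases per frame.
+
+ ```py
+ import torch
+ from diffusers import DiffusionPipeline
+ from diffusers.utils import export_to_video
+
+ # Placeholder repo id; trust_remote_code lets Diffusers import GroundedTextToVideoSDPipeline
+ # and GroundedUNet3DConditionModel from the Python files in this repository.
+ pipe = DiffusionPipeline.from_pretrained(
+     "<this-repo-id>", trust_remote_code=True, torch_dtype=torch.float16
+ )
+ pipe.enable_model_cpu_offload()
+
+ num_frames = 16
+ # One phrase list and one box list per frame; here a single "a brown bear" box
+ # slides from left to right over the clip (a hypothetical layout).
+ phrases = [["a brown bear"] for _ in range(num_frames)]
+ boxes = [
+     [[0.05 + 0.5 * i / (num_frames - 1), 0.4, 0.45 + 0.5 * i / (num_frames - 1), 0.9]]
+     for i in range(num_frames)
+ ]
+
+ video_frames = pipe(
+     prompt="a brown bear walking on grass",
+     lvd_gligen_phrases=phrases,
+     lvd_gligen_boxes=boxes,
+     lvd_gligen_scheduled_sampling_beta=0.3,
+     num_frames=num_frames,
+     height=320,
+     width=576,
+     num_inference_steps=40,
+ ).frames
+ video_path = export_to_video(video_frames)
+ ```
+
+ In practice, the per-frame box trajectories would come from the LLM-based layout generator in [LVD](https://llm-grounded-video-diffusion.github.io/) rather than being hand-written.
+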
+ ## Citation (LVD)
+ If you use our work, this model, or our implementation in this repo, or find them helpful, please consider giving a citation.
+ ```
+ @article{lian2023llmgroundedvideo,
+     title={LLM-grounded Video Diffusion Models},
+     author={Lian, Long and Shi, Baifeng and Yala, Adam and Darrell, Trevor and Li, Boyi},
+     journal={arXiv preprint arXiv:2309.17444},
+     year={2023},
+ }
+
+ @article{lian2023llmgrounded,
+     title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models},
+     author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
+     journal={arXiv preprint arXiv:2305.13655},
+     year={2023}
+ }
+ ```
+
+ ## Citation (GLIGEN)
+ The adapters in this model are trained in a manner similar to the original GLIGEN adapters.
+ ```
+ @article{li2023gligen,
+     title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
+     author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
+     journal={CVPR},
+     year={2023}
+ }
+ ```
+
+ ## Citation (ModelScope)
+ ModelScope is LVD's base model.
+
+ ```
+ @article{wang2023modelscope,
+     title={Modelscope text-to-video technical report},
+     author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
+     journal={arXiv preprint arXiv:2308.06571},
+     year={2023}
+ }
+ @InProceedings{VideoFusion,
+     author    = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
+     title     = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
+     booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+     month     = {June},
+     year      = {2023}
+ }
+ ```
+
+ ## LICENSE
+ Zeroscope follows the CC-BY-NC 4.0 license. The GLIGEN adapters are trained on SA-1B, which follows the [SA-1B license](https://ai.meta.com/datasets/segment-anything/).
README_old.md ADDED
@@ -0,0 +1,64 @@
+ ---
+ pipeline_tag: text-to-video
+ license: cc-by-nc-4.0
+ ---
+
+ ![model example](https://i.imgur.com/1mrNnh8.png)
+
+ # zeroscope_v2 576w
+ A watermark-free Modelscope-based video model optimized for producing high-quality 16:9 compositions and smooth video output. This model was trained from the [original weights](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis) using 9,923 clips and 29,769 tagged frames at 24 frames, 576x320 resolution.<br />
+ zeroscope_v2_576w is specifically designed for upscaling with [zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL) using vid2vid in the [1111 text2video](https://github.com/kabachuha/sd-webui-text2video) extension by [kabachuha](https://github.com/kabachuha). Using this model as a preliminary step allows for superior overall compositions at higher resolutions in zeroscope_v2_XL, permitting faster exploration at 576x320 before transitioning to a high-resolution render. See some [example outputs](https://www.youtube.com/watch?v=HO3APT_0UA4) that have been upscaled to 1024x576 using zeroscope_v2_XL (courtesy of [dotsimulate](https://www.instagram.com/dotsimulate/)).<br />
+
+ zeroscope_v2_576w uses 7.9 GB of VRAM when rendering 30 frames at 576x320.
+
+ ### Using it with the 1111 text2video extension
+
+ 1. Download the files in the zs2_576w folder.
+ 2. Replace the respective files in the 'stable-diffusion-webui\models\ModelScope\t2v' directory.
+
+ ### Upscaling recommendations
+
+ For upscaling, it's recommended to use [zeroscope_v2_XL](https://huggingface.co/cerspense/zeroscope_v2_XL) via vid2vid in the 1111 extension. It works best at 1024x576 with a denoise strength between 0.66 and 0.85. Remember to use the same prompt that was used to generate the original clip. A Diffusers-based upscaling sketch is shown after the example results below.<br />
+
+ ### Usage in 🧨 Diffusers
+
+ Let's first install the required libraries:
+
+ ```bash
+ $ pip install diffusers transformers accelerate torch
+ ```
+
+ Now, generate a video:
+
+ ```py
+ import torch
+ from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
+ from diffusers.utils import export_to_video
+
+ pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
+ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
+ pipe.enable_model_cpu_offload()
+
+ prompt = "Darth Vader is surfing on waves"
+ video_frames = pipe(prompt, num_inference_steps=40, height=320, width=576, num_frames=24).frames
+ video_path = export_to_video(video_frames)
+ ```
+
+ Here are some results:
+
+ <table>
+     <tr>
+         <td align="center">
+             Darth Vader is surfing on waves.
+             <br>
+             <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/darthvader_cerpense.gif"
+                 alt="Darth Vader surfing on waves."
+                 style="width: 576px;" />
+         </td>
+     </tr>
+ </table>
+
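+ ### Upscaling in 🧨 Diffusers (sketch)
+
+ The upscaling recommendation above (vid2vid with zeroscope_v2_XL at 1024x576, denoise roughly 0.66-0.85) can also be approximated directly in Diffusers rather than the 1111 extension. This is a minimal sketch, assuming Diffusers' `VideoToVideoSDPipeline` works with the `cerspense/zeroscope_v2_XL` checkpoint; it reuses `prompt` and `video_frames` from the example above, with `strength` playing the role of the denoise strength.
+
+ ```py
+ import torch
+ from PIL import Image
+ from diffusers import VideoToVideoSDPipeline, DPMSolverMultistepScheduler
+ from diffusers.utils import export_to_video
+
+ # Load the XL checkpoint for vid2vid upscaling (assumed to load into VideoToVideoSDPipeline).
+ xl_pipe = VideoToVideoSDPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
+ xl_pipe.scheduler = DPMSolverMultistepScheduler.from_config(xl_pipe.scheduler.config)
+ xl_pipe.enable_model_cpu_offload()
+
+ # Resize the 576x320 frames from the example above to 1024x576 and re-render
+ # with the same prompt; strength ~0.6-0.85 acts like the denoise strength.
+ video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]
+ upscaled_frames = xl_pipe(prompt, video=video, strength=0.7).frames
+ upscaled_path = export_to_video(upscaled_frames)
+ ```
+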
+ ### Known issues
+
+ Lower resolutions or fewer frames could lead to suboptimal output.<br />
+
+ Thanks to [camenduru](https://github.com/camenduru), [kabachuha](https://github.com/kabachuha), [ExponentialML](https://github.com/ExponentialML), [dotsimulate](https://www.instagram.com/dotsimulate/), [VANYA](https://twitter.com/veryVANYA), [polyware](https://twitter.com/polyware_ai), [tin2tin](https://github.com/tin2tin).<br />
lvd_pipeline.py ADDED
@@ -0,0 +1,872 @@
1
+ # Copyright 2024 LLM-grounded Video Diffusion Models (LVD) Team and The HuggingFace Team. All rights reserved.
2
+ # Copyright 2024 Alibaba DAMO-VILAB and The HuggingFace Team. All rights reserved.
3
+ # Copyright 2024 The ModelScope Team.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ import inspect
18
+ import warnings
19
+ from typing import Any, Callable, Dict, List, Optional, Union
20
+
21
+ import torch
22
+ import numpy as np
23
+ from diffusers.loaders import LoraLoaderMixin, TextualInversionLoaderMixin
24
+ from diffusers.models import AutoencoderKL
25
+ from diffusers.models.attention import GatedSelfAttentionDense
26
+ from diffusers.models.lora import adjust_lora_scale_text_encoder
27
+ from diffusers.models.unets import UNet3DConditionModel
28
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline
29
+ from diffusers.pipelines.text_to_video_synthesis import \
30
+ TextToVideoSDPipelineOutput
31
+ from diffusers.schedulers import KarrasDiffusionSchedulers
32
+ from diffusers.utils import (USE_PEFT_BACKEND, deprecate, logging,
33
+ replace_example_docstring, scale_lora_layers,
34
+ unscale_lora_layers)
35
+ from diffusers.utils.torch_utils import randn_tensor
36
+ from transformers import CLIPTextModel, CLIPTokenizer
37
+
38
+ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
39
+
40
+ EXAMPLE_DOC_STRING = """
41
+ Examples:
42
+ ```py
43
+ >>> import torch
44
+ >>> from diffusers import TextToVideoSDPipeline
45
+ >>> from diffusers.utils import export_to_video
46
+
47
+ >>> pipe = TextToVideoSDPipeline.from_pretrained(
48
+ ... "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
49
+ ... )
50
+ >>> pipe.enable_model_cpu_offload()
51
+
52
+ >>> prompt = "Spiderman is surfing"
53
+ >>> video_frames = pipe(prompt).frames
54
+ >>> video_path = export_to_video(video_frames)
55
+ >>> video_path
56
+ ```
57
+ """
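+ # Note: the docstring example above demonstrates the plain TextToVideoSDPipeline. The
+ # GroundedTextToVideoSDPipeline defined below additionally accepts per-frame grounding inputs in
+ # __call__: `lvd_gligen_phrases` (one list of phrases per frame), `lvd_gligen_boxes` (one list of
+ # [xmin, ymin, xmax, ymax] boxes in [0, 1] per frame), and `lvd_gligen_scheduled_sampling_beta`
+ # (the fraction of denoising steps during which the GLIGEN gated self-attention fusers stay enabled).
+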
58
+
59
+
60
+ def tensor2vid(video: torch.Tensor, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) -> List[np.ndarray]:
61
+ # This code is copied from https://github.com/modelscope/modelscope/blob/1509fdb973e5871f37148a4b5e5964cafd43e64d/modelscope/pipelines/multi_modal/text_to_video_synthesis_pipeline.py#L78
62
+ # reshape to ncfhw
63
+ mean = torch.tensor(mean, device=video.device).reshape(1, -1, 1, 1, 1)
64
+ std = torch.tensor(std, device=video.device).reshape(1, -1, 1, 1, 1)
65
+ # unnormalize back to [0,1]
66
+ video = video.mul_(std).add_(mean)
67
+ video.clamp_(0, 1)
68
+ # prepare the final outputs
69
+ i, c, f, h, w = video.shape
70
+ images = video.permute(2, 3, 0, 4, 1).reshape(
71
+ f, h, i * w, c
72
+ ) # 1st (frames, h, batch_size, w, c) 2nd (frames, h, batch_size * w, c)
73
+ # prepare a list of individual (consecutive) frames
74
+ images = images.unbind(dim=0)
75
+ images = [(image.cpu().numpy() * 255).astype("uint8")
76
+ for image in images] # f h w c
77
+ return images
78
+
79
+
80
+ class GroundedTextToVideoSDPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin):
81
+ r"""
82
+ Pipeline for text-to-video generation.
83
+
84
+ This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
85
+ implemented for all pipelines (downloading, saving, running on a particular device, etc.).
86
+
87
+ The pipeline also inherits the following loading methods:
88
+ - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings
89
+ - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights
90
+ - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights
91
+
92
+ Args:
93
+ vae ([`AutoencoderKL`]):
94
+ Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
95
+ text_encoder ([`CLIPTextModel`]):
96
+ Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
97
+ tokenizer (`CLIPTokenizer`):
98
+ A [`~transformers.CLIPTokenizer`] to tokenize text.
99
+ unet ([`UNet3DConditionModel`]):
100
+ A [`UNet3DConditionModel`] to denoise the encoded video latents.
101
+ scheduler ([`SchedulerMixin`]):
102
+ A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
103
+ [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
104
+ """
105
+
106
+ model_cpu_offload_seq = "text_encoder->unet->vae"
107
+
108
+ def __init__(
109
+ self,
110
+ vae: AutoencoderKL,
111
+ text_encoder: CLIPTextModel,
112
+ tokenizer: CLIPTokenizer,
113
+ unet: UNet3DConditionModel,
114
+ scheduler: KarrasDiffusionSchedulers,
115
+ ):
116
+ super().__init__()
117
+
118
+ self.register_modules(
119
+ vae=vae,
120
+ text_encoder=text_encoder,
121
+ tokenizer=tokenizer,
122
+ unet=unet,
123
+ scheduler=scheduler,
124
+ )
125
+ self.vae_scale_factor = 2 ** (
126
+ len(self.vae.config.block_out_channels) - 1)
127
+
128
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing
129
+ def enable_vae_slicing(self):
130
+ r"""
131
+ Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
132
+ compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
133
+ """
134
+ self.vae.enable_slicing()
135
+
136
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing
137
+ def disable_vae_slicing(self):
138
+ r"""
139
+ Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
140
+ computing decoding in one step.
141
+ """
142
+ self.vae.disable_slicing()
143
+
144
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling
145
+ def enable_vae_tiling(self):
146
+ r"""
147
+ Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
148
+ compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
149
+ processing larger images.
150
+ """
151
+ self.vae.enable_tiling()
152
+
153
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling
154
+ def disable_vae_tiling(self):
155
+ r"""
156
+ Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
157
+ computing decoding in one step.
158
+ """
159
+ self.vae.disable_tiling()
160
+
161
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline._encode_prompt
162
+ def _encode_prompt(
163
+ self,
164
+ prompt,
165
+ device,
166
+ num_images_per_prompt,
167
+ do_classifier_free_guidance,
168
+ negative_prompt=None,
169
+ prompt_embeds: Optional[torch.FloatTensor] = None,
170
+ negative_prompt_embeds: Optional[torch.FloatTensor] = None,
171
+ lora_scale: Optional[float] = None,
172
+ **kwargs,
173
+ ):
174
+ deprecation_message = "`_encode_prompt()` is deprecated and it will be removed in a future version. Use `encode_prompt()` instead. Also, be aware that the output format changed from a concatenated tensor to a tuple."
175
+ deprecate("_encode_prompt()", "1.0.0",
176
+ deprecation_message, standard_warn=False)
177
+
178
+ prompt_embeds_tuple = self.encode_prompt(
179
+ prompt=prompt,
180
+ device=device,
181
+ num_images_per_prompt=num_images_per_prompt,
182
+ do_classifier_free_guidance=do_classifier_free_guidance,
183
+ negative_prompt=negative_prompt,
184
+ prompt_embeds=prompt_embeds,
185
+ negative_prompt_embeds=negative_prompt_embeds,
186
+ lora_scale=lora_scale,
187
+ **kwargs,
188
+ )
189
+
190
+ # concatenate for backwards comp
191
+ prompt_embeds = torch.cat(
192
+ [prompt_embeds_tuple[1], prompt_embeds_tuple[0]])
193
+
194
+ return prompt_embeds
195
+
196
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.encode_prompt
197
+ def encode_prompt(
198
+ self,
199
+ prompt,
200
+ device,
201
+ num_images_per_prompt,
202
+ do_classifier_free_guidance,
203
+ negative_prompt=None,
204
+ prompt_embeds: Optional[torch.FloatTensor] = None,
205
+ negative_prompt_embeds: Optional[torch.FloatTensor] = None,
206
+ lora_scale: Optional[float] = None,
207
+ clip_skip: Optional[int] = None,
208
+ ):
209
+ r"""
210
+ Encodes the prompt into text encoder hidden states.
211
+
212
+ Args:
213
+ prompt (`str` or `List[str]`, *optional*):
214
+ prompt to be encoded
215
+ device: (`torch.device`):
216
+ torch device
217
+ num_images_per_prompt (`int`):
218
+ number of images that should be generated per prompt
219
+ do_classifier_free_guidance (`bool`):
220
+ whether to use classifier free guidance or not
221
+ negative_prompt (`str` or `List[str]`, *optional*):
222
+ The prompt or prompts not to guide the image generation. If not defined, one has to pass
223
+ `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
224
+ less than `1`).
225
+ prompt_embeds (`torch.FloatTensor`, *optional*):
226
+ Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
227
+ provided, text embeddings will be generated from `prompt` input argument.
228
+ negative_prompt_embeds (`torch.FloatTensor`, *optional*):
229
+ Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
230
+ weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
231
+ argument.
232
+ lora_scale (`float`, *optional*):
233
+ A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
234
+ clip_skip (`int`, *optional*):
235
+ Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
236
+ the output of the pre-final layer will be used for computing the prompt embeddings.
237
+ """
238
+ # set lora scale so that monkey patched LoRA
239
+ # function of text encoder can correctly access it
240
+ if lora_scale is not None and isinstance(self, LoraLoaderMixin):
241
+ self._lora_scale = lora_scale
242
+
243
+ # dynamically adjust the LoRA scale
244
+ if not USE_PEFT_BACKEND:
245
+ adjust_lora_scale_text_encoder(self.text_encoder, lora_scale)
246
+ else:
247
+ scale_lora_layers(self.text_encoder, lora_scale)
248
+
249
+ if prompt is not None and isinstance(prompt, str):
250
+ batch_size = 1
251
+ elif prompt is not None and isinstance(prompt, list):
252
+ batch_size = len(prompt)
253
+ else:
254
+ batch_size = prompt_embeds.shape[0]
255
+
256
+ if prompt_embeds is None:
257
+ # textual inversion: process multi-vector tokens if necessary
258
+ if isinstance(self, TextualInversionLoaderMixin):
259
+ prompt = self.maybe_convert_prompt(prompt, self.tokenizer)
260
+
261
+ text_inputs = self.tokenizer(
262
+ prompt,
263
+ padding="max_length",
264
+ max_length=self.tokenizer.model_max_length,
265
+ truncation=True,
266
+ return_tensors="pt",
267
+ )
268
+ text_input_ids = text_inputs.input_ids
269
+ untruncated_ids = self.tokenizer(
270
+ prompt, padding="longest", return_tensors="pt").input_ids
271
+
272
+ if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
273
+ text_input_ids, untruncated_ids
274
+ ):
275
+ removed_text = self.tokenizer.batch_decode(
276
+ untruncated_ids[:, self.tokenizer.model_max_length - 1: -1]
277
+ )
278
+ logger.warning(
279
+ "The following part of your input was truncated because CLIP can only handle sequences up to"
280
+ f" {self.tokenizer.model_max_length} tokens: {removed_text}"
281
+ )
282
+
283
+ if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask:
284
+ attention_mask = text_inputs.attention_mask.to(device)
285
+ else:
286
+ attention_mask = None
287
+
288
+ if clip_skip is None:
289
+ prompt_embeds = self.text_encoder(
290
+ text_input_ids.to(device), attention_mask=attention_mask)
291
+ prompt_embeds = prompt_embeds[0]
292
+ else:
293
+ prompt_embeds = self.text_encoder(
294
+ text_input_ids.to(device), attention_mask=attention_mask, output_hidden_states=True
295
+ )
296
+ # Access the `hidden_states` first, that contains a tuple of
297
+ # all the hidden states from the encoder layers. Then index into
298
+ # the tuple to access the hidden states from the desired layer.
299
+ prompt_embeds = prompt_embeds[-1][-(clip_skip + 1)]
300
+ # We also need to apply the final LayerNorm here to not mess with the
301
+ # representations. The `last_hidden_states` that we typically use for
302
+ # obtaining the final prompt representations passes through the LayerNorm
303
+ # layer.
304
+ prompt_embeds = self.text_encoder.text_model.final_layer_norm(
305
+ prompt_embeds)
306
+
307
+ if self.text_encoder is not None:
308
+ prompt_embeds_dtype = self.text_encoder.dtype
309
+ elif self.unet is not None:
310
+ prompt_embeds_dtype = self.unet.dtype
311
+ else:
312
+ prompt_embeds_dtype = prompt_embeds.dtype
313
+
314
+ prompt_embeds = prompt_embeds.to(
315
+ dtype=prompt_embeds_dtype, device=device)
316
+
317
+ bs_embed, seq_len, _ = prompt_embeds.shape
318
+ # duplicate text embeddings for each generation per prompt, using mps friendly method
319
+ prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
320
+ prompt_embeds = prompt_embeds.view(
321
+ bs_embed * num_images_per_prompt, seq_len, -1)
322
+
323
+ # get unconditional embeddings for classifier free guidance
324
+ if do_classifier_free_guidance and negative_prompt_embeds is None:
325
+ uncond_tokens: List[str]
326
+ if negative_prompt is None:
327
+ uncond_tokens = [""] * batch_size
328
+ elif prompt is not None and type(prompt) is not type(negative_prompt):
329
+ raise TypeError(
330
+ f"`negative_prompt` should be the same type as `prompt`, but got {type(negative_prompt)} !="
331
+ f" {type(prompt)}."
332
+ )
333
+ elif isinstance(negative_prompt, str):
334
+ uncond_tokens = [negative_prompt]
335
+ elif batch_size != len(negative_prompt):
336
+ raise ValueError(
337
+ f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
338
+ f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
339
+ " the batch size of `prompt`."
340
+ )
341
+ else:
342
+ uncond_tokens = negative_prompt
343
+
344
+ # textual inversion: process multi-vector tokens if necessary
345
+ if isinstance(self, TextualInversionLoaderMixin):
346
+ uncond_tokens = self.maybe_convert_prompt(
347
+ uncond_tokens, self.tokenizer)
348
+
349
+ max_length = prompt_embeds.shape[1]
350
+ uncond_input = self.tokenizer(
351
+ uncond_tokens,
352
+ padding="max_length",
353
+ max_length=max_length,
354
+ truncation=True,
355
+ return_tensors="pt",
356
+ )
357
+
358
+ if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask:
359
+ attention_mask = uncond_input.attention_mask.to(device)
360
+ else:
361
+ attention_mask = None
362
+
363
+ negative_prompt_embeds = self.text_encoder(
364
+ uncond_input.input_ids.to(device),
365
+ attention_mask=attention_mask,
366
+ )
367
+ negative_prompt_embeds = negative_prompt_embeds[0]
368
+
369
+ if do_classifier_free_guidance:
370
+ # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
371
+ seq_len = negative_prompt_embeds.shape[1]
372
+
373
+ negative_prompt_embeds = negative_prompt_embeds.to(
374
+ dtype=prompt_embeds_dtype, device=device)
375
+
376
+ negative_prompt_embeds = negative_prompt_embeds.repeat(
377
+ 1, num_images_per_prompt, 1)
378
+ negative_prompt_embeds = negative_prompt_embeds.view(
379
+ batch_size * num_images_per_prompt, seq_len, -1)
380
+
381
+ if isinstance(self, LoraLoaderMixin) and USE_PEFT_BACKEND:
382
+ # Retrieve the original scale by scaling back the LoRA layers
383
+ unscale_lora_layers(self.text_encoder, lora_scale)
384
+
385
+ return prompt_embeds, negative_prompt_embeds
386
+
387
+ def decode_latents(self, latents):
388
+ latents = 1 / self.vae.config.scaling_factor * latents
389
+
390
+ batch_size, channels, num_frames, height, width = latents.shape
391
+ latents = latents.permute(0, 2, 1, 3, 4).reshape(
392
+ batch_size * num_frames, channels, height, width)
393
+
394
+ image = self.vae.decode(latents).sample
395
+ video = (
396
+ image[None, :]
397
+ .reshape(
398
+ (
399
+ batch_size,
400
+ num_frames,
401
+ -1,
402
+ )
403
+ + image.shape[2:]
404
+ )
405
+ .permute(0, 2, 1, 3, 4)
406
+ )
407
+ # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
408
+ video = video.float()
409
+ return video
410
+
411
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
412
+ def prepare_extra_step_kwargs(self, generator, eta):
413
+ # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
414
+ # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
415
+ # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
416
+ # and should be between [0, 1]
417
+
418
+ accepts_eta = "eta" in set(inspect.signature(
419
+ self.scheduler.step).parameters.keys())
420
+ extra_step_kwargs = {}
421
+ if accepts_eta:
422
+ extra_step_kwargs["eta"] = eta
423
+
424
+ # check if the scheduler accepts generator
425
+ accepts_generator = "generator" in set(
426
+ inspect.signature(self.scheduler.step).parameters.keys())
427
+ if accepts_generator:
428
+ extra_step_kwargs["generator"] = generator
429
+ return extra_step_kwargs
430
+
431
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.check_inputs
432
+ def check_inputs(
433
+ self,
434
+ prompt,
435
+ height,
436
+ width,
437
+ callback_steps,
438
+ lvd_gligen_phrases,
439
+ lvd_gligen_boxes,
440
+ negative_prompt=None,
441
+ prompt_embeds=None,
442
+ negative_prompt_embeds=None,
443
+ num_frames=None,
444
+ callback_on_step_end_tensor_inputs=None,
445
+ ):
446
+ if height % 8 != 0 or width % 8 != 0:
447
+ raise ValueError(
448
+ f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
449
+
450
+ if callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0):
451
+ raise ValueError(
452
+ f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
453
+ f" {type(callback_steps)}."
454
+ )
455
+ if callback_on_step_end_tensor_inputs is not None and not all(
456
+ k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
457
+ ):
458
+ raise ValueError(
459
+ f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
460
+ )
461
+
462
+ if prompt is not None and prompt_embeds is not None:
463
+ raise ValueError(
464
+ f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
465
+ " only forward one of the two."
466
+ )
467
+ elif prompt is None and prompt_embeds is None:
468
+ raise ValueError(
469
+ "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
470
+ )
471
+ elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
472
+ raise ValueError(
473
+ f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
474
+
475
+ if negative_prompt is not None and negative_prompt_embeds is not None:
476
+ raise ValueError(
477
+ f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
478
+ f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
479
+ )
480
+
481
+ if prompt_embeds is not None and negative_prompt_embeds is not None:
482
+ if prompt_embeds.shape != negative_prompt_embeds.shape:
483
+ raise ValueError(
484
+ "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
485
+ f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
486
+ f" {negative_prompt_embeds.shape}."
487
+ )
488
+
489
+ if lvd_gligen_boxes:
490
+ if len(lvd_gligen_phrases) != num_frames or len(lvd_gligen_boxes) != num_frames:
491
+ raise ValueError(
492
+ "length of `lvd_gligen_phrases` and `lvd_gligen_boxes` has to be the same and match `num_frames`, but"
493
+ f" got: `lvd_gligen_phrases` {len(lvd_gligen_phrases)}, `lvd_gligen_boxes` {len(lvd_gligen_boxes)}, `num_frames` {num_frames}"
494
+ )
495
+ else:
496
+ for frame_index, (lvd_gligen_phrases_frame, lvd_gligen_boxes_frame) in enumerate(zip(lvd_gligen_phrases, lvd_gligen_boxes)):
497
+ if len(lvd_gligen_phrases_frame) != len(lvd_gligen_boxes_frame):
498
+ raise ValueError(
499
+ "length of `lvd_gligen_phrases` and `lvd_gligen_boxes` has to be the same, but"
500
+ f" got: `lvd_gligen_phrases` {len(lvd_gligen_phrases_frame)} != `lvd_gligen_boxes` {len(lvd_gligen_boxes_frame)} at frame {frame_index}"
501
+ )
502
+
503
+ def prepare_latents(
504
+ self, batch_size, num_channels_latents, num_frames, height, width, dtype, device, generator, latents=None
505
+ ):
506
+ shape = (
507
+ batch_size,
508
+ num_channels_latents,
509
+ num_frames,
510
+ height // self.vae_scale_factor,
511
+ width // self.vae_scale_factor,
512
+ )
513
+ if isinstance(generator, list) and len(generator) != batch_size:
514
+ raise ValueError(
515
+ f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
516
+ f" size of {batch_size}. Make sure the batch size matches the length of the generators."
517
+ )
518
+
519
+ if latents is None:
520
+ latents = randn_tensor(
521
+ shape, generator=generator, device=device, dtype=dtype)
522
+ else:
523
+ latents = latents.to(device)
524
+
525
+ # scale the initial noise by the standard deviation required by the scheduler
526
+ latents = latents * self.scheduler.init_noise_sigma
527
+ return latents
528
+
529
+ def enable_fuser(self, enabled=True):
530
+ for module in self.unet.modules():
531
+ if type(module) is GatedSelfAttentionDense:
532
+ module.enabled = enabled
533
+
534
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_freeu
535
+ def enable_freeu(self, s1: float, s2: float, b1: float, b2: float):
536
+ r"""Enables the FreeU mechanism as in https://arxiv.org/abs/2309.11497.
537
+
538
+ The suffixes after the scaling factors represent the stages where they are being applied.
539
+
540
+ Please refer to the [official repository](https://github.com/ChenyangSi/FreeU) for combinations of the values
541
+ that are known to work well for different pipelines such as Stable Diffusion v1, v2, and Stable Diffusion XL.
542
+
543
+ Args:
544
+ s1 (`float`):
545
+ Scaling factor for stage 1 to attenuate the contributions of the skip features. This is done to
546
+ mitigate "oversmoothing effect" in the enhanced denoising process.
547
+ s2 (`float`):
548
+ Scaling factor for stage 2 to attenuate the contributions of the skip features. This is done to
549
+ mitigate "oversmoothing effect" in the enhanced denoising process.
550
+ b1 (`float`): Scaling factor for stage 1 to amplify the contributions of backbone features.
551
+ b2 (`float`): Scaling factor for stage 2 to amplify the contributions of backbone features.
552
+ """
553
+ if not hasattr(self, "unet"):
554
+ raise ValueError("The pipeline must have `unet` for using FreeU.")
555
+ self.unet.enable_freeu(s1=s1, s2=s2, b1=b1, b2=b2)
556
+
557
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_freeu
558
+ def disable_freeu(self):
559
+ """Disables the FreeU mechanism if enabled."""
560
+ self.unet.disable_freeu()
561
+
562
+ @torch.no_grad()
563
+ @replace_example_docstring(EXAMPLE_DOC_STRING)
564
+ def __call__(
565
+ self,
566
+ prompt: Union[str, List[str]] = None,
567
+ height: Optional[int] = None,
568
+ width: Optional[int] = None,
569
+ num_frames: int = 16,
570
+ num_inference_steps: int = 50,
571
+ guidance_scale: float = 9.0,
572
+ lvd_gligen_scheduled_sampling_beta: float = 0.3,
573
+ lvd_gligen_phrases: List[List[str]] = None,
574
+ lvd_gligen_boxes: List[List[List[float]]] = None,
575
+ negative_prompt: Optional[Union[str, List[str]]] = None,
576
+ eta: float = 0.0,
577
+ generator: Optional[Union[torch.Generator,
578
+ List[torch.Generator]]] = None,
579
+ latents: Optional[torch.FloatTensor] = None,
580
+ prompt_embeds: Optional[torch.FloatTensor] = None,
581
+ negative_prompt_embeds: Optional[torch.FloatTensor] = None,
582
+ output_type: Optional[str] = "np",
583
+ return_dict: bool = True,
584
+ callback: Optional[Callable[[
585
+ int, int, torch.FloatTensor], None]] = None,
586
+ callback_steps: int = 1,
587
+ cross_attention_kwargs: Optional[Dict[str, Any]] = None,
588
+ clip_skip: Optional[int] = None,
589
+ ):
590
+ r"""
591
+ The call function to the pipeline for generation.
592
+
593
+ Args:
594
+ prompt (`str` or `List[str]`, *optional*):
595
+ The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
596
+ height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
597
+ The height in pixels of the generated video.
598
+ width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
599
+ The width in pixels of the generated video.
600
+ num_frames (`int`, *optional*, defaults to 16):
601
+ The number of video frames that are generated. Defaults to 16 frames, which at 8 frames per second
602
+ amounts to 2 seconds of video.
603
+ num_inference_steps (`int`, *optional*, defaults to 50):
604
+ The number of denoising steps. More denoising steps usually lead to a higher quality videos at the
605
+ expense of slower inference.
606
+ guidance_scale (`float`, *optional*, defaults to 9.0):
607
+ A higher guidance scale value encourages the model to generate images closely linked to the text
608
+ `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
609
+ lvd_gligen_phrases (`List[List[str]]`):
610
+ The phrases to guide what to include in each of the regions defined by the corresponding
611
+ `lvd_gligen_boxes`, given as one list of phrases per frame. There should only be one phrase per bounding box.
612
+ lvd_gligen_boxes (`List[List[List[float]]]`):
613
+ The bounding boxes, provided as one list of boxes per frame, identifying rectangular regions of each frame to be filled with the
614
+ content described by the corresponding `lvd_gligen_phrases`. Each rectangular box is defined as a
615
+ `List[float]` of 4 elements `[xmin, ymin, xmax, ymax]` where each value is between [0,1].
616
+ lvd_gligen_scheduled_sampling_beta (`float`, defaults to 0.3):
617
+ Scheduled Sampling factor from [GLIGEN: Open-Set Grounded Text-to-Image
618
+ Generation](https://arxiv.org/pdf/2301.07093.pdf). Scheduled Sampling factor is only varied for
619
+ scheduled sampling during inference for improved quality and controllability.
620
+ negative_prompt (`str` or `List[str]`, *optional*):
621
+ The prompt or prompts to guide what to not include in image generation. If not defined, you need to
622
+ pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
623
625
+ eta (`float`, *optional*, defaults to 0.0):
626
+ Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
627
+ to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
628
+ generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
629
+ A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
630
+ generation deterministic.
631
+ latents (`torch.FloatTensor`, *optional*):
632
+ Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video
633
+ generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
634
+ tensor is generated by sampling using the supplied random `generator`. Latents should be of shape
635
+ `(batch_size, num_channel, num_frames, height, width)`.
636
+ prompt_embeds (`torch.FloatTensor`, *optional*):
637
+ Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
638
+ provided, text embeddings are generated from the `prompt` input argument.
639
+ negative_prompt_embeds (`torch.FloatTensor`, *optional*):
640
+ Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
641
+ not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
642
+ output_type (`str`, *optional*, defaults to `"np"`):
643
+ The output format of the generated video. Choose between `torch.FloatTensor` or `np.array`.
644
+ return_dict (`bool`, *optional*, defaults to `True`):
645
+ Whether or not to return a [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] instead
646
+ of a plain tuple.
647
+ callback (`Callable`, *optional*):
648
+ A function that calls every `callback_steps` steps during inference. The function is called with the
649
+ following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
650
+ callback_steps (`int`, *optional*, defaults to 1):
651
+ The frequency at which the `callback` function is called. If not specified, the callback is called at
652
+ every step.
653
+ cross_attention_kwargs (`dict`, *optional*):
654
+ A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
655
+ [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
656
+ clip_skip (`int`, *optional*):
657
+ Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
658
+ the output of the pre-final layer will be used for computing the prompt embeddings.
659
+ Examples:
660
+
661
+ Returns:
662
+ [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] or `tuple`:
663
+ If `return_dict` is `True`, [`~pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput`] is
664
+ returned, otherwise a `tuple` is returned where the first element is a list with the generated frames.
665
+ """
666
+ # 0. Default height and width to unet
667
+ height = height or self.unet.config.sample_size * self.vae_scale_factor
668
+ width = width or self.unet.config.sample_size * self.vae_scale_factor
669
+
670
+ num_images_per_prompt = 1
671
+
672
+ # 1. Check inputs. Raise error if not correct
673
+ self.check_inputs(
674
+ prompt, height, width, callback_steps, lvd_gligen_phrases,
675
+ lvd_gligen_boxes, negative_prompt, prompt_embeds, negative_prompt_embeds, num_frames
676
+ )
677
+
678
+ # 2. Define call parameters
679
+ if prompt is not None and isinstance(prompt, str):
680
+ batch_size = 1
681
+ elif prompt is not None and isinstance(prompt, list):
682
+ batch_size = len(prompt)
683
+ else:
684
+ batch_size = prompt_embeds.shape[0]
685
+
686
+ device = self._execution_device
687
+ # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)
688
+ # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
689
+ # corresponds to doing no classifier free guidance.
690
+ do_classifier_free_guidance = guidance_scale > 1.0
691
+
692
+ # 3. Encode input prompt
693
+ text_encoder_lora_scale = (
694
+ cross_attention_kwargs.get(
695
+ "scale", None) if cross_attention_kwargs is not None else None
696
+ )
697
+ prompt_embeds, negative_prompt_embeds = self.encode_prompt(
698
+ prompt,
699
+ device,
700
+ num_images_per_prompt,
701
+ do_classifier_free_guidance,
702
+ negative_prompt,
703
+ prompt_embeds=prompt_embeds,
704
+ negative_prompt_embeds=negative_prompt_embeds,
705
+ lora_scale=text_encoder_lora_scale,
706
+ clip_skip=clip_skip,
707
+ )
708
+ # For classifier free guidance, we need to do two forward passes.
709
+ # Here we concatenate the unconditional and text embeddings into a single batch
710
+ # to avoid doing two forward passes
711
+ if do_classifier_free_guidance:
712
+ prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
713
+
714
+ # 4. Prepare timesteps
715
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
716
+ timesteps = self.scheduler.timesteps
717
+
718
+ # 5. Prepare latent variables
719
+ num_channels_latents = self.unet.config.in_channels
720
+ latents = self.prepare_latents(
721
+ batch_size * num_images_per_prompt,
722
+ num_channels_latents,
723
+ num_frames,
724
+ height,
725
+ width,
726
+ prompt_embeds.dtype,
727
+ device,
728
+ generator,
729
+ latents,
730
+ )
731
+
732
+ # 5.1 Prepare GLIGEN variables
733
+ if lvd_gligen_boxes:
734
+ max_objs = 30
735
+ boxes_all, text_embeddings_all, masks_all = [], [], []
736
+ for lvd_gligen_phrases_frame, lvd_gligen_boxes_frame in zip(lvd_gligen_phrases, lvd_gligen_boxes):
737
+ if len(lvd_gligen_boxes_frame) > max_objs:
738
+ warnings.warn(
739
+ f"More than {max_objs} objects found. Only first {max_objs} objects will be processed.",
740
+ FutureWarning,
741
+ )
742
+ lvd_gligen_phrases_frame = lvd_gligen_phrases_frame[:max_objs]
743
+ lvd_gligen_boxes_frame = lvd_gligen_boxes_frame[:max_objs]
744
+
745
+ # prepare batched input to the PositionNet (boxes, phrases, mask)
746
+ # Get tokens for phrases from pre-trained CLIPTokenizer
747
+ tokenizer_inputs = self.tokenizer(
748
+ lvd_gligen_phrases_frame, padding=True, return_tensors="pt").to(device)
749
+ # For the token, we use the same pre-trained text encoder
750
+ # to obtain its text feature
751
+ _text_embeddings = self.text_encoder(
752
+ **tokenizer_inputs).pooler_output
753
+ n_objs = len(lvd_gligen_boxes_frame)
754
+ # Each entity described in the phrases is denoted with a bounding box;
755
+ # we represent the location information as (xmin,ymin,xmax,ymax)
756
+ boxes = torch.zeros(max_objs, 4, device=device,
757
+ dtype=self.text_encoder.dtype)
758
+ boxes[:n_objs] = torch.tensor(lvd_gligen_boxes_frame)
759
+ text_embeddings = torch.zeros(
760
+ max_objs, self.unet.cross_attention_dim, device=device, dtype=self.text_encoder.dtype
761
+ )
762
+ text_embeddings[:n_objs] = _text_embeddings
763
+ # Generate a mask for each object, i.e., each entity described by the phrases
764
+ masks = torch.zeros(max_objs, device=device,
765
+ dtype=self.text_encoder.dtype)
766
+ masks[:n_objs] = 1
767
+
768
+ repeat_batch = batch_size * num_images_per_prompt
769
+ boxes = boxes.unsqueeze(0).expand(repeat_batch, -1, -1).clone()
770
+ text_embeddings = text_embeddings.unsqueeze(
771
+ 0).expand(repeat_batch, -1, -1).clone()
772
+ masks = masks.unsqueeze(0).expand(repeat_batch, -1).clone()
773
+ if do_classifier_free_guidance:
774
+ repeat_batch = repeat_batch * 2
775
+ boxes = torch.cat([boxes] * 2)
776
+ text_embeddings = torch.cat([text_embeddings] * 2)
777
+ masks = torch.cat([masks] * 2)
778
+ masks[: repeat_batch // 2] = 0
779
+
780
+ boxes_all.append(boxes)
781
+ text_embeddings_all.append(text_embeddings)
782
+ masks_all.append(masks)
783
+
784
+ if cross_attention_kwargs is None:
785
+ cross_attention_kwargs = {}
786
+
787
+ # In `UNet3DConditionModel`, there is a permute and reshape to merge batch dimension and frame dimension.
788
+ boxes_all = torch.stack(boxes_all, dim=1).flatten(0, 1)
789
+ text_embeddings_all = torch.stack(
790
+ text_embeddings_all, dim=1).flatten(0, 1)
791
+ masks_all = torch.stack(masks_all, dim=1).flatten(0, 1)
792
+ cross_attention_kwargs["gligen"] = {
793
+ "boxes": boxes_all, "positive_embeddings": text_embeddings_all, "masks": masks_all}
794
+
795
+ num_grounding_steps = int(
796
+ lvd_gligen_scheduled_sampling_beta * len(timesteps))
797
+ self.enable_fuser(True)
798
+
799
+ # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
800
+ extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
801
+
802
+ # 7. Denoising loop
803
+ num_warmup_steps = len(timesteps) - \
804
+ num_inference_steps * self.scheduler.order
805
+ with self.progress_bar(total=num_inference_steps) as progress_bar:
806
+ for i, t in enumerate(timesteps):
807
+ # Scheduled sampling
808
+ if i == num_grounding_steps:
809
+ self.enable_fuser(False)
810
+
811
+ assert latents.shape[1] == 4, f"latent channel mismatch: {latents.shape}"
812
+
813
+ # expand the latents if we are doing classifier free guidance
814
+ latent_model_input = torch.cat(
815
+ [latents] * 2) if do_classifier_free_guidance else latents
816
+ latent_model_input = self.scheduler.scale_model_input(
817
+ latent_model_input, t)
818
+
819
+ # predict the noise residual
820
+ noise_pred = self.unet(
821
+ latent_model_input,
822
+ t,
823
+ encoder_hidden_states=prompt_embeds,
824
+ cross_attention_kwargs=cross_attention_kwargs,
825
+ return_dict=False,
826
+ )[0]
827
+
828
+ # perform guidance
829
+ if do_classifier_free_guidance:
830
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
831
+ noise_pred = noise_pred_uncond + guidance_scale * \
832
+ (noise_pred_text - noise_pred_uncond)
833
+
834
+ # reshape latents
835
+ bsz, channel, frames, width, height = latents.shape
836
+ latents = latents.permute(0, 2, 1, 3, 4).reshape(
837
+ bsz * frames, channel, width, height)
838
+ noise_pred = noise_pred.permute(0, 2, 1, 3, 4).reshape(
839
+ bsz * frames, channel, width, height)
840
+
841
+ # compute the previous noisy sample x_t -> x_t-1
842
+ latents = self.scheduler.step(
843
+ noise_pred, t, latents, **extra_step_kwargs).prev_sample
844
+
845
+ # reshape latents back
846
+ latents = latents[None, :].reshape(
847
+ bsz, frames, channel, width, height).permute(0, 2, 1, 3, 4)
848
+
849
+ # call the callback, if provided
850
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
851
+ progress_bar.update()
852
+ if callback is not None and i % callback_steps == 0:
853
+ step_idx = i // getattr(self.scheduler, "order", 1)
854
+ callback(step_idx, t, latents)
855
+
856
+ if output_type == "latent":
857
+ return TextToVideoSDPipelineOutput(frames=latents)
858
+
859
+ video_tensor = self.decode_latents(latents)
860
+
861
+ if output_type == "pt":
862
+ video = video_tensor
863
+ else:
864
+ video = tensor2vid(video_tensor)
865
+
866
+ # Offload all models
867
+ self.maybe_free_model_hooks()
868
+
869
+ if not return_dict:
870
+ return (video,)
871
+
872
+ return TextToVideoSDPipelineOutput(frames=video)
model_index.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "_class_name": ["lvd_pipeline", "GroundedTextToVideoSDPipeline"],
3
+ "_diffusers_version": "0.17.0.dev0",
4
+ "scheduler": [
5
+ "diffusers",
6
+ "DDIMScheduler"
7
+ ],
8
+ "text_encoder": [
9
+ "transformers",
10
+ "CLIPTextModel"
11
+ ],
12
+ "tokenizer": [
13
+ "transformers",
14
+ "CLIPTokenizer"
15
+ ],
16
+ "unet": [
17
+ "lvd_unet_3d_condition",
18
+ "GroundedUNet3DConditionModel"
19
+ ],
20
+ "vae": [
21
+ "diffusers",
22
+ "AutoencoderKL"
23
+ ]
24
+ }
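Note: the two-element entries above (`["lvd_pipeline", "GroundedTextToVideoSDPipeline"]` and `["lvd_unet_3d_condition", "GroundedUNet3DConditionModel"]`) point Diffusers at the custom pipeline and UNet classes shipped in this repo rather than built-in classes. A minimal loading sketch, assuming Diffusers' remote-code mechanism resolves these module/class pairs and with the repo id left as a placeholder:

```py
from diffusers import DiffusionPipeline

# trust_remote_code is required so the custom modules in this repo can be imported.
pipe = DiffusionPipeline.from_pretrained("<this-repo-id>", trust_remote_code=True)
print(type(pipe).__name__)  # expected: GroundedTextToVideoSDPipeline
```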
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "_class_name": "DDIMScheduler",
3
+ "_diffusers_version": "0.17.0.dev0",
4
+ "beta_end": 0.012,
5
+ "beta_schedule": "scaled_linear",
6
+ "beta_start": 0.00085,
7
+ "clip_sample": false,
8
+ "clip_sample_range": 1.0,
9
+ "dynamic_thresholding_ratio": 0.995,
10
+ "num_train_timesteps": 1000,
11
+ "prediction_type": "epsilon",
12
+ "sample_max_value": 1.0,
13
+ "set_alpha_to_one": false,
14
+ "skip_prk_steps": true,
15
+ "steps_offset": 1,
16
+ "thresholding": false,
17
+ "trained_betas": null
18
+ }
text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "_name_or_path": "./models/model_scope_diffusers/",
3
+ "architectures": [
4
+ "CLIPTextModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 0,
8
+ "dropout": 0.0,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "gelu",
11
+ "hidden_size": 1024,
12
+ "initializer_factor": 1.0,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 4096,
15
+ "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 77,
17
+ "model_type": "clip_text_model",
18
+ "num_attention_heads": 16,
19
+ "num_hidden_layers": 23,
20
+ "pad_token_id": 1,
21
+ "projection_dim": 512,
22
+ "torch_dtype": "float16",
23
+ "transformers_version": "4.29.2",
24
+ "vocab_size": 49408
25
+ }
text_encoder/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:76877281ed10a4a71f6c2aa0edd286a9e5e23a852a05d13fb05965b464a305bb
3
+ size 680904225
tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "!",
17
+ "unk_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": true,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "bos_token": {
4
+ "__type": "AddedToken",
5
+ "content": "<|startoftext|>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false
10
+ },
11
+ "clean_up_tokenization_spaces": true,
12
+ "do_lower_case": true,
13
+ "eos_token": {
14
+ "__type": "AddedToken",
15
+ "content": "<|endoftext|>",
16
+ "lstrip": false,
17
+ "normalized": true,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "errors": "replace",
22
+ "model_max_length": 77,
23
+ "pad_token": "<|endoftext|>",
24
+ "tokenizer_class": "CLIPTokenizer",
25
+ "unk_token": {
26
+ "__type": "AddedToken",
27
+ "content": "<|endoftext|>",
28
+ "lstrip": false,
29
+ "normalized": true,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ }
33
+ }
tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
unet/config.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "_class_name": "GroundedUNet3DConditionModel",
3
+ "_diffusers_version": "0.17.0.dev0",
4
+ "_name_or_path": "/home/tony/text-to-video-lvd-zs-1.7b/unet",
5
+ "act_fn": "silu",
6
+ "attention_head_dim": 64,
7
+ "block_out_channels": [
8
+ 320,
9
+ 640,
10
+ 1280,
11
+ 1280
12
+ ],
13
+ "cross_attention_dim": 1024,
14
+ "attention_type": "gated",
15
+ "down_block_types": [
16
+ "CrossAttnDownBlock3D",
17
+ "CrossAttnDownBlock3D",
18
+ "CrossAttnDownBlock3D",
19
+ "DownBlock3D"
20
+ ],
21
+ "downsample_padding": 1,
22
+ "in_channels": 4,
23
+ "layers_per_block": 2,
24
+ "mid_block_scale_factor": 1,
25
+ "norm_eps": 1e-05,
26
+ "norm_num_groups": 32,
27
+ "out_channels": 4,
28
+ "sample_size": 32,
29
+ "up_block_types": [
30
+ "UpBlock3D",
31
+ "CrossAttnUpBlock3D",
32
+ "CrossAttnUpBlock3D",
33
+ "CrossAttnUpBlock3D"
34
+ ]
35
+ }
unet/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:69d5422e6f3080caf390e2cf209941c20e21632cb68b724c8987588f4e8491c6
3
+ size 3248197593
unet/lvd_unet_3d_condition.py ADDED
The diff for this file is too large to render. See raw diff
 
vae/config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.17.0.dev0",
4
+ "_name_or_path": "./models/model_scope_diffusers/",
5
+ "act_fn": "silu",
6
+ "block_out_channels": [
7
+ 128,
8
+ 256,
9
+ 512,
10
+ 512
11
+ ],
12
+ "down_block_types": [
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D",
16
+ "DownEncoderBlock2D"
17
+ ],
18
+ "in_channels": 3,
19
+ "latent_channels": 4,
20
+ "layers_per_block": 2,
21
+ "norm_num_groups": 32,
22
+ "out_channels": 3,
23
+ "sample_size": 512,
24
+ "scaling_factor": 0.18215,
25
+ "up_block_types": [
26
+ "UpDecoderBlock2D",
27
+ "UpDecoderBlock2D",
28
+ "UpDecoderBlock2D",
29
+ "UpDecoderBlock2D"
30
+ ]
31
+ }
vae/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8b0d11ff25d00ceaa02f602831d9cfe650509fdc850c0a1bcb2acdfa03bd5d56
3
+ size 167407857