# ⚡️Pyramid Flow⚡️

[[Paper]](https://arxiv.org/abs/2410.05954) [[Project Page ✨]](https://pyramid-flow.github.io) [[Code 🚀]](https://github.com/jy0205/Pyramid-Flow) [[Demo 🤗]](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow)

This is the official repository for Pyramid Flow, a training-efficient **Autoregressive Video Generation** method based on **Flow Matching**. By training only on open-source datasets, it generates high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.

## News

* `COMING SOON` ⚡️⚡️⚡️ Training code and new model checkpoints trained from scratch.
* `2024.10.11` 🤗🤗🤗 The [Hugging Face demo](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) is now available. Thanks to [@multimodalart](https://huggingface.co/multimodalart) for the contribution!
* `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.

## Installation

We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2, and we are actively working to support a wider range of versions.

```bash
git clone https://github.com/jy0205/Pyramid-Flow
cd Pyramid-Flow

# create env using conda
conda create -n pyramid python==3.8.10
conda activate pyramid
pip install -r requirements.txt
```
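To confirm that the new environment matches the versions mentioned above, here is a minimal sanity check; the expected values are taken from the preceding paragraph:

```python
# Minimal environment sanity check for the versions this codebase targets.
import sys

import torch

print(sys.version.split()[0])   # expect 3.8.10
print(torch.__version__)        # expect 2.1.2
```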

Then, you can directly download the model from [Hugging Face](https://huggingface.co/rain1011/pyramid-flow-sd3). We provide model checkpoints for both 768p and 384p video generation. The 384p checkpoint supports 5-second video generation at 24 FPS, while the 768p checkpoint supports up to 10-second video generation at 24 FPS.

```python
from huggingface_hub import snapshot_download

model_path = 'PATH'   # The local directory to save downloaded checkpoint
snapshot_download("rain1011/pyramid-flow-sd3", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
```
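If you only plan to use one resolution, you can avoid downloading the other variant's weights with the standard `ignore_patterns` option of `snapshot_download`. Note that the folder name below is an assumption based on the variant naming, so verify it against the repository's file listing:

```python
from huggingface_hub import snapshot_download

model_path = 'PATH'   # The local directory to save downloaded checkpoint

# Skip the 384p transformer weights if you only need the 768p variant.
# NOTE: 'diffusion_transformer_384p' is an assumed folder name; check the repo layout.
snapshot_download(
    "rain1011/pyramid-flow-sd3",
    local_dir=model_path,
    repo_type='model',
    ignore_patterns=["diffusion_transformer_384p/*"],
)
```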

## Usage

To use our model, please follow the inference code in `video_generation_demo.ipynb` at [this link](https://github.com/jy0205/Pyramid-Flow/blob/main/video_generation_demo.ipynb). We further simplify it into the following two-step procedure. First, load the downloaded model:

```python
import torch
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16 (fp16 is not supported yet)

model = PyramidDiTForVideoGeneration(
    'PATH',   # The downloaded checkpoint dir
    model_dtype,
    model_variant='diffusion_transformer_768p',   # or 'diffusion_transformer_384p'
)

model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")
model.vae.enable_tiling()   # reduces peak memory during VAE decoding
```
 
Then, you can try text-to-video generation:

```python
prompt = "YOUR_PROMPT"   # your text prompt

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        height=768,
        width=1280,
        temp=16,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=9.0,         # The guidance for the first frame, set it to 7 for the 384p variant
        video_guidance_scale=5.0,   # The guidance for the other video latent
        output_type="pil",
        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve VAE decoding speed
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
```
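The `temp` argument sets the number of temporal units in the generated latent video. From the two documented settings (temp=16 for roughly 5 s and temp=31 for roughly 10 s at 24 FPS), the frame count appears to follow (temp - 1) * 8 + 1; the helper below is an inference from those two data points, not an official formula:

```python
def approx_duration_seconds(temp: int, fps: int = 24) -> float:
    # Inferred from temp=16 -> ~5 s and temp=31 -> ~10 s at 24 FPS;
    # treat as an approximation, not a documented contract.
    num_frames = (temp - 1) * 8 + 1
    return num_frames / fps

print(approx_duration_seconds(16))   # ~5.04
print(approx_duration_seconds(31))   # ~10.04
```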
 
For image-to-video generation:

```python
image = load_image('PATH_OF_IMAGE').resize((1280, 768))   # your conditioning image
prompt = "YOUR_PROMPT"   # your text prompt

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate_i2v(
        prompt=prompt,
        input_image=image,
        temp=16,
        video_guidance_scale=4.0,
        output_type="pil",
        save_memory=True,   # If you have enough GPU memory, set it to `False` to improve VAE decoding speed
    )

export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
```

We also support CPU offloading to allow inference with **less than 12GB** of GPU memory by adding a `cpu_offloading=True` parameter. This feature was contributed by [@Ednaordinary](https://github.com/Ednaordinary); see [#23](https://github.com/jy0205/Pyramid-Flow/pull/23) for details.
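As a minimal sketch of where the flag goes, reusing `model`, `prompt`, and `torch_dtype` from the text-to-video example above (we assume `cpu_offloading` is simply an extra keyword of the same `generate` call, per the linked PR):

```python
# Sketch only: enable CPU offloading for GPUs with less than 12GB of memory.
with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        height=768,
        width=1280,
        temp=16,
        guidance_scale=9.0,
        video_guidance_scale=5.0,
        output_type="pil",
        save_memory=True,
        cpu_offloading=True,   # assumed keyword; offloads idle components to CPU
    )
```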

## Usage tips

* The `guidance_scale` parameter controls the visual quality. We suggest using a guidance within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint (a 384p sketch follows this list).
* The `video_guidance_scale` parameter controls the motion. A larger value increases the dynamic degree and mitigates autoregressive generation degradation, while a smaller value stabilizes the video.
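For example, here is a hedged sketch of text-to-video with the 384p checkpoint at the suggested guidance of 7, reusing `model_dtype`, `torch_dtype`, and `prompt` from the examples above; the `model_variant` value and the 640x384 resolution are assumptions based on the checkpoint naming:

```python
# Sketch only: the variant name and resolution below are assumptions.
model = PyramidDiTForVideoGeneration(
    'PATH',
    model_dtype,
    model_variant='diffusion_transformer_384p',   # assumed counterpart of the 768p variant
)
model.vae.to("cuda")
model.dit.to("cuda")
model.text_encoder.to("cuda")

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        height=384,               # assumed 384p frame size (640x384)
        width=640,
        temp=16,                  # the 384p checkpoint supports up to 5 seconds
        guidance_scale=7.0,       # the suggested guidance for the 384p checkpoint
        video_guidance_scale=5.0,
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_384p_sample.mp4", fps=24)
```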