1170300521 committed on
Commit 45e5b01
1 Parent(s): c766ff7

add readme

Files changed (2)
  1. GITHUB_README.md +162 -0
  2. README.md +13 -155
GITHUB_README.md ADDED
@@ -0,0 +1,162 @@
+ # ControlVideo
+
+ Official PyTorch implementation of "ControlVideo: Training-free Controllable Text-to-Video Generation"
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2305.13077-b31b1b.svg)](https://arxiv.org/abs/2305.13077)
+ ![visitors](https://visitor-badge.laobi.icu/badge?page_id=YBYBZhang/ControlVideo)
+ [![Replicate](https://replicate.com/cjwbw/controlvideo/badge)](https://replicate.com/cjwbw/controlvideo)
+
+ <p align="center">
+ <img src="assets/overview.png" width="1080px"/>
+ <br>
+ <em>ControlVideo adapts ControlNet to video without any finetuning, aiming to directly inherit its high-quality and consistent generation.</em>
+ </p>
+
+ ## News
+ * [07/11/2023] Added support for a [ControlNet 1.1](https://github.com/lllyasviel/ControlNet-v1-1-nightly)-based version!
+ * [05/28/2023] Thanks to [chenxwh](https://github.com/chenxwh) for adding a [Replicate demo](https://replicate.com/cjwbw/controlvideo)!
+ * [05/25/2023] Code for [ControlVideo](https://github.com/YBYBZhang/ControlVideo/) released!
+ * [05/23/2023] Paper [ControlVideo](https://arxiv.org/abs/2305.13077) released!
+
+ ## Setup
+
+ ### 1. Download Weights
+ Download all pre-trained weights to the `checkpoints/` directory. These include [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), ControlNet 1.0 conditioned on [canny edges](https://huggingface.co/lllyasviel/sd-controlnet-canny), [depth maps](https://huggingface.co/lllyasviel/sd-controlnet-depth), or [human poses](https://huggingface.co/lllyasviel/sd-controlnet-openpose), and ControlNet 1.1 (available [here](https://huggingface.co/lllyasviel)).
+ `flownet.pkl` contains the weights of [RIFE](https://github.com/megvii-research/ECCV2022-RIFE).
+ The final file tree looks like:
+
+ ```none
+ checkpoints
+ ├── stable-diffusion-v1-5
+ ├── sd-controlnet-canny
+ ├── sd-controlnet-depth
+ ├── sd-controlnet-openpose
+ ├── ...
+ └── flownet.pkl
+ ```
+ ### 2. Requirements
+
+ ```shell
+ conda create -n controlvideo python=3.10
+ conda activate controlvideo
+ pip install -r requirements.txt
+ ```
+ Note: installing `xformers` is recommended to reduce memory usage and running time. `controlnet-aux` has been updated to version 0.0.6.
+
+ ## Inference
+
+ To perform text-to-video generation, run the command in `inference.sh`:
+ ```bash
+ python inference.py \
+ --prompt "A striking mallard floats effortlessly on the sparkling pond." \
+ --condition "depth" \
+ --video_path "data/mallard-water.mp4" \
+ --output_path "outputs/" \
+ --video_length 15 \
+ --smoother_steps 19 20 \
+ --width 512 \
+ --height 512 \
+ --frame_rate 2 \
+ --version v10 \
+ # --is_long_video
+ ```
+ where `--video_length` is the length of the synthesized video, `--condition` is the type of structure sequence used as the condition (e.g., `depth`),
+ `--smoother_steps` sets the timesteps at which smoothing is performed, `--version` selects the ControlNet version (e.g., `v10` or `v11`), and `--is_long_video` enables efficient long-video synthesis.
+
+ ## Visualizations
+
+ ### ControlVideo on depth maps
+
+ <table class="center">
+ <tr>
+ <td width=30% align="center"><img src="assets/depth/A_charming_flamingo_gracefully_wanders_in_the_calm_and_serene_water,_its_delicate_neck_curving_into_an_elegant_shape..gif" raw=true></td>
+ <td width=30% align="center"><img src="assets/depth/A_striking_mallard_floats_effortlessly_on_the_sparkling_pond..gif" raw=true></td>
+ <td width=30% align="center"><img src="assets/depth/A_gigantic_yellow_jeep_slowly_turns_on_a_wide,_smooth_road_in_the_city..gif" raw=true></td>
+ </tr>
+ <tr>
+ <td width=30% align="center">"A charming flamingo gracefully wanders in the calm and serene water, its delicate neck curving into an elegant shape."</td>
+ <td width=30% align="center">"A striking mallard floats effortlessly on the sparkling pond."</td>
+ <td width=30% align="center">"A gigantic yellow jeep slowly turns on a wide, smooth road in the city."</td>
+ </tr>
+ <tr>
+ <td width=30% align="center"><img src="assets/depth/A_sleek_boat_glides_effortlessly_through_the_shimmering_river,_van_gogh_style..gif" raw=true></td>
+ <td width=30% align="center"><img src="assets/depth/A_majestic_sailing_boat_cruises_along_the_vast,_azure_sea..gif" raw=true></td>
+ <td width=30% align="center"><img src="assets/depth/A_contented_cow_ambles_across_the_dewy,_verdant_pasture..gif" raw=true></td>
+ </tr>
+ <tr>
+ <td width=30% align="center">"A sleek boat glides effortlessly through the shimmering river, van gogh style."</td>
+ <td width=30% align="center">"A majestic sailing boat cruises along the vast, azure sea."</td>
+ <td width=30% align="center">"A contented cow ambles across the dewy, verdant pasture."</td>
+ </tr>
+ </table>
+
+ ### ControlVideo on canny edges
+
+ <table class="center">
+ <tr>
+ <td width=30% align="center"><img src="assets/canny/A_young_man_riding_a_sleek,_black_motorbike_through_the_winding_mountain_roads..gif" raw=true></td>
+ <td width=30% align="center"><img src="assets/canny/A_white_swan_moving_on_the_lake,_cartoon_style..gif" raw=true></td>
+ <td width=30% align="center"><img src="assets/canny/A_dusty_old_jeep_was_making_its_way_down_the_winding_forest_road,_creaking_and_groaning_with_each_bump_and_turn..gif" raw=true></td>
+ </tr>
+ <tr>
+ <td width=30% align="center">"A young man riding a sleek, black motorbike through the winding mountain roads."</td>
+ <td width=30% align="center">"A white swan moving on the lake, cartoon style."</td>
+ <td width=30% align="center">"A dusty old jeep was making its way down the winding forest road, creaking and groaning with each bump and turn."</td>
+ </tr>
+ <tr>
+ <td width=30% align="center"><img src="assets/canny/A_shiny_red_jeep_smoothly_turns_on_a_narrow,_winding_road_in_the_mountains..gif" raw=true></td>
+ <td width=30% align="center"><img src="assets/canny/A_majestic_camel_gracefully_strides_across_the_scorching_desert_sands..gif" raw=true></td>
+ <td width=30% align="center"><img src="assets/canny/A_fit_man_is_leisurely_hiking_through_a_lush_and_verdant_forest..gif" raw=true></td>
+ </tr>
+ <tr>
+ <td width=30% align="center">"A shiny red jeep smoothly turns on a narrow, winding road in the mountains."</td>
+ <td width=30% align="center">"A majestic camel gracefully strides across the scorching desert sands."</td>
+ <td width=30% align="center">"A fit man is leisurely hiking through a lush and verdant forest."</td>
+ </tr>
+ </table>
+
+
+ ### ControlVideo on human poses
+
+ <table class="center">
+ <tr>
+ <td width=25% align="center"><img src="assets/pose/James_bond_moonwalk_on_the_beach,_animation_style.gif" raw=true></td>
+ <td width=25% align="center"><img src="assets/pose/Goku_in_a_mountain_range,_surreal_style..gif" raw=true></td>
+ <td width=25% align="center"><img src="assets/pose/Hulk_is_jumping_on_the_street,_cartoon_style.gif" raw=true></td>
+ <td width=25% align="center"><img src="assets/pose/A_robot_dances_on_a_road,_animation_style.gif" raw=true></td>
+ </tr>
+ <tr>
+ <td width=25% align="center">"James bond moonwalk on the beach, animation style."</td>
+ <td width=25% align="center">"Goku in a mountain range, surreal style."</td>
+ <td width=25% align="center">"Hulk is jumping on the street, cartoon style."</td>
+ <td width=25% align="center">"A robot dances on a road, animation style."</td>
+ </tr></table>
+
+ ### Long video generation
+
+ <table class="center">
+ <tr>
+ <td width=60% align="center"><img src="assets/long/A_steamship_on_the_ocean,_at_sunset,_sketch_style.gif" raw=true></td>
+ <td width=40% align="center"><img src="assets/long/Hulk_is_dancing_on_the_beach,_cartoon_style.gif" raw=true></td>
+ </tr>
+ <tr>
+ <td width=60% align="center">"A steamship on the ocean, at sunset, sketch style."</td>
+ <td width=40% align="center">"Hulk is dancing on the beach, cartoon style."</td>
+ </tr>
+ </table>
+
+ ## Citation
+ If you make use of our work, please cite our paper.
+ ```bibtex
+ @article{zhang2023controlvideo,
+ title={ControlVideo: Training-free Controllable Text-to-Video Generation},
+ author={Zhang, Yabo and Wei, Yuxiang and Jiang, Dongsheng and Zhang, Xiaopeng and Zuo, Wangmeng and Tian, Qi},
+ journal={arXiv preprint arXiv:2305.13077},
+ year={2023}
+ }
+ ```
+
+ ## Acknowledgement
+ This repository borrows heavily from [Diffusers](https://github.com/huggingface/diffusers), [ControlNet](https://github.com/lllyasviel/ControlNet), [Tune-A-Video](https://github.com/showlab/Tune-A-Video), and [RIFE](https://github.com/megvii-research/ECCV2022-RIFE).
+
+ There are also many other interesting works on video generation, such as [Tune-A-Video](https://github.com/showlab/Tune-A-Video), [Text2Video-Zero](https://github.com/Picsart-AI-Research/Text2Video-Zero), [Follow-Your-Pose](https://github.com/mayuelala/FollowYourPose), and [Control-A-Video](https://github.com/Weifeng-Chen/control-a-video).
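As a pre-flight check for the "Download Weights" step in the README above, the layout of `checkpoints/` can be verified with a short script. This is a minimal sketch using only the Python standard library, assuming the directory and file names shown in the README's file tree; `missing_checkpoints`, `EXPECTED_DIRS`, and `EXPECTED_FILES` are hypothetical names that are not part of the ControlVideo repository, and the ControlNet 1.1 folders (elided as `...` in the tree) are deliberately not checked.

```python
from pathlib import Path

# Hypothetical constants mirroring the README's file tree.
# Assumption: ControlNet 1.1 folders (shown only as "..." there) are not checked.
EXPECTED_DIRS = (
    "stable-diffusion-v1-5",
    "sd-controlnet-canny",
    "sd-controlnet-depth",
    "sd-controlnet-openpose",
)
EXPECTED_FILES = ("flownet.pkl",)  # RIFE weights


def missing_checkpoints(root="checkpoints"):
    """Return expected entries absent under `root` (hypothetical helper)."""
    base = Path(root)
    # Model folders must exist as directories.
    missing = [name for name in EXPECTED_DIRS if not (base / name).is_dir()]
    # The RIFE weights must exist as a regular file.
    missing += [name for name in EXPECTED_FILES if not (base / name).is_file()]
    return missing


if __name__ == "__main__":
    gaps = missing_checkpoints()
    if gaps:
        print("Missing from checkpoints/:", ", ".join(gaps))
    else:
        print("checkpoints/ matches the expected layout.")
```

Run from the repository root, it prints any missing entries. An empty result only means the expected names are present, not that the weights inside them are complete.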
README.md CHANGED
@@ -1,162 +1,20 @@
- # ControlVideo
-
- Official PyTorch implementation of "ControlVideo: Training-free Controllable Text-to-Video Generation"
-
- [![arXiv](https://img.shields.io/badge/arXiv-2305.13077-b31b1b.svg)](https://arxiv.org/abs/2305.13077)
- ![visitors](https://visitor-badge.laobi.icu/badge?page_id=YBYBZhang/ControlVideo)
- [![Replicate](https://replicate.com/cjwbw/controlvideo/badge)](https://replicate.com/cjwbw/controlvideo)
-
- <p align="center">
- <img src="assets/overview.png" width="1080px"/>
- <br>
- <em>ControlVideo adapts ControlNet to video without any finetuning, aiming to directly inherit its high-quality and consistent generation.</em>
- </p>
-
- ## News
- * [07/11/2023] Added support for a [ControlNet 1.1](https://github.com/lllyasviel/ControlNet-v1-1-nightly)-based version!
- * [05/28/2023] Thanks to [chenxwh](https://github.com/chenxwh) for adding a [Replicate demo](https://replicate.com/cjwbw/controlvideo)!
- * [05/25/2023] Code for [ControlVideo](https://github.com/YBYBZhang/ControlVideo/) released!
- * [05/23/2023] Paper [ControlVideo](https://arxiv.org/abs/2305.13077) released!
-
- ## Setup
-
- ### 1. Download Weights
- Download all pre-trained weights to the `checkpoints/` directory. These include [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), ControlNet 1.0 conditioned on [canny edges](https://huggingface.co/lllyasviel/sd-controlnet-canny), [depth maps](https://huggingface.co/lllyasviel/sd-controlnet-depth), or [human poses](https://huggingface.co/lllyasviel/sd-controlnet-openpose), and ControlNet 1.1 (available [here](https://huggingface.co/lllyasviel)).
- `flownet.pkl` contains the weights of [RIFE](https://github.com/megvii-research/ECCV2022-RIFE).
- The final file tree looks like:
-
- ```none
- checkpoints
- ├── stable-diffusion-v1-5
- ├── sd-controlnet-canny
- ├── sd-controlnet-depth
- ├── sd-controlnet-openpose
- ├── ...
- └── flownet.pkl
  ```
- ### 2. Requirements
-
- ```shell
- conda create -n controlvideo python=3.10
- conda activate controlvideo
- pip install -r requirements.txt
- ```
- Note: installing `xformers` is recommended to reduce memory usage and running time. `controlnet-aux` has been updated to version 0.0.6.
-
- ## Inference
-
- To perform text-to-video generation, run the command in `inference.sh`:
- ```bash
- python inference.py \
- --prompt "A striking mallard floats effortlessly on the sparkling pond." \
- --condition "depth" \
- --video_path "data/mallard-water.mp4" \
- --output_path "outputs/" \
- --video_length 15 \
- --smoother_steps 19 20 \
- --width 512 \
- --height 512 \
- --frame_rate 2 \
- --version v10 \
- # --is_long_video
- ```
- where `--video_length` is the length of the synthesized video, `--condition` is the type of structure sequence used as the condition (e.g., `depth`),
- `--smoother_steps` sets the timesteps at which smoothing is performed, `--version` selects the ControlNet version (e.g., `v10` or `v11`), and `--is_long_video` enables efficient long-video synthesis.
-
- ## Visualizations
-
- ### ControlVideo on depth maps
-
- <table class="center">
- <tr>
- <td width=30% align="center"><img src="assets/depth/A_charming_flamingo_gracefully_wanders_in_the_calm_and_serene_water,_its_delicate_neck_curving_into_an_elegant_shape..gif" raw=true></td>
- <td width=30% align="center"><img src="assets/depth/A_striking_mallard_floats_effortlessly_on_the_sparkling_pond..gif" raw=true></td>
- <td width=30% align="center"><img src="assets/depth/A_gigantic_yellow_jeep_slowly_turns_on_a_wide,_smooth_road_in_the_city..gif" raw=true></td>
- </tr>
- <tr>
- <td width=30% align="center">"A charming flamingo gracefully wanders in the calm and serene water, its delicate neck curving into an elegant shape."</td>
- <td width=30% align="center">"A striking mallard floats effortlessly on the sparkling pond."</td>
- <td width=30% align="center">"A gigantic yellow jeep slowly turns on a wide, smooth road in the city."</td>
- </tr>
- <tr>
- <td width=30% align="center"><img src="assets/depth/A_sleek_boat_glides_effortlessly_through_the_shimmering_river,_van_gogh_style..gif" raw=true></td>
- <td width=30% align="center"><img src="assets/depth/A_majestic_sailing_boat_cruises_along_the_vast,_azure_sea..gif" raw=true></td>
- <td width=30% align="center"><img src="assets/depth/A_contented_cow_ambles_across_the_dewy,_verdant_pasture..gif" raw=true></td>
- </tr>
- <tr>
- <td width=30% align="center">"A sleek boat glides effortlessly through the shimmering river, van gogh style."</td>
- <td width=30% align="center">"A majestic sailing boat cruises along the vast, azure sea."</td>
- <td width=30% align="center">"A contented cow ambles across the dewy, verdant pasture."</td>
- </tr>
- </table>
-
- ### ControlVideo on canny edges
-
- <table class="center">
- <tr>
- <td width=30% align="center"><img src="assets/canny/A_young_man_riding_a_sleek,_black_motorbike_through_the_winding_mountain_roads..gif" raw=true></td>
- <td width=30% align="center"><img src="assets/canny/A_white_swan_moving_on_the_lake,_cartoon_style..gif" raw=true></td>
- <td width=30% align="center"><img src="assets/canny/A_dusty_old_jeep_was_making_its_way_down_the_winding_forest_road,_creaking_and_groaning_with_each_bump_and_turn..gif" raw=true></td>
- </tr>
- <tr>
- <td width=30% align="center">"A young man riding a sleek, black motorbike through the winding mountain roads."</td>
- <td width=30% align="center">"A white swan moving on the lake, cartoon style."</td>
- <td width=30% align="center">"A dusty old jeep was making its way down the winding forest road, creaking and groaning with each bump and turn."</td>
- </tr>
- <tr>
- <td width=30% align="center"><img src="assets/canny/A_shiny_red_jeep_smoothly_turns_on_a_narrow,_winding_road_in_the_mountains..gif" raw=true></td>
- <td width=30% align="center"><img src="assets/canny/A_majestic_camel_gracefully_strides_across_the_scorching_desert_sands..gif" raw=true></td>
- <td width=30% align="center"><img src="assets/canny/A_fit_man_is_leisurely_hiking_through_a_lush_and_verdant_forest..gif" raw=true></td>
- </tr>
- <tr>
- <td width=30% align="center">"A shiny red jeep smoothly turns on a narrow, winding road in the mountains."</td>
- <td width=30% align="center">"A majestic camel gracefully strides across the scorching desert sands."</td>
- <td width=30% align="center">"A fit man is leisurely hiking through a lush and verdant forest."</td>
- </tr>
- </table>
-
-
- ### ControlVideo on human poses
-
- <table class="center">
- <tr>
- <td width=25% align="center"><img src="assets/pose/James_bond_moonwalk_on_the_beach,_animation_style.gif" raw=true></td>
- <td width=25% align="center"><img src="assets/pose/Goku_in_a_mountain_range,_surreal_style..gif" raw=true></td>
- <td width=25% align="center"><img src="assets/pose/Hulk_is_jumping_on_the_street,_cartoon_style.gif" raw=true></td>
- <td width=25% align="center"><img src="assets/pose/A_robot_dances_on_a_road,_animation_style.gif" raw=true></td>
- </tr>
- <tr>
- <td width=25% align="center">"James bond moonwalk on the beach, animation style."</td>
- <td width=25% align="center">"Goku in a mountain range, surreal style."</td>
- <td width=25% align="center">"Hulk is jumping on the street, cartoon style."</td>
- <td width=25% align="center">"A robot dances on a road, animation style."</td>
- </tr></table>
-
- ### Long video generation
-
- <table class="center">
- <tr>
- <td width=60% align="center"><img src="assets/long/A_steamship_on_the_ocean,_at_sunset,_sketch_style.gif" raw=true></td>
- <td width=40% align="center"><img src="assets/long/Hulk_is_dancing_on_the_beach,_cartoon_style.gif" raw=true></td>
- </tr>
- <tr>
- <td width=60% align="center">"A steamship on the ocean, at sunset, sketch style."</td>
- <td width=40% align="center">"Hulk is dancing on the beach, cartoon style."</td>
- </tr>
- </table>
-
- ## Citation
- If you make use of our work, please cite our paper.
- ```bibtex
  @article{zhang2023controlvideo,
  title={ControlVideo: Training-free Controllable Text-to-Video Generation},
  author={Zhang, Yabo and Wei, Yuxiang and Jiang, Dongsheng and Zhang, Xiaopeng and Zuo, Wangmeng and Tian, Qi},
  journal={arXiv preprint arXiv:2305.13077},
  year={2023}
  }
- ```
-
- ## Acknowledgement
- This repository borrows heavily from [Diffusers](https://github.com/huggingface/diffusers), [ControlNet](https://github.com/lllyasviel/ControlNet), [Tune-A-Video](https://github.com/showlab/Tune-A-Video), and [RIFE](https://github.com/megvii-research/ECCV2022-RIFE).
-
- There are also many other interesting works on video generation, such as [Tune-A-Video](https://github.com/showlab/Tune-A-Video), [Text2Video-Zero](https://github.com/Picsart-AI-Research/Text2Video-Zero), [Follow-Your-Pose](https://github.com/mayuelala/FollowYourPose), and [Control-A-Video](https://github.com/Weifeng-Chen/control-a-video).
 
+ ---
+ title: ControlVideo
+ emoji: 🦩
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: 3.36.1
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
+ ### Citation
  ```
  @article{zhang2023controlvideo,
  title={ControlVideo: Training-free Controllable Text-to-Video Generation},
  author={Zhang, Yabo and Wei, Yuxiang and Jiang, Dongsheng and Zhang, Xiaopeng and Zuo, Wangmeng and Tian, Qi},
  journal={arXiv preprint arXiv:2305.13077},
  year={2023}
  }
+ ```
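The new README.md is a Hugging Face Spaces card: the lines between the two `---` markers are YAML metadata telling the Space which SDK (`gradio`), SDK version, and entry point (`app.py`) to use. As an illustration, the sketch below reads that header with the Python standard library only. It assumes flat `key: value` pairs, as in this file; `read_front_matter` is a hypothetical helper, and a full YAML parser (for example PyYAML) would be needed for nested, quoted, or multi-line values.

```python
def read_front_matter(text):
    """Parse flat `key: value` lines between the leading `---` markers.

    Hypothetical helper: assumes no nesting, lists, or multi-line values,
    and returns every value as a plain string.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing marker ends the header
            break
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            meta[key.strip()] = value.strip()
    return meta


card = """---
title: ControlVideo
emoji: 🦩
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 3.36.1
app_file: app.py
pinned: false
license: mit
---
### Citation
"""

meta = read_front_matter(card)
print(meta["sdk"], meta["sdk_version"], meta["app_file"])  # gradio 3.36.1 app.py
```

Values such as `pinned: false` stay strings here; converting them to booleans is left to a real YAML parser.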