---
license: creativeml-openrail-m
language:
- en
library_name: diffusers
pipeline_tag: text-to-video
tags:
- AIGC
- text2video
- image2video
- infinite-length
- human
---
**MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising**
Zhiqiang Xia\*,
Zhaokang Chen\*,
Bin Wu†,
Chao Li,
Kwok-Wai Hung,
Chao Zhan,
Yingjie He,
Wenjiang Zhou
(*Equal Contribution, †Corresponding Author, benbinwu@tencent.com)
Lyra Lab, Tencent Music Entertainment
**[github](https://github.com/TMElyralab/MuseV)** **[huggingface](https://huggingface.co/TMElyralab/MuseV)** **[HuggingfaceSpace](https://huggingface.co/spaces/AnchorFake/MuseVDemo)** **project (coming soon)** **Technical report (coming soon)**
We have been pursuing **the world simulator vision since March 2023, believing diffusion models can simulate the world**. `MuseV` was a milestone achieved around **July 2023**. Amazed by the progress of Sora, we decided to open-source `MuseV`, hoping it will benefit the community. Next, we will move on to the promising diffusion+transformer scheme.
We will soon release `MuseTalk`, a real-time, high-quality lip-sync model that can be combined with MuseV into a complete virtual human generation solution. Please stay tuned!
# Overview
`MuseV` is a diffusion-based virtual human video generation framework, which
1. supports **infinite-length** generation using a novel **Visual Conditioned Parallel Denoising scheme**.
2. provides a checkpoint for virtual human video generation, trained on a human dataset.
3. supports Image2Video, Text2Image2Video, and Video2Video.
4. is compatible with the **Stable Diffusion ecosystem**, including `base_model`, `lora`, `controlnet`, etc.
5. supports multi-reference-image techniques, including `IPAdapter`, `ReferenceOnly`, `ReferenceNet`, and `IPAdapterFaceID`.
6. will include training code (coming very soon).
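Item 1 names the Visual Conditioned Parallel Denoising scheme without spelling out how it works. The sketch below illustrates only the general idea it is assumed to build on: if every window of a long clip is conditioned on a shared visual frame instead of on the previous window's output, the windows no longer depend on one another and can be denoised in parallel. Everything here is hypothetical: scalar floats stand in for latent frames, `toy_denoise` is a placeholder for a real diffusion denoiser, and the fixed, non-overlapping windowing is an assumption, not MuseV's actual scheduling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_windows(num_frames: int, window_size: int) -> List[range]:
    """Partition frame indices [0, num_frames) into consecutive, non-overlapping windows."""
    if num_frames < 1 or window_size < 1:
        raise ValueError("num_frames and window_size must be positive")
    return [range(start, min(start + window_size, num_frames))
            for start in range(0, num_frames, window_size)]


def toy_denoise(frames: List[float], condition: float,
                steps: int = 10, strength: float = 0.5) -> List[float]:
    """Hypothetical stand-in for a denoiser: each step moves every frame value
    a fraction `strength` of the way toward the shared visual condition."""
    x = list(frames)
    for _ in range(steps):
        x = [v + strength * (condition - v) for v in x]
    return x


def parallel_denoise(latents: List[float], condition: float, window_size: int) -> List[float]:
    """Denoise each window independently (and concurrently), then stitch in order."""
    windows = split_into_windows(len(latents), window_size)
    with ThreadPoolExecutor() as pool:
        # Each window sees only its own frames plus the shared condition,
        # so no window has to wait for another window's output.
        chunks = list(pool.map(
            lambda w: toy_denoise([latents[i] for i in w], condition), windows))
    # pool.map yields results in input order, so concatenation restores frame order.
    return [v for chunk in chunks for v in chunk]


if __name__ == "__main__":
    latents = [0.0] * 10  # ten toy "frames"
    out = parallel_denoise(latents, condition=1.0, window_size=4)
    print([len(w) for w in split_into_windows(10, 4)])  # [4, 4, 2]
    print(out == toy_denoise(latents, condition=1.0))    # True
```

Because each window depends only on its own frames and the shared condition, the windowed result matches a single pass over all ten frames, which is what the final check shows. A real video model would likely also need some mechanism for temporal consistency across window boundaries, which this toy ignores.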
# News
- [03/27/2024] Released the `MuseV` project and the trained models `musev`, `muse_referencenet`, and `muse_referencenet_pose`.
## Model
### Overview of model structure
![model_structure](data/models/musev_structure.png)
### Parallel denoising
![parallel_denoise](data/models/parallel_denoise.png)
## Cases
All frames are generated directly by the text2video model, without any post-processing.
The cases below can be found in `configs/tasks/example.yaml`.
### Text/Image2Video
#### Human
| image | video | prompt |
|---|---|---|
| | | (masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) |
| | | (masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) |
| | | (masterpiece, best quality, highres:1), peaceful beautiful sea scene |
| | | (masterpiece, best quality, highres:1), peaceful beautiful sea scene |
| | | (masterpiece, best quality, highres:1), playing guitar |
| | | (masterpiece, best quality, highres:1), playing guitar |
| | | (masterpiece, best quality, highres:1), playing guitar |
| | | (masterpiece, best quality, highres:1), playing guitar |
| | | (masterpiece, best quality, highres:1),(1man, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) |
| | | (masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) |
| | | (masterpiece, best quality, highres:1),(1man, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) |
| | | (masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) |
| | | (masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) |
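The prompts above use two conventions worth spelling out: `(text:weight)` groups, which elsewhere in the Stable Diffusion ecosystem (for example the AUTOMATIC1111 web UI) scale the emphasis given to `text`, and a `{eye_blinks_factor}` placeholder that has to be filled in before the prompt reaches the model. The sketch below is a minimal illustration, assuming MuseV reads the weights the same way (not confirmed here); `fill_prompt`, `parse_weights`, and the value 1.8 are hypothetical.

```python
import re
from typing import List, Tuple

# Prompt template copied from the table above; {eye_blinks_factor} is a placeholder.
TEMPLATE = ("(masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, "
            "soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3)")

# Matches "(text:weight)" groups; text may contain commas but no parentheses or colons.
WEIGHTED_GROUP = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def fill_prompt(template: str, eye_blinks_factor: float) -> str:
    """Substitute the placeholder with a concrete value (hypothetical helper)."""
    return template.format(eye_blinks_factor=eye_blinks_factor)


def parse_weights(prompt: str) -> List[Tuple[str, float]]:
    """Return (text, weight) pairs for every weighted group in the prompt."""
    return [(text.strip(), float(weight)) for text, weight in WEIGHTED_GROUP.findall(prompt)]


if __name__ == "__main__":
    prompt = fill_prompt(TEMPLATE, eye_blinks_factor=1.8)  # 1.8 is an arbitrary example value
    for text, weight in parse_weights(prompt):
        print(f"{weight:>4}  {text}")
```

This parser only returns weighted groups, so unweighted text such as `peaceful beautiful sea scene` is skipped; a fuller parser would presumably keep it at an implicit weight of 1.0.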
#### Scene