
	---
	license: mit
	tags:
	- VAE
	- Video-Generation
	---

	# Reducio-VAE Model Card

	<!-- Provide a quick summary of what the model is/does. -->
	This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It compresses a \\(T\times H\times W\\) video into a latent of size \\(\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}\\), which amounts to 4096x downsampling (\\(4\times32\times32\\)).
	It is part of [Reducio-DiT](https://arxiv.org/abs/2411.13552), a video generation method. The codebase is available [here](https://github.com/microsoft/Reducio-VAE).
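
The downsampling arithmetic can be checked with a small sketch. The function name `latent_grid` and the 16-frame 1024x1024 clip are illustrative assumptions, not part of the Reducio-VAE codebase; only the 4x temporal and 32x spatial factors come from this model card.

```python
def latent_grid(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return the (T/4, H/32, W/32) latent grid for a video clip."""
    # Downsampling factors stated in this model card: 4x temporal, 32x spatial.
    temporal_factor, spatial_factor = 4, 32
    if frames % temporal_factor or height % spatial_factor or width % spatial_factor:
        # Assumption of this sketch; the real model may pad or crop instead.
        raise ValueError("expected frames divisible by 4 and height/width by 32")
    return (
        frames // temporal_factor,
        height // spatial_factor,
        width // spatial_factor,
    )


clip = (16, 1024, 1024)  # hypothetical clip: 16 frames at 1024x1024
grid = latent_grid(*clip)

# Ratio of input positions to latent positions (channels are ignored here).
factor = (clip[0] * clip[1] * clip[2]) // (grid[0] * grid[1] * grid[2])
print(grid, factor)  # (4, 32, 32) 4096
```

The ratio counts spatiotemporal positions only; it does not account for the number of latent channels or for the content frame used as conditioning.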


	## Model Details

	### Model Sources

	<!-- Provide the basic links for the model. -->

	- Repository: [GitHub Repository](https://github.com/microsoft/Reducio-VAE)
	- Paper: [arXiv](https://arxiv.org/abs/2411.13552)

	## Uses

	<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

	Common usage scenarios are described [here](https://github.com/microsoft/Reducio-VAE).

	### Direct Use

	<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

	The model is typically used to support the training of video diffusion models. After encoding your data into the latent space with this model, you can train your own diffusion model on the extremely compressed latents.


	## Results

	<!-- This section describes the evaluation protocols and provides the results. -->


	Metrics on 1K Pexels validation set and UCF-101:

	|Method|Downsample Factor|\\(\lvert z\rvert\\)|PSNR|SSIM|LPIPS|rFVD (Pexels)|rFVD (UCF-101)|
	|---|---|---|---|---|---|---|---|
	|SD2.1-VAE|\\(1\times8\times8\\)|4|29.23|0.82|0.09|25.96|21.00|
	|SDXL-VAE|\\(1\times8\times8\\)|16|30.54|0.85|0.08|19.87|23.68|
	|OmniTokenizer|\\(4\times8\times8\\)|8|27.11|0.89|0.07|23.88|30.52|
	|OpenSora-1.2|\\(4\times8\times8\\)|16|30.72|0.85|0.11|60.88|67.52|
	|Cosmos Tokenizer|\\(8\times8\times8\\)|16|30.84|0.74|0.12|29.44|22.06|
	|Cosmos Tokenizer|\\(8\times16\times16\\)|16|28.14|0.65|0.18|77.87|119.37|
	|Reducio-VAE|\\(4\times32\times32\\)|16|35.88|0.94|0.05|17.88|65.17|
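
To put the downsample factors and \\(\lvert z\rvert\\) on one scale, the sketch below computes how many input values map to one latent value for each row of the table. It assumes that \\(\lvert z\rvert\\) is the number of latent channels and that the input is 3-channel RGB; both are assumptions of this example, and the helper name `compression_ratio` is hypothetical.

```python
# (method, (t, h, w) downsample factor, |z|) copied from the table above.
ROWS = [
    ("SD2.1-VAE", (1, 8, 8), 4),
    ("SDXL-VAE", (1, 8, 8), 16),
    ("OmniTokenizer", (4, 8, 8), 8),
    ("OpenSora-1.2", (4, 8, 8), 16),
    ("Cosmos Tokenizer", (8, 8, 8), 16),
    ("Cosmos Tokenizer", (8, 16, 16), 16),
    ("Reducio-VAE", (4, 32, 32), 16),
]


def compression_ratio(factor, latent_channels, input_channels=3):
    """Input values per latent value, assuming |z| counts latent channels."""
    t, h, w = factor
    return input_channels * t * h * w / latent_channels


ratios = [(name, compression_ratio(factor, z)) for name, factor, z in ROWS]
for name, ratio in ratios:
    print(f"{name:<18}{ratio:>8.1f}")
```

Under these assumptions Reducio-VAE stores one latent value per 768 input values, against 12 to 384 for the other rows. The ratio ignores the content frame that Reducio-VAE is conditioned on.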


	## Citation

	BibTeX:

	```
	@article{tian2024reducio,
	title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents},
	author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
	journal={arXiv preprint arXiv:2411.13552},
	year={2024}
	}
	```