---
license: mit
tags:
- VAE
- Video-Generation
---

# Reducio-VAE Model Card

Reducio-VAE is a 3D VAE that encodes a video into a compact latent space, conditioned on a content frame. It maps a video of size \\(T\times H\times W\\) to a latent of size \\(\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}\\), which is a \\(4\times32\times32 = 4096\\)x reduction in the number of spatiotemporal positions.
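
The arithmetic behind the 4096x figure can be checked with a short sketch. The helper names `latent_shape` and `downsampling_factor` are hypothetical and not part of the Reducio-VAE codebase, the 16-frame 1024x1024 clip is only an illustrative input, and the divisibility check is an assumption about how inputs are sized rather than a documented requirement.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=32):
    """Return the (T/4, H/32, W/32) latent grid for a T x H x W video."""
    # Assumption: inputs are sized to be divisible by the downsampling factors.
    if frames % t_factor or height % s_factor or width % s_factor:
        raise ValueError("video dimensions must be divisible by the downsampling factors")
    return frames // t_factor, height // s_factor, width // s_factor


def downsampling_factor(t_factor=4, s_factor=32):
    """Total reduction in the number of spatiotemporal positions."""
    return t_factor * s_factor * s_factor


# Illustrative clip size (an assumption, not a documented requirement).
print(latent_shape(16, 1024, 1024))  # (4, 32, 32)
print(downsampling_factor())         # 4096
```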

It is part of [Reducio-DiT](https://arxiv.org/abs/2411.13552), a video generation method. The codebase is available [here](https://github.com/microsoft/Reducio-VAE).

## Model Details

### Model Sources

- **Repository:** [GitHub Repository](https://github.com/microsoft/Reducio-VAE)
- **Paper:** [arXiv](https://arxiv.org/abs/2411.13552)

## Uses

Common usage scenarios are described [here](https://github.com/microsoft/Reducio-VAE).

### Direct Use

The model is typically used to support the training of video diffusion models. After using it to encode your data into the latent space, you can train your own diffusion model on the extremely compressed latents.

## Results

Metrics on the 1K Pexels validation set and UCF-101:

|Method|Downsample Factor|\|z\||PSNR|SSIM|LPIPS|rFVD (Pexels)|rFVD (UCF-101)|
|---|---|---|---|---|---|---|---|
|SD2.1-VAE|1\*8\*8|4|29.23|0.82|0.09|25.96|21.00|
|SDXL-VAE|1\*8\*8|16|30.54|0.85|0.08|19.87|23.68|
|OmniTokenizer|4\*8\*8|8|27.11|0.89|0.07|23.88|30.52|
|OpenSora-1.2|4\*8\*8|16|30.72|0.85|0.11|60.88|67.52|
|Cosmos Tokenizer|8\*8\*8|16|30.84|0.74|0.12|29.44|22.06|
|Cosmos Tokenizer|8\*16\*16|16|28.14|0.65|0.18|77.87|119.37|
|Reducio-VAE|4\*32\*32|16|35.88|0.94|0.05|17.88|65.17|
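
For readers unfamiliar with the PSNR column, the sketch below computes the standard peak signal-to-noise ratio (in dB) between an original and a reconstructed signal. It is a minimal illustration only: the `psnr` helper is hypothetical, the 8-bit value range is an assumption, and the exact evaluation protocol behind the table (for example per-frame versus per-video averaging) is not specified in this card.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    reference = list(reference)
    reconstruction = list(reconstruction)
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of the same length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example with made-up pixel values (not data from the table above).
original = [0, 64, 128, 255]
decoded = [2, 62, 130, 253]
print(round(psnr(original, decoded), 2))  # 42.11
```

Higher PSNR means the reconstruction is closer to the original.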

## Citation

**BibTeX:**

```
@article{tian2024reducio,
  title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents},
  author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2411.13552},
  year={2024}
}
```