---
license: mit
tags:
- VAE
- Video-Generation
---
# Reducio-VAE Model Card
<!-- Provide a quick summary of what the model is/does. -->
Reducio-VAE is a 3D VAE that encodes a video into a compact latent space, conditioned on a content frame. For a video of size \\(T\times H\times W\\), the latent has size \\(\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}\\), which is a 4096x downsampling (\\(4\times32\times32\\)).
It is part of [Reducio-DiT](https://arxiv.org/abs/2411.13552), a video generation method. The codebase is available [here](https://github.com/microsoft/Reducio-VAE).
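
The arithmetic behind the 4096x figure can be checked with a minimal sketch. The helper names `latent_grid` and `downsample_factor` are hypothetical and are not part of the Reducio-VAE codebase; the divisibility check is an assumption made to keep the example simple, not a documented requirement of the model.

```python
# Illustrative arithmetic only; these helpers are hypothetical and
# do not call or reproduce the actual Reducio-VAE API.

def latent_grid(frames, height, width, t_factor=4, s_factor=32):
    """Return the (T/4, H/32, W/32) latent grid for a video of the given size."""
    # Assumption: inputs are exact multiples of the downsampling factors.
    if frames % t_factor or height % s_factor or width % s_factor:
        raise ValueError("expected sizes divisible by the downsampling factors")
    return frames // t_factor, height // s_factor, width // s_factor


def downsample_factor(t_factor=4, s_factor=32):
    """Number of input positions represented by one latent position."""
    return t_factor * s_factor * s_factor


grid = latent_grid(16, 1024, 1024)
print(grid)                 # (4, 32, 32)
print(downsample_factor())  # 4096
```

Under these assumptions, a 16-frame 1024x1024 clip maps to a 4x32x32 latent grid, so each latent position covers 4096 input positions. The content frame used for conditioning is not counted in this calculation.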
## Model Details
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [GitHub Repository](https://github.com/microsoft/Reducio-VAE)
- **Paper:** [arXiv](https://arxiv.org/abs/2411.13552)
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
A common usage scenario is described [here](https://github.com/microsoft/Reducio-VAE).
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model is typically used to support the training of a video diffusion model. After using it to encode your data into the latent space, you can train your own diffusion model on the extremely compressed latents.
## Results
<!-- This section describes the evaluation protocols and provides the results. -->
Reconstruction metrics on the 1K Pexels validation set and UCF-101:
|Method|Downsample Factor|\|z\||PSNR |SSIM |LPIPS |rFVD (Pexels)|rFVD (UCF-101)|
|---------|---------------------|------------------|------------|--------------------|--------------|----------------|------------|
|SD2.1-VAE|1\*8\*8|4|29.23|0.82|0.09|25.96|21.00|
|SDXL-VAE|1\*8\*8|16|30.54|0.85|0.08|19.87|23.68|
|OmniTokenizer|4\*8\*8|8|27.11|0.89|0.07|23.88|30.52|
|OpenSora-1.2|4\*8\*8|16|30.72|0.85|0.11|60.88|67.52|
|Cosmos Tokenizer|8\*8\*8|16|30.84|0.74|0.12|29.44|22.06|
|Cosmos Tokenizer|8\*16\*16|16|28.14|0.65|0.18|77.87|119.37|
|Reducio-VAE|4\*32\*32|16|35.88|0.94|0.05|17.88|65.17|
## Citation
**BibTeX:**
```
@article{tian2024reducio,
title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents},
author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2411.13552},
year={2024}
}
```