metadata
license: mit
tags:
- VAE
- Video-Generation
Reducio-VAE Model Card
This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It compresses a video by a factor of 4×32×32 (temporal × height × width), which amounts to 4096x downsampling overall. It is part of Reducio-DiT, a video generation method. Codebase available here.
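To make the compression factor concrete, the following minimal sketch computes the latent grid size for a clip. It only does the arithmetic implied by the 4*32*32 downsample factor; it does not call the Reducio codebase. The helper name `latent_grid` and the 16-frame 1024x1024 clip are hypothetical, and the sketch assumes every dimension divides evenly by its factor (the actual model may pad or treat the content frame separately).

```python
from math import prod

# Downsample factors (temporal, height, width) listed for Reducio-VAE.
FACTORS = (4, 32, 32)


def latent_grid(frames, height, width, factors=FACTORS):
    """Return the (T, H, W) latent grid size for a clip.

    Hypothetical helper: assumes each dimension is evenly divisible
    by its downsample factor.
    """
    dims = (frames, height, width)
    for size, factor in zip(dims, factors):
        if size % factor != 0:
            raise ValueError(f"{size} is not divisible by {factor}")
    return tuple(size // factor for size, factor in zip(dims, factors))


total_factor = prod(FACTORS)          # 4 * 32 * 32
grid = latent_grid(16, 1024, 1024)    # hypothetical 16-frame 1024x1024 clip

print(total_factor)  # 4096
print(grid)          # (4, 32, 32)
```

Under these assumptions, a 16-frame 1024x1024 clip maps to a 4x32x32 grid of latent positions, i.e. 4096 times fewer positions than input pixels per channel.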
Model Details
Model Sources
- Repository: GitHub Repository
- Paper: arXiv
Uses
The common use scenario is described below.
Direct Use
The model is typically used to support training a video diffusion model. After encoding your data into the latent space with this model, you can train your own diffusion model on the extremely compressed latents.
Results
Metrics on 1K Pexels validation set and UCF-101:
Method | Downsample Factor | \|z\| | PSNR | SSIM | LPIPS | rFVD (Pexels) | rFVD (UCF-101) |
---|---|---|---|---|---|---|---|
SD2.1-VAE | 1×8×8 | 4 | 29.23 | 0.82 | 0.09 | 25.96 | 21.00 |
SDXL-VAE | 1×8×8 | 16 | 30.54 | 0.85 | 0.08 | 19.87 | 23.68 |
OmniTokenizer | 4×8×8 | 8 | 27.11 | 0.89 | 0.07 | 23.88 | 30.52 |
OpenSora-1.2 | 4×8×8 | 16 | 30.72 | 0.85 | 0.11 | 60.88 | 67.52 |
Cosmos Tokenizer | 8×8×8 | 16 | 30.84 | 0.74 | 0.12 | 29.44 | 22.06 |
Cosmos Tokenizer | 8×16×16 | 16 | 28.14 | 0.65 | 0.18 | 77.87 | 119.37 |
Reducio-VAE | 4×32×32 | 16 | 35.88 | 0.94 | 0.05 | 17.88 | 65.17 |
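
For comparing the rows at a glance, the sketch below multiplies out each downsample factor from the table. It reads the factors as temporal, height, width, which is an interpretation of the table notation rather than something the card states. The names `ROWS` and `total_downsample` are illustrative only.

```python
from math import prod

# (method, (temporal, height, width)) pairs copied from the table.
ROWS = [
    ("SD2.1-VAE", (1, 8, 8)),
    ("SDXL-VAE", (1, 8, 8)),
    ("OmniTokenizer", (4, 8, 8)),
    ("OpenSora-1.2", (4, 8, 8)),
    ("Cosmos Tokenizer", (8, 8, 8)),
    ("Cosmos Tokenizer", (8, 16, 16)),
    ("Reducio-VAE", (4, 32, 32)),
]


def total_downsample(factors):
    """Overall reduction in the number of spatiotemporal positions."""
    return prod(factors)


totals = [total_downsample(factors) for _, factors in ROWS]

for (method, factors), total in zip(ROWS, totals):
    label = "*".join(str(f) for f in factors)
    print(f"{method:<18} {label:<9} {total}x")
```

Read this way, Reducio-VAE reduces the number of positions by 4096x, twice the 2048x of the most aggressive Cosmos Tokenizer setting in the table.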
Citation
BibTeX:
@article{tian2024reducio,
title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents},
author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2411.13552},
year={2024}
}