---
license: mit
tags:
- VAE
- Video-Generation
---

# Reducio-VAE Model Card

This model is a 3D VAE that encodes video into a compact latent space conditioned on a content frame. It compresses a video of size \\(T\times H\times W\\) into a latent of size \\(\frac{T}{4}\times\frac{H}{32}\times\frac{W}{32}\\), which amounts to 4096x downsampling. It is part of [Reducio-DiT](https://arxiv.org/abs/2411.13552), a video generation method. The codebase is available [here](https://github.com/microsoft/Reducio-VAE).

## Model Details

### Model Sources

- **Repository:** [GitHub Repository](https://github.com/microsoft/Reducio-VAE)
- **Paper:** [arXiv](https://arxiv.org/abs/2411.13552)

## Uses

A common usage scenario is described [here](https://github.com/microsoft/Reducio-VAE).

### Direct Use

The model is typically used to support the training of a video diffusion model. After using it to encode your data into the latent space, you can train your own diffusion model on the extremely compressed latents.

## Results

Metrics on the 1K Pexels validation set and UCF-101:

|Method|Downsample Factor|\|z\||PSNR|SSIM|LPIPS|rFVD (Pexels)|rFVD (UCF-101)|
|---|---|---|---|---|---|---|---|
|SD2.1-VAE|1\*8\*8|4|29.23|0.82|0.09|25.96|21.00|
|SDXL-VAE|1\*8\*8|16|30.54|0.85|0.08|19.87|23.68|
|OmniTokenizer|4\*8\*8|8|27.11|0.89|0.07|23.88|30.52|
|OpenSora-1.2|4\*8\*8|16|30.72|0.85|0.11|60.88|67.52|
|Cosmos Tokenizer|8\*8\*8|16|30.84|0.74|0.12|29.44|22.06|
|Cosmos Tokenizer|8\*16\*16|16|28.14|0.65|0.18|77.87|119.37|
|Reducio-VAE|4\*32\*32|16|35.88|0.94|0.05|17.88|65.17|

## Citation

**BibTeX:**

```
@article{tian2024reducio,
  title={REDUCIO! Generating 1024*1024 Video within 16 Seconds using Extremely Compressed Motion Latents},
  author={Tian, Rui and Dai, Qi and Bao, Jianmin and Qiu, Kai and Yang, Yifan and Luo, Chong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2411.13552},
  year={2024}
}
```
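
## Worked Example: Compression Factor

The 4096x figure quoted in the model description follows from the 4\*32\*32 downsample factor listed for Reducio-VAE in the results table. The sketch below is plain arithmetic, not the Reducio-VAE API: `latent_shape` and `downsample_factor` are hypothetical helper names, the 16-frame 1024\*1024 clip is an illustrative input, and the divisibility check is an assumption (the actual code may pad or crop instead). It also ignores the content frame that the model is conditioned on.

```python
# Hypothetical helpers illustrating the 4*32*32 downsample factor.
# They do not call the Reducio-VAE code; they only reproduce the arithmetic.

T_FACTOR = 4    # temporal downsampling, from the "4*32*32" table entry
S_FACTOR = 32   # spatial downsampling per axis, from the same entry


def latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return the (T/4, H/32, W/32) latent grid for a video of the given size."""
    # Assumption: inputs divide evenly; the real pipeline may handle this differently.
    if frames % T_FACTOR or height % S_FACTOR or width % S_FACTOR:
        raise ValueError("frames must be divisible by 4, height and width by 32")
    return frames // T_FACTOR, height // S_FACTOR, width // S_FACTOR


def downsample_factor() -> int:
    """Number of input positions represented by one latent position."""
    return T_FACTOR * S_FACTOR * S_FACTOR


if __name__ == "__main__":
    print(latent_shape(16, 1024, 1024))  # (4, 32, 32)
    print(downsample_factor())           # 4096
```

Under these assumptions, a 16-frame 1024\*1024 clip maps to a 4\*32\*32 latent grid, and each latent position stands in for 4096 input positions.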