Difference between this model's VAE and modelscope's VQGAN
#10 · by bacprop
I've been trying to run this model both through the Hugging Face API and through A1111, and I've noticed that the A1111 text2video extension requires downloading the original ModelScope VQGAN autoencoder file (5.21 GB, here: https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/tree/main), whereas this zeroscope model ships a much smaller VAE file (167 MB, here: https://huggingface.co/cerspense/zeroscope_v2_576w/tree/main/vae). Are these two autoencoder files equivalent despite the size difference? If not, does that mean the A1111 pipeline differs from the Hugging Face pipeline in computation and results, or is the original ModelScope VQGAN file not actually used by A1111?
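For reference, this is roughly how I'm running it on the Hugging Face side (a minimal sketch assuming the diffusers library; the parameter-count check at the end is just something I added to compare against the file sizes, not part of either pipeline):

```python
import torch
from diffusers import DiffusionPipeline

# Load the zeroscope checkpoint as on the model card.
pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)

# Inspect the autoencoder that diffusers actually uses for this model.
print(type(pipe.vae).__name__)
n_params = sum(p.numel() for p in pipe.vae.parameters())
print(f"VAE parameters: {n_params / 1e6:.1f}M")  # should roughly line up with the ~167 MB file
```

That small VAE is what makes me wonder why A1111 needs the full 5.21 GB VQGAN file at all.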