matlok's Collections

Papers - Video - Encoders - C-ViViT

The embeddings of image and video patches extracted from the raw frames x are processed first by a spatial transformer and then by a causal transformer (autoregressive in time) to generate video tokens.
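
The spatial-then-causal ordering described above can be sketched in NumPy. This is only an illustrative sketch, not the paper's implementation: the function names (`attention`, `encode`), the single-head attention with identity projections, the patch size, and the random embedding matrix `w_embed` are all assumptions, and any step that quantizes the outputs into discrete tokens is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(z, mask=None):
    # Single-head self-attention over the second-to-last axis.
    # Identity Q/K/V projections are an assumption made for brevity.
    d = z.shape[-1]
    scores = z @ np.swapaxes(z, -1, -2) / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ z

def encode(x, patch, w_embed):
    # x: raw frames with shape (T, H, W, C).
    T, H, W, C = x.shape
    hp, wp = H // patch, W // patch
    # Cut each frame into non-overlapping patches and flatten them.
    p = x.reshape(T, hp, patch, wp, patch, C).transpose(0, 1, 3, 2, 4, 5)
    p = p.reshape(T, hp * wp, patch * patch * C)
    z = p @ w_embed                       # patch embeddings, (T, N, d)
    # Spatial transformer step: patches attend within their own frame.
    z = z + attention(z)
    # Causal transformer step: each patch position attends over time,
    # restricted to the current and earlier frames.
    zt = np.swapaxes(z, 0, 1)             # (N, T, d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    zt = zt + attention(zt, mask=causal)
    return np.swapaxes(zt, 0, 1)          # (T, N, d)

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8, 3))    # hypothetical toy video
w_embed = rng.normal(size=(4 * 4 * 3, 16)) * 0.1
tokens = encode(frames, patch=4, w_embed=w_embed)
print(tokens.shape)
```

Because the temporal step is masked causally, changing later frames leaves the outputs for earlier frames unchanged, which is the property that makes the encoder autoregressive in time.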