Video Inference or training
#4 opened by dosun
I have tried to run this model for video captioning. However, it only returns a caption for each individual frame. In the original paper, the model supports video by taking multiple frames as input. Is this supported in the Hugging Face implementation as well?
Hi,
For video captioning, I'd recommend taking a look at the GIT checkpoints fine-tuned on video datasets, like https://huggingface.co/microsoft/git-base-vatex. These checkpoints accept multiple frames at once and generate a single caption for the clip.
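A minimal sketch of how this can be done with a video checkpoint, assuming the GIT convention that `pixel_values` for video carries an extra frames dimension, i.e. shape `(batch, num_frames, channels, height, width)`. The random frames here are a stand-in for frames sampled from a real video:

```python
import numpy as np
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vatex")

# Sample a handful of frames from the video; dummy RGB frames used here.
num_frames = 6
frames = [
    np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
    for _ in range(num_frames)
]

# The processor returns (num_frames, 3, H, W); add a batch dimension so the
# model sees all frames as one video rather than a batch of single images.
pixel_values = processor(images=frames, return_tensors="pt").pixel_values
pixel_values = pixel_values.unsqueeze(0)  # (1, num_frames, 3, 224, 224)

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```

The key difference from single-image captioning is the `unsqueeze(0)`: without it, each frame is treated as a separate batch item and you get one caption per frame, which matches the behavior described in the question.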