Sparse sampling question
Hi all - great work here. I had one clarification question about the sampling method used during training. The paper and the model card say that "8 frames per video" were used. Are these frames sampled consecutively from a single point in the video at a high frame rate (e.g. frames 21, 22, 23, ..., 28), or sparsely from different points across the video (e.g. frames 1, 200, 400, 600, ...)? In the cited works that use sparse sampling, single frames appear to be extracted from different salient "segments" of the video, but in the demo notebook here: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/X-CLIP/Video_text_matching_with_X_CLIP.ipynb, frames are extracted consecutively at a high frame rate. I just want to make sure that, at inference time, I'm sampling frames the same way the model saw them during training. Thank you!
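To make the distinction concrete, here is a rough sketch of the two strategies I mean, index logic only. I'm not claiming either one matches the actual training pipeline; the `frame_sample_rate` parameter and the even segment split are just my assumptions (the consecutive version loosely follows the helper in the demo notebook, the sparse version is TSN-style segment sampling).

```python
import numpy as np

def consecutive_indices(clip_len, frame_sample_rate, seg_len, seed=0):
    """Dense sampling: pick a random clip of `clip_len` frames at a fixed
    stride (`frame_sample_rate`) from a video with `seg_len` total frames."""
    rng = np.random.default_rng(seed)
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = rng.integers(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    return np.clip(indices, start_idx, end_idx - 1).astype(np.int64)

def sparse_indices(clip_len, seg_len, seed=0):
    """Sparse (TSN-style) sampling: split the video into `clip_len` equal
    segments and draw one frame from each segment."""
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, seg_len, num=clip_len + 1, dtype=np.int64)
    return np.array([rng.integers(lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])])

# e.g. an 800-frame video, 8 frames per clip
print(consecutive_indices(clip_len=8, frame_sample_rate=1, seg_len=800))
print(sparse_indices(clip_len=8, seg_len=800))
```

The two produce very different inputs (a ~1-second burst vs. coverage of the whole video), which is why I'd like to match whatever was used for training.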
A very good question that remains unanswered. The only hint I could find is the default value in the reference implementation's training arguments: https://github.com/xuguohai/X-CLIP/blob/6b5344f44537d758acb82d115b8484f7430f9fb0/main_xclip.py#L47
Did Microsoft use that default when they trained the model? Nobody knows.