Sparse sampling question

#1
by jwilkins96 - opened

Hi all - great work here. I had one clarification question about the sampling method used during training. The paper and the model card here say that "8 frames per video" were used. Are these frames sampled consecutively from one point in the video at a high frame rate (i.e. frames 21, 22, 23, ... 28), or taken sparsely from different points across the video (i.e. frames 1, 200, 400, 600, ...)? In the cited works that use sparse sampling, single frames appear to be extracted from different salient "segments" of the video, but in the demo notebook here: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/X-CLIP/Video_text_matching_with_X_CLIP.ipynb, frames are extracted consecutively at a high frame rate. I just want to make sure that at inference time I'm sampling frames the same way the model saw them during training. Thank you!
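
For concreteness, here is a rough sketch of the two strategies I mean. The first function mirrors the kind of strided-window sampling the demo notebook does; the second is a TSN-style sparse scheme (one frame per equal segment). The index math and function names are my own illustration, not taken from the model card or the training code.

```python
import numpy as np

def consecutive_indices(clip_len, frame_sample_rate, seg_len, seed=0):
    """Dense sampling: clip_len frames from one random window of the
    video, strided by frame_sample_rate (as in the demo notebook)."""
    rng = np.random.default_rng(seed)
    converted_len = int(clip_len * frame_sample_rate)
    end_idx = rng.integers(converted_len, seg_len)
    start_idx = end_idx - converted_len
    indices = np.linspace(start_idx, end_idx, num=clip_len)
    return np.clip(indices, start_idx, end_idx - 1).astype(np.int64)

def sparse_indices(clip_len, seg_len, seed=0):
    """Sparse (TSN-style) sampling: split the video into clip_len
    equal segments and pick one random frame from each segment."""
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, seg_len, num=clip_len + 1).astype(np.int64)
    return np.array([rng.integers(lo, max(lo + 1, hi))
                     for lo, hi in zip(bounds[:-1], bounds[1:])])

# e.g. a 600-frame video, 8 frames per clip
print(consecutive_indices(clip_len=8, frame_sample_rate=1, seg_len=600))
print(sparse_indices(clip_len=8, seg_len=600))
```

The first prints 8 adjacent frame indices from one short window; the second prints 8 indices spread across the whole video. My question is which of these the model was trained with.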

A very good question that remains unanswered. The only hint I could find is the default value in a reference implementation's training args: https://github.com/xuguohai/X-CLIP/blob/6b5344f44537d758acb82d115b8484f7430f9fb0/main_xclip.py#L47
Did Microsoft use that default when they trained the model? Nobody knows.