Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published 13 days ago • 51
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions Paper • 2410.10816 • Published 2 days ago • 11