TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published 6 days ago • 19
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published 14 days ago • 86
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective Paper • 2410.12490 • Published 21 days ago • 8
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio Paper • 2410.12787 • Published 20 days ago • 30
A Controlled Study on Long Context Extension and Generalization in LLMs Paper • 2409.12181 • Published Sep 18 • 43
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages Paper • 2407.19672 • Published Jul 29 • 54
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination Paper • 2406.05132 • Published Jun 7 • 27
What If We Recaption Billions of Web Images with LLaMA-3? Paper • 2406.08478 • Published Jun 12 • 39
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs Paper • 2406.07476 • Published Jun 11 • 32
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization Paper • 2402.03161 • Published Feb 5 • 14
VideoPoet: A Large Language Model for Zero-Shot Video Generation Paper • 2312.14125 • Published Dec 21, 2023 • 44
Reasons to Reject? Aligning Language Models with Judgments Paper • 2312.14591 • Published Dec 22, 2023 • 17
Mamba: Linear-Time Sequence Modeling with Selective State Spaces Paper • 2312.00752 • Published Dec 1, 2023 • 138