Collections
Collections including paper arxiv:2406.07476
- MotionLLM: Understanding Human Behaviors from Human Motions and Videos
  Paper • 2405.20340 • Published • 19
- Spectrally Pruned Gaussian Fields with Neural Compensation
  Paper • 2405.00676 • Published • 8
- Paint by Inpaint: Learning to Add Image Objects by Removing Them First
  Paper • 2404.18212 • Published • 27
- LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
  Paper • 2405.00732 • Published • 118

- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
  Paper • 2406.07476 • Published • 32
- Improving Retrieval Augmented Language Model with Self-Reasoning
  Paper • 2407.19813 • Published • 6
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
  Paper • 2408.07199 • Published • 20

- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 12
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 53
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 85
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 30

- Video as the New Language for Real-World Decision Making
  Paper • 2402.17139 • Published • 18
- Learning and Leveraging World Models in Visual Representation Learning
  Paper • 2403.00504 • Published • 31
- MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
  Paper • 2403.01422 • Published • 26
- VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models
  Paper • 2403.05438 • Published • 18

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 38
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19

- World Model on Million-Length Video And Language With RingAttention
  Paper • 2402.08268 • Published • 36
- Improving Text Embeddings with Large Language Models
  Paper • 2401.00368 • Published • 79
- Chain-of-Thought Reasoning Without Prompting
  Paper • 2402.10200 • Published • 99
- FiT: Flexible Vision Transformer for Diffusion Model
  Paper • 2402.12376 • Published • 48

- PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
  Paper • 2402.08714 • Published • 10
- Data Engineering for Scaling Language Models to 128K Context
  Paper • 2402.10171 • Published • 21
- RLVF: Learning from Verbal Feedback without Overgeneralization
  Paper • 2402.10893 • Published • 10
- Coercing LLMs to do and reveal (almost) anything
  Paper • 2402.14020 • Published • 12