- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 71
- BRAVE: Broadening the visual encoding of vision-language models
  Paper • 2404.07204 • Published • 18
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 44
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 99
Collections
Collections including paper arxiv:2407.01449
- RLHF Workflow: From Reward Modeling to Online RLHF
  Paper • 2405.07863 • Published • 67
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 126
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 53
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 85

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 25
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 12
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 38
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 19

- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
  Paper • 2407.21770 • Published • 22
- VILA^2: VILA Augmented VILA
  Paper • 2407.17453 • Published • 38
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
  Paper • 2407.08583 • Published • 10
- Vision language models are blind
  Paper • 2407.06581 • Published • 82