Models
Datasets
Spaces
Posts
Docs
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2406.16852

Papers I want to read

Papers in my to-read list

RLHF Workflow: From Reward Modeling to Online RLHF

Paper • 2405.07863 • Published May 13 • 67
Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper • 2405.09818 • Published May 16 • 126
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24 • 53
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 85

long-context-mllm

Visual Context Window Extension: A New Perspective for Long Video Understanding

Paper • 2409.20018 • Published Sep 30 • 8
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Paper • 2409.02889 • Published Sep 4 • 54
Long Context Transfer from Language to Vision

Paper • 2406.16852 • Published Jun 24 • 32
lmms-lab/LongVA-7B-DPO

Text Generation • Updated Jun 26 • 4.42k • 7

interesting Video-LLMs

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Paper • 2406.12275 • Published Jun 18 • 29
VILA: On Pre-training for Visual Language Models

Paper • 2312.07533 • Published Dec 12, 2023 • 20
LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19 • 51
Long Context Transfer from Language to Vision

Paper • 2406.16852 • Published Jun 24 • 32

Multi-modality LVM

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Paper • 2406.12275 • Published Jun 18 • 29
TroL: Traversal of Layers for Large Language and Vision Models

Paper • 2406.12246 • Published Jun 18 • 34
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Paper • 2406.15334 • Published Jun 21 • 8
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Paper • 2406.12742 • Published Jun 18 • 14

Long Context Transfer From Text To Vision: https://lmms-lab.github.io/posts/longva/

Long Context Transfer from Language to Vision

Paper • 2406.16852 • Published Jun 24 • 32
lmms-lab/LongVA-7B

Text Generation • Updated Jun 26 • 4.68k • 15
lmms-lab/LongVA-7B-DPO

Text Generation • Updated Jun 26 • 4.42k • 7
lmms-lab/v_niah_needles

Viewer • Updated Jun 15 • 5 • 770 • 3

Video Understanding

Vript: A Video Is Worth Thousands of Words

Paper • 2406.06040 • Published Jun 10 • 22
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Paper • 2406.04325 • Published Jun 6 • 71
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Paper • 2406.01574 • Published Jun 3 • 42
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Paper • 2405.21075 • Published May 31 • 18

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24 • 12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24 • 53
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27 • 85
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27 • 30

Video as the New Language for Real-World Decision Making

Paper • 2402.17139 • Published Feb 27 • 18
Learning and Leveraging World Models in Visual Representation Learning

Paper • 2403.00504 • Published Mar 1 • 31
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Paper • 2403.01422 • Published Mar 3 • 26
VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Paper • 2403.05438 • Published Mar 8 • 18

about 17 hours ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6 • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6 • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7 • 38
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7 • 19

Daily paper that worth reading in details later

Neural Network Diffusion

Paper • 2402.13144 • Published Feb 20 • 94
Genie: Generative Interactive Environments

Paper • 2402.15391 • Published Feb 23 • 71
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Paper • 2402.17177 • Published Feb 27 • 88
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

Paper • 2403.00522 • Published Mar 1 • 44

Previous
1
2
Next

Company

© Hugging Face

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs