PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Abstract
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally localize objects in videos following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
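The abstract describes the overall data flow: audio is transcribed into text, video frames plus the transcript condition the LMM's answer, and phrases referring to objects are localized spatially by a grounding module and associated over time by an off-the-shelf tracker. The abstract does not specify component interfaces, so the sketch below is purely illustrative; every class and function name (`transcribe_audio`, `VideoLMM`, `GroundingModule`, `Tracker`, `run_pipeline`) is a hypothetical placeholder, not the authors' API.

```python
# Illustrative sketch of the pipeline described in the abstract.
# All components are hypothetical stand-ins that only show the data flow:
#   audio -> transcript, frames + transcript -> LMM answer,
#   referred phrase -> per-frame boxes -> tracked boxes over time.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """A single spatial detection: frame index plus (x1, y1, x2, y2)."""
    frame_idx: int
    xyxy: Tuple[float, float, float, float]


def transcribe_audio(video_path: str) -> str:
    """Placeholder for an off-the-shelf ASR model that turns audio into text."""
    return "transcribed audio goes here"


class VideoLMM:
    """Placeholder video LMM that consumes frames, the transcript, and a prompt."""

    def answer(self, frames: List, transcript: str, prompt: str) -> str:
        return f"(answer conditioned on {len(frames)} frames and the audio transcript)"


class GroundingModule:
    """Placeholder grounding module that localizes a referred phrase per frame."""

    def ground(self, frames: List, phrase: str) -> List[Box]:
        return [Box(frame_idx=i, xyxy=(0.0, 0.0, 1.0, 1.0)) for i, _ in enumerate(frames)]


class Tracker:
    """Placeholder off-the-shelf tracker that links per-frame boxes over time."""

    def track(self, detections: List[Box]) -> List[Box]:
        return detections  # identity link-up, just to keep the sketch runnable


def run_pipeline(video_path: str, frames: List, prompt: str, phrase: str) -> List[Box]:
    transcript = transcribe_audio(video_path)         # audio cue -> text
    lmm_answer = VideoLMM().answer(frames, transcript, prompt)
    print("LMM answer:", lmm_answer)
    boxes = GroundingModule().ground(frames, phrase)  # spatial localization
    return Tracker().track(boxes)                     # temporal association


if __name__ == "__main__":
    tracked = run_pipeline("clip.mp4", frames=[None] * 8,
                           prompt="What is the man doing?", phrase="the man")
    print(f"{len(tracked)} tracked boxes")
```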
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VLM-Eval: A General Evaluation on Video Large Language Models (2023)
- Vamos: Versatile Action Models for Video Understanding (2023)
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (2023)
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models (2023)
- MM-VID: Advancing Video Understanding with GPT-4V(ision) (2023)
- VTimeLLM: Empower LLM to Grasp Video Moments (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
Is there any online demo of this paper?