Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Abstract
Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses against the corresponding videos has not been conclusively established. This paper introduces a novel framework that uses detailed video captions as a proxy for video content, enabling language models to incorporate this information as supporting evidence when scoring video question-answering (QA) predictions. Our approach aligns closely with the reward mechanism of OpenAI's GPT-4V model, which takes video frames directly as input. Furthermore, we show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.
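The pipeline the abstract describes can be sketched in two pieces: a language model scores candidate answers against a detailed caption to produce (chosen, rejected) pairs, and the standard DPO objective is then applied to those pairs. Below is a minimal numeric sketch; the function names, the scoring helper, and the choice of `beta` are illustrative assumptions, not details taken from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    logp_w, logp_l         : policy log-probs of chosen/rejected responses
    ref_logp_w, ref_logp_l : log-probs under the frozen reference model
    beta                   : temperature controlling deviation from the reference
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

def pick_preference(scored_candidates):
    """Illustrative helper: given (answer, caption-based reward) pairs,
    return the highest-scored answer as 'chosen' and lowest as 'rejected'."""
    ranked = sorted(scored_candidates, key=lambda p: p[1], reverse=True)
    return ranked[0][0], ranked[-1][0]
```

For example, with identical policy and reference log-probs the margin is zero and the loss reduces to log 2, its value before any preference signal is learned.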
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Aligning Modalities in Vision Large Language Models via Preference Fine-tuning (2024)
- Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering (2024)
- Multi-modal preference alignment remedies regression of visual instruction tuning on language model (2024)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback (2024)
- LVCHAT: Facilitating Long Video Comprehension (2024)
Models citing this paper: 3
Datasets citing this paper: 2
Spaces citing this paper: 0