arXiv:2311.08172

Vision-Language Instruction Tuning: A Review and Analysis

Published on Nov 14, 2023

Abstract

Instruction tuning is an essential supervised training phase for Large Language Models (LLMs), aimed at enhancing their capacity to generalize across instructions and to adapt to user preferences. With the growing incorporation of multi-modal data into LLMs, there is increasing interest in vision-language instruction tuning, which presents more complex characteristics than pure-text instructions. In this paper, we systematically review the latest vision-language instruction tuning settings and datasets in multi-modal LLMs and summarize the characteristics that high-quality vision-language tuning data should have. We treat these characteristics as foundational principles for constructing vision-language instruction data and propose a complete construction pipeline consisting of data collection, instruction generation, and quality control modules, the last of which incorporates carefully designed evaluation indicators for instruction properties. We perform vision-language instruction tuning on three widely used multi-modal LLMs with the instruction data we construct, and we conduct extensive experiments on the corresponding metrics to demonstrate the soundness of the proposed construction principles. The code and dataset related to this paper are open-sourced at https://github.com/palchenli/VL-Instruction-Tuning.
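
The abstract describes a three-stage pipeline: data collection, instruction generation, and quality control driven by instruction-property indicators. As a rough sketch only, the Python below shows how such a pipeline could be wired together; the function names, the VLInstruction container, and the toy "length" indicator are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, Tuple

@dataclass
class VLInstruction:
    """One vision-language instruction-tuning sample (hypothetical container)."""
    image_path: str
    instruction: str
    response: str
    scores: Dict[str, float] = field(default_factory=dict)

def collect_data(sources: Iterable[Iterable[Tuple[str, str]]]) -> Iterator[Tuple[str, str]]:
    """Data collection: merge (image_path, annotation) pairs from raw sources."""
    for source in sources:
        yield from source

def generate_instructions(
    pairs: Iterable[Tuple[str, str]],
    generator: Callable[[str], Tuple[str, str]],
) -> Iterator[VLInstruction]:
    """Instruction generation: turn each annotation into an instruction/response pair."""
    for image_path, annotation in pairs:
        instruction, response = generator(annotation)
        yield VLInstruction(image_path, instruction, response)

def quality_control(
    samples: Iterable[VLInstruction],
    indicators: Dict[str, Callable[[VLInstruction], float]],
    thresholds: Dict[str, float],
) -> Iterator[VLInstruction]:
    """Quality control: score each sample on instruction-property indicators
    and keep only samples that pass every threshold."""
    for sample in samples:
        sample.scores = {name: fn(sample) for name, fn in indicators.items()}
        if all(sample.scores[name] >= cutoff for name, cutoff in thresholds.items()):
            yield sample

if __name__ == "__main__":
    # Toy run with stand-in components (all hypothetical).
    sources = [[("img_001.jpg", "a dog catching a frisbee in a park")]]
    template = lambda ann: ("Describe the image in one sentence.", ann.capitalize() + ".")
    indicators = {"length": lambda s: float(len(s.response.split()))}
    thresholds = {"length": 4.0}

    pipeline = quality_control(
        generate_instructions(collect_data(sources), template),
        indicators,
        thresholds,
    )
    for sample in pipeline:
        print(sample)
```

Structuring each module as a generator keeps the stages composable: indicators and thresholds can be swapped in the quality-control step without touching collection or generation.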
