metadata
tags:
  - image-to-text
  - visual-question-answering
  - image-captioning
datasets:
  - kaist-ai/volcano-train
language:
  - en
pipeline_tag: image-to-text
library_name: transformers

Links for Reference

  • Repository:
  • Paper:

Overview

Volcano employs a single LMM to generate the initial response, the feedback, the revision, and the decision on whether to accept the revision. These steps run sequentially in an iterative critique-revise-decide loop.
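The loop described above can be sketched roughly as follows. This is a minimal illustration, not the released inference code: the `generate` callable, the prompt wording, the `max_iters` cap, and the rule of stopping at the first rejected revision are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical interface: takes (image, prompt) and returns the model's text.
# The real model's calling convention may differ.
Generate = Callable[[object, str], str]


def self_revise(generate: Generate, image: object, question: str,
                max_iters: int = 3) -> str:
    """Sketch of a critique-revise-decide loop driven by a single model."""
    # Initial response to the user's question.
    response = generate(image, question)

    for _ in range(max_iters):  # max_iters is an assumed cap
        # Critique: ask the same model for feedback on the current response.
        feedback = generate(
            image,
            f"Question: {question}\nResponse: {response}\n"
            "Give feedback on the response.",
        )
        # Revise: rewrite the response in light of that feedback.
        revision = generate(
            image,
            f"Question: {question}\nResponse: {response}\n"
            f"Feedback: {feedback}\nRevise the response.",
        )
        # Decide: let the model choose between the current response (A)
        # and the revision (B).
        verdict = generate(
            image,
            f"Question: {question}\nA: {response}\nB: {revision}\n"
            "Which is better? Answer A or B.",
        )
        if verdict.strip().upper().startswith("B"):
            response = revision  # accept the revision and keep iterating
        else:
            break  # assumed stopping rule: stop once a revision is rejected

    return response
```

Passing the model in as a plain callable keeps the control flow separate from any particular inference backend, so the same loop can be exercised with a stub during testing.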

Model details

Model type: Volcano-7b is a multimodal self-feedback guided revision model. It was obtained by fine-tuning vicuna-7b-v1.5 on a mixture of the visual instruction tuning dataset used in LLaVA-v1.5 and multimodal feedback and revision data collected through gpt-3.5-turbo.

Model date: Volcano-7b was trained in October 2023.

Training dataset

  • 274K multimodal feedback and revision data.
  • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
  • 158K GPT-generated multimodal instruction-following data.
  • 450K academic-task-oriented VQA data mixture.
  • 40K ShareGPT data.

The dataset used to train Volcano, which includes all of the datasets listed above, is available as kaist-ai/volcano-train.
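To get a feel for the composition of the mixture, the listed counts can be tallied as below. This is only back-of-the-envelope arithmetic on the rounded figures above; it assumes the five subsets are disjoint and treats them as a single pool, which may not match how they were actually used across training stages. The dictionary keys are shorthand labels chosen for the example.

```python
# Approximate example counts, in thousands, as listed in this model card.
mixture_k = {
    "multimodal feedback and revision": 274,
    "filtered image-text pairs (LAION/CC/SBU)": 558,
    "GPT-generated instruction following": 158,
    "academic-task-oriented VQA": 450,
    "ShareGPT": 40,
}

# Total size under the assumption that the subsets do not overlap.
total_k = sum(mixture_k.values())

# Share of each subset in the combined pool.
shares = {name: count / total_k for name, count in mixture_k.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: ~{total_k}K examples")
```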

Evaluation dataset

Volcano is evaluated on a collection of three multimodal hallucination benchmarks (MMHal-Bench, POPE, GAVIE) and two multimodal understanding benchmarks (MM-Vet, MMBench).