Model is even worse than InternVL-1.5 for multiple images

#6
by pseudotensor - opened

For example, send three images: a birthday cake, a receipt, and Big Ben. Asking "What tower do you see?" returns: "The image does not show any tower. It shows a receipt from a shopping store and a cake with a congratulatory message."
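
A minimal sketch of how this check could be scripted, in case it helps with reproduction. It assumes a hypothetical `ask` callable that takes a list of image paths and a question and returns the answer text; the actual model call is not shown here, and the file names and keyword list are placeholders. The stub simply replays the answer quoted above.

```python
from typing import Callable, Sequence

# Hypothetical signature (an assumption, not the model's real API):
# takes image paths and a question, returns the model's answer text.
AskFn = Callable[[Sequence[str], str], str]


def check_multi_image(
    ask: AskFn,
    images: Sequence[str],
    question: str,
    expected_keywords: Sequence[str],
) -> bool:
    """Return True if the answer mentions any expected keyword (case-insensitive)."""
    answer = ask(images, question).lower()
    return any(keyword.lower() in answer for keyword in expected_keywords)


def failing_stub(images: Sequence[str], question: str) -> str:
    # Stand-in that replays the reported answer; replace with a real model call.
    return (
        "The image does not show any tower. It shows a receipt from a "
        "shopping store and a cake with a congratulatory message."
    )


# Placeholder file names for the three images described in the report.
images = ["cake.jpg", "receipt.jpg", "bigben.jpg"]

result = check_multi_image(
    failing_stub,
    images,
    "What tower do you see?",
    ["big ben", "elizabeth tower", "clock tower"],
)
print(result)  # False: the reported answer never names the tower
```

Keyword matching is a crude pass/fail signal, but it is enough to flag the case where one of the images is ignored entirely.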

OpenGVLab org

Thank you for your feedback. Recently, I have been working on a quantitative assessment of the model's multi-image performance.

czczup changed discussion status to closed
