Discrepancy between the listed accuracy and my own results for LLaVA-OneVision-7b-ov on the BLINK benchmark
First of all, your leaderboard is mega helpful.
I am currently experimenting with LLaVA-OneVision on the BLINK benchmark. I took a closer look at the 0.5b-ov and 7b-ov checkpoints from Hugging Face with the latest transformers version.
I explicitly evaluated the subtasks Visual_Correspondence, Visual_Similarity, Jigsaw, and Multi-view_Reasoning on their test data by uploading my predictions to the BLINK-Benchmark Evaluation Challenge page.
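For context, this is roughly how I load and prompt the Hugging Face checkpoint in fp16 (a minimal sketch only; the model ID and chat-template usage follow my understanding of the llava-hf model cards, so please treat the details as assumptions rather than my exact evaluation script):

```python
# Minimal sketch of how I query the checkpoint in fp16; model ID and prompt
# handling follow the llava-hf model cards as far as I understand them.
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

MODEL_ID = "llava-hf/llava-onevision-qwen2-7b-ov-hf"  # 0.5b-ov analogously

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def answer(images, question):
    """Generate an answer for one BLINK multiple-choice question."""
    content = [{"type": "image"} for _ in images] + [{"type": "text", "text": question}]
    prompt = processor.apply_chat_template(
        [{"role": "user", "content": content}], add_generation_prompt=True
    )
    inputs = processor(images=images, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    return processor.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```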
I then compared the results with those on your leaderboard. Here are my results in fp16 precision for 7b-ov:
{
"test": {
"Visual_Similarity": 0.4632352941176471,
"Counting": 0,
"Relative_Depth": 0,
"Jigsaw": 0.5533333333333333,
"Art_Style": 0,
"Functional_Correspondence": 0,
"Semantic_Correspondence": 0,
"Spatial_Relation": 0,
"Object_Localization": 0,
"Visual_Correspondence": 0.27325581395348836,
"Multi-view_Reasoning": 0.5714285714285714,
"Relative_Reflectance": 0,
"Forensic_Detection": 0,
"IQ_Test": 0,
"Total": 0.1329466437737886
}
}
The leaderboard shows:
- "Visual_Similarity": 80,
- "Jigsaw": 62.7,
- "Visual_Correspondence": 47.7,
- "Multi-view_Reasoning": 54.1
And for the 0.5b-ov version, I got the following in fp16 precision:
{
"test": {
"Visual_Similarity": 0.4632352941176471,
"Counting": 0,
"Relative_Depth": 0,
"Jigsaw": 0.56,
"Art_Style": 0,
"Functional_Correspondence": 0,
"Semantic_Correspondence": 0,
"Spatial_Relation": 0,
"Object_Localization": 0,
"Visual_Correspondence": 0.3023255813953488,
"Multi-view_Reasoning": 0.47368421052631576,
"Relative_Reflectance": 0,
"Forensic_Detection": 0,
"IQ_Test": 0,
"Total": 0.12851750614566512
}
}
These roughly correspond to the accuracies on the leaderboard:
- "Visual_Similarity": 47.4,
- "Jigsaw": 52.7,
- "Visual_Correspondence": 28.5,
- "Multi-view_Reasoning": 45.1
I also noticed that the overall accuracies on the leaderboard do not match the results from the paper.
Paper (Table 4):
- LLaVA-OV-0.5B: 52.1
- LLaVA-OV-7B: 48.2
Leaderboard (Overall):
- LLaVA-OV-0.5B: 40.1
- LLaVA-OV-7B: 53
I don't want to rule out that this is a mistake on my part, but the discrepancy is very noticeable.
Hi @nicokossmann,
Currently, VLMEvalKit supports evaluation on the BLINK VAL split, and we have released all VLM predictions in this Hugging Face dataset:
https://huggingface.co/datasets/VLMEval/OpenVLMRecords/tree/main.
You can build submission files based on our released predictions and try submitting to the official evaluation site again.
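Roughly something like this (a sketch only; the file name, the column names in our released xlsx files, and the exact submission schema of the BLINK challenge may differ, so adjust as needed):

```python
# Sketch: convert one of the released prediction files into a submission JSON.
# The file name, column names ("index", "prediction") and the submission format
# are assumptions -- please check the actual xlsx and the BLINK challenge page.
import json
import re
import pandas as pd

df = pd.read_excel("llava_onevision_qwen2_7b_ov_BLINK.xlsx")  # hypothetical name

submission = {}
for _, row in df.iterrows():
    pred = str(row["prediction"])
    match = re.search(r"\b([A-D])\b", pred)  # first standalone option letter
    submission[str(row["index"])] = match.group(1) if match else pred.strip()

with open("submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```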
Also, I feel that the results reported in the LLaVA-OneVision paper are weird: 0.5B better than 7B on BLINK? I don't think that makes much sense.
Thank you for your response @KennyUTC.
I compared my results for the 7B model with yours. I plotted confusion matrices based on the multiple-choice answers on the val set:
one based on my predictions (first) and one on yours (second), for the subtasks described above; the per-subtask accuracy of your predictions matches the value on your leaderboard.
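For reference, this is roughly how I build the matrices (a sketch; it assumes the ground-truth answers and predictions are already parsed into aligned lists of option letters per subtask, which is not part of my original post):

```python
# Sketch: confusion matrix per subtask from aligned lists of option letters.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion(gt, pred, title, labels=("A", "B", "C", "D")):
    """gt / pred: lists of option letters for one subtask on the val split."""
    cm = confusion_matrix(gt, pred, labels=list(labels))
    ConfusionMatrixDisplay(cm, display_labels=list(labels)).plot(cmap="Blues")
    plt.title(title)
    plt.tight_layout()
    plt.show()

# e.g. plot_confusion(gt_answers["Jigsaw"], my_preds["Jigsaw"], "Jigsaw (my predictions)")
```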
For Jigsaw:
I wanted to ask whether you used the GitHub repo model or the Hugging Face checkpoint for your evaluation. Could you also evaluate the Hugging Face checkpoint (the one I used for my predictions) on your side, to check whether there is a systematic error on my end?