SpaceLLaVA-lite fine-tunes MobileVLM on a dataset designed with VQASynth to enhance spatial reasoning, as in SpatialVLM.
This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM and enhance the spatial reasoning of multimodal models. With a pipeline of expert models, we can infer spatial relationships between objects in a scene to create a VQA dataset for spatial reasoning.
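To make the idea of turning inferred spatial relationships into VQA data more concrete, here is a minimal sketch. It is not the VQASynth implementation: the object names, the 3D centroids, and the `make_distance_qa` helper are all hypothetical, and the sketch assumes that upstream expert models have already produced one 3D position (in meters) per object.

```python
import itertools
import math

# Hypothetical 3D object centroids in meters. In a real pipeline these would
# come from expert models; the values here are made up for illustration.
objects = {
    "chair": (0.0, 0.0, 2.0),
    "table": (3.0, 0.0, 6.0),
    "lamp": (0.0, 1.0, 2.0),
}


def make_distance_qa(centroids):
    """Build one distance question/answer pair per unordered object pair."""
    qa_pairs = []
    for (name_a, pos_a), (name_b, pos_b) in itertools.combinations(centroids.items(), 2):
        distance = math.dist(pos_a, pos_b)  # Euclidean distance in meters
        qa_pairs.append({
            "question": f"How far is the {name_a} from the {name_b}?",
            "answer": f"The {name_a} is about {distance:.2f} meters from the {name_b}.",
        })
    return qa_pairs


qa_pairs = make_distance_qa(objects)
for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
```

Question/answer pairs of this kind, paired with the source image, are the sort of examples a spatial-reasoning VQA dataset could contain.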
Use this model to query spatial relationships between objects in a scene.
Run it using the MobileVLM inference code:
# assuming cwd is /path/to/MobileVLM/
from scripts.inference import inference_once

model_path = "/path/to/SpaceLLaVA-lite"
image_file = "/path/to/your-image.jpg"
prompt_str = "For each object in the scene, describe the distance between objects in meters"

# Build a lightweight args object carrying the settings passed to inference_once
args = type('Args', (), {
    "model_path": model_path,
    "image_file": image_file,
    "prompt": prompt_str,
    "conv_mode": "v1",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "load_8bit": False,
    "load_4bit": False,
})()

inference_once(args)
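If you want to ask several spatial questions about the same image, one option is to build the argument objects with a small helper. The sketch below is an assumption-laden illustration, not part of MobileVLM: `build_args` is a hypothetical helper, the prompts are examples, and it assumes `inference_once` only reads attributes from the object it receives, so that a `types.SimpleNamespace` can stand in for the `type('Args', ...)` construction.

```python
from types import SimpleNamespace


def build_args(model_path, image_file, prompt, max_new_tokens=512):
    """Bundle inference settings into an attribute-style object (hypothetical helper)."""
    return SimpleNamespace(
        model_path=model_path,
        image_file=image_file,
        prompt=prompt,
        conv_mode="v1",
        temperature=0,
        top_p=None,
        num_beams=1,
        max_new_tokens=max_new_tokens,
        load_8bit=False,
        load_4bit=False,
    )


# Example prompts; adjust them to the objects in your own image.
prompts = [
    "How far is the chair from the table in meters?",
    "Which object is closest to the camera?",
]
all_args = [
    build_args("/path/to/SpaceLLaVA-lite", "/path/to/your-image.jpg", p)
    for p in prompts
]

# With cwd set to /path/to/MobileVLM/, each query could then be run as:
# from scripts.inference import inference_once
# for args in all_args:
#     inference_once(args)
```

Calling `inference_once` once per prompt may reload the model each time, so this pattern is likely better suited to a handful of queries than to large batches.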
Try it on Discord: http://discord.gg/b2yGuCNpuC
@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@article{chu2023mobilevlm,
  title = {Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices},
  author = {Chu, Xiangxiang and Qiao, Limeng and Lin, Xinyang and Xu, Shuang and Yang, Yang and Hu, Yiming and Wei, Fei and Zhang, Xinyu and Zhang, Bo and Wei, Xiaolin and others},
  journal = {arXiv preprint arXiv:2312.16886},
  year = {2023}
}

@article{chu2024mobilevlm,
  title = {MobileVLM V2: Faster and Stronger Baseline for Vision Language Model},
  author = {Chu, Xiangxiang and Qiao, Limeng and Zhang, Xinyu and Xu, Shuang and Wei, Fei and Yang, Yang and Sun, Xiaofei and Hu, Yiming and Lin, Xinyang and Zhang, Bo and others},
  journal = {arXiv preprint arXiv:2402.03766},
  year = {2024}
}